2024-06-05 09:18:22

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v5 0/3] SCSI: Fix issues between removing device and error handle

2 issues are triggered because devices in removing would be skipped
when calling shost_for_each_device(), these issues are mainly in error
recovery path, which are:

1. statistic info printed at beginning of scsi_error_handler is wrong;
2. device reset is not triggered. drivers like smartpqi only implement
eh_device_reset_handler, if device reset is skipped, the commands
which had been sent to firmware or devices hardware are not cleared.
The error handle would flush all these commands in scsi_unjam_host().
When the commands are finished by hardware, use after free issue is
triggered.
The issue first happened with smartpqi devices, and can be reproduced
with scsi_debug. I did not see any description about SDEV_DEL state
can not perform device, so this is should be addressed.

A new macro shost_for_each_device_include_deleted() is added to address
these issues. The newly added macro would not skip scsi_device which is
in removing when iterate host's scsi_device and is called when statistic
host's error info and trying to reset scsi_device in error recovery path.

V5:
- Rewrite cover letter and add fixes tag to each patch

V4:
- Remove the forth patch which fix IO hang when device removing
becaust the issue is fixed by commit '6df0e077d76bd (scsi: core:
Kick the requeue list after inserting when flushing)'

V3:
- Update patch description
- Update comments of functions added

V2:
- Fix IO hang by run all devices' queue after error handler
- Do not modify shost_for_each_device() directly but add a new
helper to iterate devices but do not skip devices in removing

Wenchao Hao (3):
scsi: core: Add new helper to iterate all devices of host
scsi: scsi_error: Fix wrong statistic when print error info
scsi: scsi_error: Fix device reset is not triggered

drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------
drivers/scsi/scsi_error.c | 4 ++--
include/scsi/scsi_device.h | 25 ++++++++++++++++++---
3 files changed, 56 insertions(+), 19 deletions(-)

--
2.38.1



2024-06-05 09:18:41

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v5 2/3] scsi: scsi_error: Fix wrong statistic when print error info

shost_for_each_device() would skip devices which is in progress of
removing, so commands of these devices would be ignored in
scsi_eh_prt_fail_stats().

Fix this issue by using shost_for_each_device_include_deleted()
to iterate devices in scsi_eh_prt_fail_stats().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi_error.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 612489afe8d2..a61fd8af3b1f 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -409,7 +409,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
int cmd_cancel = 0;
int devices_failed = 0;

- shost_for_each_device(sdev, shost) {
+ shost_for_each_device_include_deleted(sdev, shost) {
list_for_each_entry(scmd, work_q, eh_entry) {
if (scmd->device == sdev) {
++total_failures;
--
2.38.1


2024-06-05 09:42:10

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v5 1/3] scsi: core: Add new helper to iterate all devices of host

shost_for_each_device() would skip devices which is in SDEV_CANCEL or
SDEV_DEL state, for some scenarios, we donot want to skip these devices,
so add a new macro shost_for_each_device_include_deleted() to handle it.

Following changes are introduced:

1. Rework scsi_device_get(), add new helper __scsi_device_get() which
determine if skip deleted scsi_device by parameter "skip_deleted".
2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
is used when calling __scsi_device_get()
3. Update shost_for_each_device() to call __scsi_iterate_devices() with
"skip_deleted" true
4. Add new macro shost_for_each_device_include_deleted() which call
__scsi_iterate_devices() with "skip_deleted" false

Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------
include/scsi/scsi_device.h | 25 ++++++++++++++++++---
2 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 3e0c0381277a..5913de543d93 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
return 0;
}

-/**
- * scsi_device_get - get an additional reference to a scsi_device
+/*
+ * __scsi_device_get - get an additional reference to a scsi_device
* @sdev: device to get a reference to
- *
- * Description: Gets a reference to the scsi_device and increments the use count
- * of the underlying LLDD module. You must hold host_lock of the
- * parent Scsi_Host or already have a reference when calling this.
- *
- * This will fail if a device is deleted or cancelled, or when the LLD module
- * is in the process of being unloaded.
+ * @skip_deleted: when true, would return failed if device is deleted
*/
-int scsi_device_get(struct scsi_device *sdev)
+static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
{
- if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
+ /*
+ * if skip_deleted is true and device is in removing, return failed
+ */
+ if (skip_deleted &&
+ (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
goto fail;
if (!try_module_get(sdev->host->hostt->module))
goto fail;
@@ -761,6 +759,22 @@ int scsi_device_get(struct scsi_device *sdev)
fail:
return -ENXIO;
}
+
+/**
+ * scsi_device_get - get an additional reference to a scsi_device
+ * @sdev: device to get a reference to
+ *
+ * Description: Gets a reference to the scsi_device and increments the use count
+ * of the underlying LLDD module. You must hold host_lock of the
+ * parent Scsi_Host or already have a reference when calling this.
+ *
+ * This will fail if a device is deleted or cancelled, or when the LLD module
+ * is in the process of being unloaded.
+ */
+int scsi_device_get(struct scsi_device *sdev)
+{
+ return __scsi_device_get(sdev, 0);
+}
EXPORT_SYMBOL(scsi_device_get);

/**
@@ -780,9 +794,13 @@ void scsi_device_put(struct scsi_device *sdev)
}
EXPORT_SYMBOL(scsi_device_put);

-/* helper for shost_for_each_device, see that for documentation */
+/**
+ * helper for shost_for_each_device, see that for documentation
+ * @skip_deleted: if true, sdev in progress of removing would be skipped
+ */
struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
- struct scsi_device *prev)
+ struct scsi_device *prev,
+ bool skip_deleted)
{
struct list_head *list = (prev ? &prev->siblings : &shost->__devices);
struct scsi_device *next = NULL;
@@ -792,7 +810,7 @@ struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
while (list->next != &shost->__devices) {
next = list_entry(list->next, struct scsi_device, siblings);
/* skip devices that we can't get a reference to */
- if (!scsi_device_get(next))
+ if (!__scsi_device_get(next, skip_deleted))
break;
next = NULL;
list = list->next;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 9c540f5468eb..5cb294ff0a41 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -412,7 +412,8 @@ extern void __starget_for_each_device(struct scsi_target *, void *,

/* only exposed to implement shost_for_each_device */
extern struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *,
- struct scsi_device *);
+ struct scsi_device *,
+ bool);

/**
* shost_for_each_device - iterate over all devices of a host
@@ -422,11 +423,29 @@ extern struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *,
* Iterator that returns each device attached to @shost. This loop
* takes a reference on each device and releases it at the end. If
* you break out of the loop, you must call scsi_device_put(sdev).
+ *
+ * Note: this macro would skip sdev which is in progress of removing
*/
#define shost_for_each_device(sdev, shost) \
- for ((sdev) = __scsi_iterate_devices((shost), NULL); \
+ for ((sdev) = __scsi_iterate_devices((shost), NULL, 1); \
+ (sdev); \
+ (sdev) = __scsi_iterate_devices((shost), (sdev), 1))
+
+/*
+ * shost_for_each_device_include_deleted- iterate over all devices of a host
+ * @sdev: the &struct scsi_device to use as a cursor
+ * @shost: the &struct scsi_host to iterate over
+ *
+ * Iterator that returns each device attached to @shost. This loop
+ * takes a reference on each device and releases it at the end. If
+ * you break out of the loop, you must call scsi_device_put(sdev).
+ *
+ * Note: this macro would include sdev which is in progress of removing
+ */
+#define shost_for_each_device_include_deleted(sdev, shost) \
+ for ((sdev) = __scsi_iterate_devices((shost), NULL, 0); \
(sdev); \
- (sdev) = __scsi_iterate_devices((shost), (sdev)))
+ (sdev) = __scsi_iterate_devices((shost), (sdev), 0))

/**
* __shost_for_each_device - iterate over all devices of a host (UNLOCKED)
--
2.38.1


2024-06-05 09:43:08

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v5 3/3] scsi: scsi_error: Fix device reset is not triggered

shost_for_each_device() would skip devices which is in progress of
removing, so scsi_try_bus_device_reset() for these devices would be
skipped in scsi_eh_bus_device_reset() with following order:

T1: T2:scsi_error_handle
__scsi_remove_device
scsi_device_set_state(sdev, SDEV_DEL)
// would skip device with SDEV_DEL state
shost_for_each_device()
scsi_try_bus_device_reset
flush all commands
...
releasing and free scsi_device

Some drivers like smartpqi only implement eh_device_reset_handler,
if device reset is skipped, the commands which had been sent to
firmware or devices hardware are not cleared. The error handle
would flush all these commands in scsi_unjam_host().

When the commands are finished by hardware, use after free issue is
triggered.

Fix this issue by using shost_for_each_device_include_deleted()
to iterate devices in scsi_eh_bus_device_reset().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi_error.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index a61fd8af3b1f..ab4a58f92838 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1571,7 +1571,7 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
struct scsi_device *sdev;
enum scsi_disposition rtn;

- shost_for_each_device(sdev, shost) {
+ shost_for_each_device_include_deleted(sdev, shost) {
if (scsi_host_eh_past_deadline(shost)) {
SCSI_LOG_ERROR_RECOVERY(3,
sdev_printk(KERN_INFO, sdev,
--
2.38.1


2024-06-12 08:34:04

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] scsi: core: Add new helper to iterate all devices of host

On 6/5/24 11:17, Wenchao Hao wrote:
> shost_for_each_device() would skip devices which is in SDEV_CANCEL or
> SDEV_DEL state, for some scenarios, we donot want to skip these devices,
> so add a new macro shost_for_each_device_include_deleted() to handle it.
>
> Following changes are introduced:
>
> 1. Rework scsi_device_get(), add new helper __scsi_device_get() which
> determine if skip deleted scsi_device by parameter "skip_deleted".
> 2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
> is used when calling __scsi_device_get()
> 3. Update shost_for_each_device() to call __scsi_iterate_devices() with
> "skip_deleted" true
> 4. Add new macro shost_for_each_device_include_deleted() which call
> __scsi_iterate_devices() with "skip_deleted" false
>
> Signed-off-by: Wenchao Hao <[email protected]>
> ---
> drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------
> include/scsi/scsi_device.h | 25 ++++++++++++++++++---
> 2 files changed, 54 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 3e0c0381277a..5913de543d93 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
> return 0;
> }
>
> -/**
> - * scsi_device_get - get an additional reference to a scsi_device
> +/*
> + * __scsi_device_get - get an additional reference to a scsi_device
> * @sdev: device to get a reference to
> - *
> - * Description: Gets a reference to the scsi_device and increments the use count
> - * of the underlying LLDD module. You must hold host_lock of the
> - * parent Scsi_Host or already have a reference when calling this.
> - *
> - * This will fail if a device is deleted or cancelled, or when the LLD module
> - * is in the process of being unloaded.
> + * @skip_deleted: when true, would return failed if device is deleted
> */
> -int scsi_device_get(struct scsi_device *sdev)
> +static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
> {
> - if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
> + /*
> + * if skip_deleted is true and device is in removing, return failed
> + */
> + if (skip_deleted &&
> + (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
> goto fail;

Nack.
SDEV_DEL means the device is about to be deleted, so we _must not_
access it at all.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-06-12 08:35:11

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] scsi: scsi_error: Fix wrong statistic when print error info

On 6/5/24 11:17, Wenchao Hao wrote:
> shost_for_each_device() would skip devices which is in progress of
> removing, so commands of these devices would be ignored in
> scsi_eh_prt_fail_stats().
>
> Fix this issue by using shost_for_each_device_include_deleted()
> to iterate devices in scsi_eh_prt_fail_stats().
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Wenchao Hao <[email protected]>
> ---
> drivers/scsi/scsi_error.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 612489afe8d2..a61fd8af3b1f 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -409,7 +409,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
> int cmd_cancel = 0;
> int devices_failed = 0;
>
> - shost_for_each_device(sdev, shost) {
> + shost_for_each_device_include_deleted(sdev, shost) {
> list_for_each_entry(scmd, work_q, eh_entry) {
> if (scmd->device == sdev) {
> ++total_failures;

That is wrong. We should rather add a failure counter to the SCSI host,
and have the scsi device increase it every time a failure occurs.
Then we can avoid this loop completely.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-06-12 15:10:33

by Wenchao Hao

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] scsi: core: Add new helper to iterate all devices of host

On 6/12/24 4:33 PM, Hannes Reinecke wrote:
> On 6/5/24 11:17, Wenchao Hao wrote:
>> shost_for_each_device() would skip devices which is in SDEV_CANCEL or
>> SDEV_DEL state, for some scenarios, we donot want to skip these devices,
>> so add a new macro shost_for_each_device_include_deleted() to handle it.
>>
>> Following changes are introduced:
>>
>> 1. Rework scsi_device_get(), add new helper __scsi_device_get() which
>>     determine if skip deleted scsi_device by parameter "skip_deleted".
>> 2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
>>     is used when calling __scsi_device_get()
>> 3. Update shost_for_each_device() to call __scsi_iterate_devices() with
>>     "skip_deleted" true
>> 4. Add new macro shost_for_each_device_include_deleted() which call
>>     __scsi_iterate_devices() with "skip_deleted" false
>>
>> Signed-off-by: Wenchao Hao <[email protected]>
>> ---
>>   drivers/scsi/scsi.c        | 46 ++++++++++++++++++++++++++------------
>>   include/scsi/scsi_device.h | 25 ++++++++++++++++++---
>>   2 files changed, 54 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>> index 3e0c0381277a..5913de543d93 100644
>> --- a/drivers/scsi/scsi.c
>> +++ b/drivers/scsi/scsi.c
>> @@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
>>       return 0;
>>   }
>>   -/**
>> - * scsi_device_get  -  get an additional reference to a scsi_device
>> +/*
>> + * __scsi_device_get  -  get an additional reference to a scsi_device
>>    * @sdev:    device to get a reference to
>> - *
>> - * Description: Gets a reference to the scsi_device and increments the use count
>> - * of the underlying LLDD module.  You must hold host_lock of the
>> - * parent Scsi_Host or already have a reference when calling this.
>> - *
>> - * This will fail if a device is deleted or cancelled, or when the LLD module
>> - * is in the process of being unloaded.
>> + * @skip_deleted: when true, would return failed if device is deleted
>>    */
>> -int scsi_device_get(struct scsi_device *sdev)
>> +static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
>>   {
>> -    if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
>> +    /*
>> +     * if skip_deleted is true and device is in removing, return failed
>> +     */
>> +    if (skip_deleted &&
>> +        (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
>>           goto fail;
>
> Nack.
> SDEV_DEL means the device is about to be deleted, so we _must not_ access it at all.
>

Sorry I added SDEV_DEL here at hand without understanding what it means.
Actually, just include scsi_device which is in SDEV_CANCEL would fix the
issues I described.

The issues are because device removing concurrent with error handle.
Normally, error handle would not be triggered when scsi_device is in
SDEV_DEL. Below is my analysis, if it is wrong, please correct me.

If there are scsi_cmnd remain unfinished when removing scsi_device,
the removing process would waiting for all commands to be finished.
If commands error happened and trigger error handle, the removing
process would be blocked until error handle finished, because
__scsi_remove_device called del_gendisk() which would wait all
requests to be finished. So now scsi_device is in SDEV_CANCEL.

If the scsi_device is already in SDEV_DEL, then no scsi_cmnd has been
dispatched to this scsi_device, then error handle would never triggered.

I want to change the new function __scsi_device_get() as following,
please help to review.

/*
* __scsi_device_get - get an additional reference to a scsi_device
* @sdev: device to get a reference to
* @skip_canceled: when true, would return failed if device is deleted
*/
static int __scsi_device_get(struct scsi_device *sdev, bool skip_canceled)
{
/*
* if skip_canceled is true and device is in removing, return failed
*/
if (sdev->sdev_state == SDEV_DEL ||
(sdev->sdev_state == SDEV_CANCEL && skip_canceled))
goto fail;
if (!try_module_get(sdev->host->hostt->module))
goto fail;
if (!get_device(&sdev->sdev_gendev))
goto fail_put_module;
return 0;

fail_put_module:
module_put(sdev->host->hostt->module);
fail:
return -ENXIO;
}

> Cheers,
>
> Hannes


2024-06-12 15:19:49

by Wenchao Hao

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] scsi: scsi_error: Fix wrong statistic when print error info

On 6/12/24 4:34 PM, Hannes Reinecke wrote:
> On 6/5/24 11:17, Wenchao Hao wrote:
>> shost_for_each_device() would skip devices which is in progress of
>> removing, so commands of these devices would be ignored in
>> scsi_eh_prt_fail_stats().
>>
>> Fix this issue by using shost_for_each_device_include_deleted()
>> to iterate devices in scsi_eh_prt_fail_stats().
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> Signed-off-by: Wenchao Hao <[email protected]>
>> ---
>>   drivers/scsi/scsi_error.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index 612489afe8d2..a61fd8af3b1f 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -409,7 +409,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
>>       int cmd_cancel = 0;
>>       int devices_failed = 0;
>>   -    shost_for_each_device(sdev, shost) {
>> +    shost_for_each_device_include_deleted(sdev, shost) {
>>           list_for_each_entry(scmd, work_q, eh_entry) {
>>               if (scmd->device == sdev) {
>>                   ++total_failures;
>
> That is wrong. We should rather add a failure counter to the SCSI host, and have the scsi device increase it every time a failure occurs.
> Then we can avoid this loop completely.
>

This function would print the total failure commands and the number of device like following:

scsi host4: Total of 3 commands on 2 devices require eh work

Just add a failure counter to the SCSI host can not record the number of devices.

> Cheers,
>
> Hannes


2024-06-13 06:28:01

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] scsi: core: Add new helper to iterate all devices of host

On 6/12/24 17:06, Wenchao Hao wrote:
> On 6/12/24 4:33 PM, Hannes Reinecke wrote:
>> On 6/5/24 11:17, Wenchao Hao wrote:
>>> shost_for_each_device() would skip devices which is in SDEV_CANCEL or
>>> SDEV_DEL state, for some scenarios, we donot want to skip these devices,
>>> so add a new macro shost_for_each_device_include_deleted() to handle it.
>>>
>>> Following changes are introduced:
>>>
>>> 1. Rework scsi_device_get(), add new helper __scsi_device_get() which
>>>     determine if skip deleted scsi_device by parameter "skip_deleted".
>>> 2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
>>>     is used when calling __scsi_device_get()
>>> 3. Update shost_for_each_device() to call __scsi_iterate_devices() with
>>>     "skip_deleted" true
>>> 4. Add new macro shost_for_each_device_include_deleted() which call
>>>     __scsi_iterate_devices() with "skip_deleted" false
>>>
>>> Signed-off-by: Wenchao Hao <[email protected]>
>>> ---
>>>   drivers/scsi/scsi.c        | 46 ++++++++++++++++++++++++++------------
>>>   include/scsi/scsi_device.h | 25 ++++++++++++++++++---
>>>   2 files changed, 54 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>>> index 3e0c0381277a..5913de543d93 100644
>>> --- a/drivers/scsi/scsi.c
>>> +++ b/drivers/scsi/scsi.c
>>> @@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
>>>       return 0;
>>>   }
>>>   -/**
>>> - * scsi_device_get  -  get an additional reference to a scsi_device
>>> +/*
>>> + * __scsi_device_get  -  get an additional reference to a scsi_device
>>>    * @sdev:    device to get a reference to
>>> - *
>>> - * Description: Gets a reference to the scsi_device and increments the use count
>>> - * of the underlying LLDD module.  You must hold host_lock of the
>>> - * parent Scsi_Host or already have a reference when calling this.
>>> - *
>>> - * This will fail if a device is deleted or cancelled, or when the LLD module
>>> - * is in the process of being unloaded.
>>> + * @skip_deleted: when true, would return failed if device is deleted
>>>    */
>>> -int scsi_device_get(struct scsi_device *sdev)
>>> +static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
>>>   {
>>> -    if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
>>> +    /*
>>> +     * if skip_deleted is true and device is in removing, return failed
>>> +     */
>>> +    if (skip_deleted &&
>>> +        (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
>>>           goto fail;
>>
>> Nack.
>> SDEV_DEL means the device is about to be deleted, so we _must not_ access it at all.
>>
>
> Sorry I added SDEV_DEL here at hand without understanding what it means.
> Actually, just include scsi_device which is in SDEV_CANCEL would fix the
> issues I described.
>
> The issues are because device removing concurrent with error handle.
> Normally, error handle would not be triggered when scsi_device is in
> SDEV_DEL. Below is my analysis, if it is wrong, please correct me.
>
> If there are scsi_cmnd remain unfinished when removing scsi_device,
> the removing process would waiting for all commands to be finished.
> If commands error happened and trigger error handle, the removing
> process would be blocked until error handle finished, because
> __scsi_remove_device called del_gendisk() which would wait all
> requests to be finished. So now scsi_device is in SDEV_CANCEL.
>
> If the scsi_device is already in SDEV_DEL, then no scsi_cmnd has been
> dispatched to this scsi_device, then error handle would never triggered.
>
> I want to change the new function __scsi_device_get() as following,
> please help to review.
>
> /*
> * __scsi_device_get - get an additional reference to a scsi_device
> * @sdev: device to get a reference to
> * @skip_canceled: when true, would return failed if device is deleted
> */
> static int __scsi_device_get(struct scsi_device *sdev, bool skip_canceled)
> {
> /*
> * if skip_canceled is true and device is in removing, return failed
> */
> if (sdev->sdev_state == SDEV_DEL ||
> (sdev->sdev_state == SDEV_CANCEL && skip_canceled))
> goto fail;
> if (!try_module_get(sdev->host->hostt->module))
> goto fail;
> if (!get_device(&sdev->sdev_gendev))
> goto fail_put_module;
> return 0;
>
> fail_put_module:
> module_put(sdev->host->hostt->module);
> fail:
> return -ENXIO;
> }
>
I don't think that's required.
With your above analysis, wouldn't the problem be solved with:

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 775df00021e4..911fcfa80d69 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1470,6 +1470,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
if (sdev->sdev_state == SDEV_DEL)
return;

+ scsi_block_when_processing_errors(sdev);
+
if (sdev->is_visible) {
/*
* If scsi_internal_target_block() is running concurrently,

Hmm?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


2024-06-13 07:13:02

by Wenchao Hao

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] scsi: core: Add new helper to iterate all devices of host

On 2024/6/13 14:27, Hannes Reinecke wrote:
> On 6/12/24 17:06, Wenchao Hao wrote:
>> On 6/12/24 4:33 PM, Hannes Reinecke wrote:
>>> On 6/5/24 11:17, Wenchao Hao wrote:
>>>> shost_for_each_device() would skip devices which is in SDEV_CANCEL or
>>>> SDEV_DEL state, for some scenarios, we donot want to skip these
>>>> devices,
>>>> so add a new macro shost_for_each_device_include_deleted() to handle
>>>> it.
>>>>
>>>> Following changes are introduced:
>>>>
>>>> 1. Rework scsi_device_get(), add new helper __scsi_device_get() which
>>>>      determine if skip deleted scsi_device by parameter "skip_deleted".
>>>> 2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
>>>>      is used when calling __scsi_device_get()
>>>> 3. Update shost_for_each_device() to call __scsi_iterate_devices() with
>>>>      "skip_deleted" true
>>>> 4. Add new macro shost_for_each_device_include_deleted() which call
>>>>      __scsi_iterate_devices() with "skip_deleted" false
>>>>
>>>> Signed-off-by: Wenchao Hao <[email protected]>
>>>> ---
>>>>    drivers/scsi/scsi.c        | 46
>>>> ++++++++++++++++++++++++++------------
>>>>    include/scsi/scsi_device.h | 25 ++++++++++++++++++---
>>>>    2 files changed, 54 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>>>> index 3e0c0381277a..5913de543d93 100644
>>>> --- a/drivers/scsi/scsi.c
>>>> +++ b/drivers/scsi/scsi.c
>>>> @@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev,
>>>> bool enable)
>>>>        return 0;
>>>>    }
>>>>    -/**
>>>> - * scsi_device_get  -  get an additional reference to a scsi_device
>>>> +/*
>>>> + * __scsi_device_get  -  get an additional reference to a scsi_device
>>>>     * @sdev:    device to get a reference to
>>>> - *
>>>> - * Description: Gets a reference to the scsi_device and increments
>>>> the use count
>>>> - * of the underlying LLDD module.  You must hold host_lock of the
>>>> - * parent Scsi_Host or already have a reference when calling this.
>>>> - *
>>>> - * This will fail if a device is deleted or cancelled, or when the
>>>> LLD module
>>>> - * is in the process of being unloaded.
>>>> + * @skip_deleted: when true, would return failed if device is deleted
>>>>     */
>>>> -int scsi_device_get(struct scsi_device *sdev)
>>>> +static int __scsi_device_get(struct scsi_device *sdev, bool
>>>> skip_deleted)
>>>>    {
>>>> -    if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state ==
>>>> SDEV_CANCEL)
>>>> +    /*
>>>> +     * if skip_deleted is true and device is in removing, return
>>>> failed
>>>> +     */
>>>> +    if (skip_deleted &&
>>>> +        (sdev->sdev_state == SDEV_DEL || sdev->sdev_state ==
>>>> SDEV_CANCEL))
>>>>            goto fail;
>>>
>>> Nack.
>>> SDEV_DEL means the device is about to be deleted, so we _must not_
>>> access it at all.
>>>
>>
>> Sorry I added SDEV_DEL here at hand without understanding what it means.
>> Actually, just include scsi_device which is in SDEV_CANCEL would fix the
>> issues I described.
>>
>> The issues are because device removing concurrent with error handle.
>> Normally, error handle would not be triggered when scsi_device is in
>> SDEV_DEL. Below is my analysis, if it is wrong, please correct me.
>>
>> If there are scsi_cmnd remain unfinished when removing scsi_device,
>> the removing process would waiting for all commands to be finished.
>> If commands error happened and trigger error handle, the removing
>> process would be blocked until error handle finished, because
>> __scsi_remove_device called  del_gendisk() which would wait all
>> requests to be finished. So now scsi_device is in SDEV_CANCEL.
>>
>> If the scsi_device is already in SDEV_DEL, then no scsi_cmnd has been
>> dispatched to this scsi_device, then error handle would never triggered.
>>
>> I want to change the new function __scsi_device_get() as following,
>> please help to review.
>>
>> /*
>>   * __scsi_device_get  -  get an additional reference to a scsi_device
>>   * @sdev:    device to get a reference to
>>   * @skip_canceled: when true, would return failed if device is deleted
>>   */
>> static int __scsi_device_get(struct scsi_device *sdev, bool
>> skip_canceled)
>> {
>>     /*
>>      * if skip_canceled is true and device is in removing, return failed
>>      */
>>     if (sdev->sdev_state == SDEV_DEL ||
>>         (sdev->sdev_state == SDEV_CANCEL && skip_canceled))
>>         goto fail;
>>     if (!try_module_get(sdev->host->hostt->module))
>>         goto fail;
>>     if (!get_device(&sdev->sdev_gendev))
>>         goto fail_put_module;
>>     return 0;
>>
>> fail_put_module:
>>     module_put(sdev->host->hostt->module);
>> fail:
>>     return -ENXIO;
>> }
>>
> I don't think that's required.
> With your above analysis, wouldn't the problem be solved with:
>
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 775df00021e4..911fcfa80d69 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1470,6 +1470,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
>         if (sdev->sdev_state == SDEV_DEL)
>                 return;
>
> +       scsi_block_when_processing_errors(sdev);
> +
>         if (sdev->is_visible) {
>                 /*
>                  * If scsi_internal_target_block() is running
> concurrently,
>
> Hmm?
>

We can not make sure no scsi_cmnd remain unfinished after
scsi_block_when_processing_errors(). For example, there is a
command has been dispatched but it's not timeouted when removing
device, no error happened. After scsi_device is set to SDEV_CANCEL,
the removing process would be blocked by del_gendisk() because there
is still a request.

Then the request timeout and abort failed, error handle would be
triggered, now scsi_device is SDEV_CANCEL.

The error handle would skip this scsi_device when doing device reset.

> Cheers,
>
> Hannes