2022-03-29 10:19:26

by Wenchao Hao

[permalink] [raw]
Subject: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

SCSI timeout would call scsi_eh_scmd_add() on some conditions, host
would be set
to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted
to all
devices in this host would not succeed until the scsi_error_handler()
finished.
The scsi_error_handler() might takes long time to be done, it's
unbearable when
host has massive devices.

I want to ask is anyone applying another error handler flow to address this
phenomenon?

I think we can move some operations(like scsi get sense, scsi send startunit
and scsi device reset) out of scsi_unjam_host(), to perform these operations
without setting host to SHOST_RECOVERY? It would reduce the time of
block the
whole host.

Waiting for your discussion.


2022-03-29 15:12:41

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 2022/3/29 18:56, Steffen Maier wrote:
> On 3/29/22 11:06, Wenchao Hao wrote:
>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>> devices in this host would not succeed until the scsi_error_handler() finished.
>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>> host has massive devices.
>>
>> I want to ask is anyone applying another error handler flow to address this
>> phenomenon?
>>
>> I think we can move some operations(like scsi get sense, scsi send startunit
>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>> whole host.
>>
>> Waiting for your discussion.
>
> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>

Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.

> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>
>

The deadline seems not helpful. What we want to see is a single LUN's command error
would not stop other LUNs which share the same host. So my plan is to move reset LUN out
from scsi_unjam_host() which run with host set to SHOST_RECOVERY.

2022-03-30 12:29:39

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 3/29/22 14:40, Wenchao Hao wrote:
> On 2022/3/29 18:56, Steffen Maier wrote:
>> On 3/29/22 11:06, Wenchao Hao wrote:
>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host
>>> would be set
>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs
>>> submitted to all
>>> devices in this host would not succeed until the scsi_error_handler()
>>> finished.
>>> The scsi_error_handler() might takes long time to be done, it's
>>> unbearable when
>>> host has massive devices.
>>>
>>> I want to ask is anyone applying another error handler flow to
>>> address this
>>> phenomenon?
>>>
>>> I think we can move some operations(like scsi get sense, scsi send
>>> startunit
>>> and scsi device reset) out of scsi_unjam_host(), to perform these
>>> operations
>>> without setting host to SHOST_RECOVERY? It would reduce the time of
>>> block the
>>> whole host.
>>>
>>> Waiting for your discussion.
>>
>> We already have "async" aborts before even entering scsi_eh. So your
>> use case seems to imply that those aborts fail and we enter scsi_eh?
>>
>
> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is
> called.
>
>> There's eh_deadline for limiting the time spent in escalation of
>> scsi_eh, and instead directly go to host reset. Would this help?
>>
>>
>
> The deadline seems not helpful. What we want to see is a single LUN's
> command error
> would not stop other LUNs which share the same host. So my plan is to
> move reset LUN out
> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.

Nope. One of the key points of scsi_unjam_host() is that is has to stop
all I/O before proceeding. Without doing so basically all SCSI parallel
HBAs will fail EH as they _require_ I/O to be stopped.

And even on modern HBAs we have the challenge that 99% of every EH
invocation is triggered by command timeouts, where 'LUN reset' is only
of limited usability.

Cheers,

Hannes

2022-03-30 13:08:38

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 2022/3/30 2:56, Hannes Reinecke wrote:
> On 3/29/22 14:40, Wenchao Hao wrote:
>> On 2022/3/29 18:56, Steffen Maier wrote:
>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>> host has massive devices.
>>>>
>>>> I want to ask is anyone applying another error handler flow to address this
>>>> phenomenon?
>>>>
>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>> whole host.
>>>>
>>>> Waiting for your discussion.
>>>
>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>
>>
>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>
>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>
>>>
>>
>> The deadline seems not helpful. What we want to see is a single LUN's command error
>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>
> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>

I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
is because we might reset host during scsi_error_handler() and we must wait host's number of
failed command equal to number of busy command then we can wake up scsi_error_handler().

If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
we can then call in the large scale error handle.

Here is a brief flow:

abort command
||
|| failed
||
\/
stop single LUN's I/O (need to wait LUN's failed command number equal to busy command number)
||
|| failed (according to our statistic, 90% reset LUN would succeed)
||
\/
stop all LUN's I/O (stage to call in origin error handler)
||
||
||
\/
wait host's failed command number equal to busy command number
||
||
||
\/
perform target reset, bus reset and host reset and so on. (do not need to reset LUN since it already failed)

As mentioned, we can reduce the time of setting host to SHOST_RECOVERY which would stop all I/Os.


How do you think? I look forward to hearing from you.

2022-03-30 15:08:30

by Steffen Maier

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 3/29/22 11:06, Wenchao Hao wrote:
> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
> devices in this host would not succeed until the scsi_error_handler() finished.
> The scsi_error_handler() might takes long time to be done, it's unbearable when
> host has massive devices.
>
> I want to ask is anyone applying another error handler flow to address this
> phenomenon?
>
> I think we can move some operations(like scsi get sense, scsi send startunit
> and scsi device reset) out of scsi_unjam_host(), to perform these operations
> without setting host to SHOST_RECOVERY? It would reduce the time of block the
> whole host.
>
> Waiting for your discussion.

We already have "async" aborts before even entering scsi_eh. So your use case
seems to imply that those aborts fail and we enter scsi_eh?

There's eh_deadline for limiting the time spent in escalation of scsi_eh, and
instead directly go to host reset. Would this help?


--
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschaeftsfuehrung: David Faller
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

2022-03-31 03:22:27

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 2022/3/30 17:32, Hannes Reinecke wrote:
> On 3/30/22 11:11, Wenchao Hao wrote:
>> On 2022/3/30 2:56, Hannes Reinecke wrote:
>>> On 3/29/22 14:40, Wenchao Hao wrote:
>>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>>> host has massive devices.
>>>>>>
>>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>>> phenomenon?
>>>>>>
>>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>>> whole host.
>>>>>>
>>>>>> Waiting for your discussion.
>>>>>
>>>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>>
>>>>
>>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>>
>>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>>
>>>>>
>>>>
>>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>>
>>> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>>
>>
>> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
>> is because we might reset host during scsi_error_handler() and we must wait host's number of
>> failed command equal to number of busy command then we can wake up scsi_error_handler().
>>
>> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
>> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
>> we can then call in the large scale error handle.
>>
> I know the EH flow.
>
> Problem here is the way parallel SCSI operates. Remember, parallel SCSI is a _bus_, and there can be only one command at a time on the bus.
> So if one command on the bus misfires and you have to start EH you have to stop all I/O on the bus to ensure that your EH command is the only one active on the bus.
>

Thank you for you explanation, it's clear to me now.

> For modern HBAs we sure can device other ways and means of error recovery, but I can't really see how we would do that on legacy HBAs.
>

How about define a new return value of scsi_host_template's eh_timed_out callback which indicate this timeout
is totally handled by LLDs. Like following:

--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -359,6 +359,8 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
}
+ } else if (rtn == EH_HANDLED_BY_DRIVERS) {
+ return BLK_EH_DONE;
}

Or scsi_host_template's eh_timed_out should not do this, we can define another callback?
In the LLDs's timeout handler callback, apply single LUN reset first flow as previous mail metioned.

Anyway, what we need is a way to reduce the time of setting host to SHOST_RECOVERY.

>> Here is a brief flow:
>>
>> abort command
>>     ||
>>     || failed
>>     ||
>>     \/
>> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command  number)
>>     ||
>>     || failed  (according to our statistic, 90% reset LUN would succeed)
>>     ||
>>     \/
>
> Interesting. This does not match up with my experience, where 99% of the errors were due to a command timeout.
>
> So which errors do you see here? What are the causes?

These error statistic are from our consumers' environment,they told me about 90% timeout triggered errors can be
handled by reset LUN.

Thanks.

2022-03-31 03:43:34

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 3/30/22 11:11, Wenchao Hao wrote:
> On 2022/3/30 2:56, Hannes Reinecke wrote:
>> On 3/29/22 14:40, Wenchao Hao wrote:
>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>> host has massive devices.
>>>>>
>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>> phenomenon?
>>>>>
>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>> whole host.
>>>>>
>>>>> Waiting for your discussion.
>>>>
>>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>
>>>
>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>
>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>
>>>>
>>>
>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>
>> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>
>
> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
> is because we might reset host during scsi_error_handler() and we must wait host's number of
> failed command equal to number of busy command then we can wake up scsi_error_handler().
>
> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
> we can then call in the large scale error handle.
>
I know the EH flow.

Problem here is the way parallel SCSI operates. Remember, parallel SCSI
is a _bus_, and there can be only one command at a time on the bus.
So if one command on the bus misfires and you have to start EH you have
to stop all I/O on the bus to ensure that your EH command is the only
one active on the bus.

For modern HBAs we sure can device other ways and means of error
recovery, but I can't really see how we would do that on legacy HBAs.

> Here is a brief flow:
>
> abort command
> ||
> || failed
> ||
> \/
> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command number)
> ||
> || failed (according to our statistic, 90% reset LUN would succeed)
> ||
> \/

Interesting. This does not match up with my experience, where 99% of the
errors were due to a command timeout.

So which errors do you see here? What are the causes?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2022-04-04 23:57:22

by Mike Christie

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 3/30/22 5:59 AM, Wenchao Hao wrote:
> On 2022/3/30 17:32, Hannes Reinecke wrote:
>> On 3/30/22 11:11, Wenchao Hao wrote:
>>> On 2022/3/30 2:56, Hannes Reinecke wrote:
>>>> On 3/29/22 14:40, Wenchao Hao wrote:
>>>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>>>> host has massive devices.
>>>>>>>
>>>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>>>> phenomenon?
>>>>>>>
>>>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>>>> whole host.
>>>>>>>
>>>>>>> Waiting for your discussion.
>>>>>>
>>>>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>>>
>>>>>
>>>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>>>
>>>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>>>
>>>>>>
>>>>>
>>>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>>>
>>>> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>>>
>>>
>>> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
>>> is because we might reset host during scsi_error_handler() and we must wait host's number of
>>> failed command equal to number of busy command then we can wake up scsi_error_handler().
>>>
>>> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
>>> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
>>> we can then call in the large scale error handle.
>>>
>> I know the EH flow.
>>
>> Problem here is the way parallel SCSI operates. Remember, parallel SCSI is a _bus_, and there can be only one command at a time on the bus.
>> So if one command on the bus misfires and you have to start EH you have to stop all I/O on the bus to ensure that your EH command is the only one active on the bus.
>>
>
> Thank you for you explanation, it's clear to me now.
>
>> For modern HBAs we sure can device other ways and means of error recovery, but I can't really see how we would do that on legacy HBAs.
>>
>
> How about define a new return value of scsi_host_template's eh_timed_out callback which indicate this timeout
> is totally handled by LLDs. Like following:
>
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -359,6 +359,8 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
> set_host_byte(scmd, DID_TIME_OUT);
> scsi_eh_scmd_add(scmd);
> }
> + } else if (rtn == EH_HANDLED_BY_DRIVERS) {
> + return BLK_EH_DONE;
> }
>
> Or scsi_host_template's eh_timed_out should not do this, we can define another callback?
> In the LLDs's timeout handler callback, apply single LUN reset first flow as previous mail metioned.
>

You probably want to add a scsi_host_template field or new callouts
because some drivers only preallocate enough resources for one TMF
at a time so we have to do it on a driver by driver basis.

For driver's setting/implementing it then to escalate you can do something
like:

1. Block the queue for just the LU with scsi_internal_device_block.
2. You can then call the new callout or old one if we just added some
new field.
3. If device/lun reset fails, then block the target. Then you can do
the host.

We could share code with scsi_ioctl_reset as well. Drivers that support
TMFs via that ioctl already expect queuecommand to be possibly in the
middle of a run and IO not yet timed out. For example, the code to
block a queue and reset the device could be used for the new EH and
SG_SCSI_RESET_DEVICE handling.



> Anyway, what we need is a way to reduce the time of setting host to SHOST_RECOVERY.
>
>>> Here is a brief flow:
>>>
>>> abort command
>>>     ||
>>>     || failed
>>>     ||
>>>     \/
>>> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command  number)
>>>     ||
>>>     || failed  (according to our statistic, 90% reset LUN would succeed)
>>>     ||
>>>     \/
>>
>> Interesting. This does not match up with my experience, where 99% of the errors were due to a command timeout.
>>
>> So which errors do you see here? What are the causes?
>
> These error statistic are from our consumers' environment,they told me about 90% timeout triggered errors can be
> handled by reset LUN.
>

Is this with specific drivers, transport or targets?

2022-04-05 00:41:07

by Mike Christie

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 4/3/22 12:14 PM, Mike Christie wrote:
> We could share code with scsi_ioctl_reset as well. Drivers that support
> TMFs via that ioctl already expect queuecommand to be possibly in the
> middle of a run and IO not yet timed out. For example, the code to
> block a queue and reset the device could be used for the new EH and
> SG_SCSI_RESET_DEVICE handling.
>

Hannes or others,

How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
supported and more only used for controlled testing?

2022-04-05 01:55:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On Mon, Apr 04, 2022 at 07:28:00AM +0200, Hannes Reinecke wrote:
> But really, I'd rather get my EH rework in before we're start discussing
> modifying EH behaviour.

Full agreed. We need to sort that mess out first.

2022-04-05 02:53:54

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 4/3/22 19:17, Mike Christie wrote:
> On 4/3/22 12:14 PM, Mike Christie wrote:
>> We could share code with scsi_ioctl_reset as well. Drivers that support
>> TMFs via that ioctl already expect queuecommand to be possibly in the
>> middle of a run and IO not yet timed out. For example, the code to
>> block a queue and reset the device could be used for the new EH and
>> SG_SCSI_RESET_DEVICE handling.
>>
>
> Hannes or others,
>
> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
> supported and more only used for controlled testing?

That's actually a problem in scsi_ioctl_reset(); it really should wait
for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls
into the various reset functions.

But really, I'd rather get my EH rework in before we're start discussing
modifying EH behaviour.
Let me repost it ...

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-04-06 15:17:03

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 2022/4/4 13:28, Hannes Reinecke wrote:
> On 4/3/22 19:17, Mike Christie wrote:
>> On 4/3/22 12:14 PM, Mike Christie wrote:
>>> We could share code with scsi_ioctl_reset as well. Drivers that support
>>> TMFs via that ioctl already expect queuecommand to be possibly in the
>>> middle of a run and IO not yet timed out. For example, the code to
>>> block a queue and reset the device could be used for the new EH and
>>> SG_SCSI_RESET_DEVICE handling.
>>>
>>
>> Hannes or others,
>>
>> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
>> supported and more only used for controlled testing?
>
> That's actually a problem in scsi_ioctl_reset(); it really should wait for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls into the various reset functions.
>
> But really, I'd rather get my EH rework in before we're start discussing modifying EH behaviour.
> Let me repost it ...
>

Would you take fast EH(such as single LUN reset) into consideration, maybe
a second but lightweight EH? It means a lot.

Or give a way drivers can branch out the general timeout and EH handle logic?

2022-04-06 15:19:18

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 2022/4/4 1:14, Mike Christie wrote:
> On 3/30/22 5:59 AM, Wenchao Hao wrote:
>> On 2022/3/30 17:32, Hannes Reinecke wrote:
>>> On 3/30/22 11:11, Wenchao Hao wrote:
>>>> On 2022/3/30 2:56, Hannes Reinecke wrote:
>>>>> On 3/29/22 14:40, Wenchao Hao wrote:
>>>>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>>>>> host has massive devices.
>>>>>>>>
>>>>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>>>>> phenomenon?
>>>>>>>>
>>>>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>>>>> whole host.
>>>>>>>>
>>>>>>>> Waiting for your discussion.
>>>>>>>
>>>>>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>>>>
>>>>>>
>>>>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>>>>
>>>>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>>>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>>>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>>>>
>>>>> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>>>>
>>>>
>>>> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
>>>> is because we might reset host during scsi_error_handler() and we must wait host's number of
>>>> failed command equal to number of busy command then we can wake up scsi_error_handler().
>>>>
>>>> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
>>>> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
>>>> we can then call in the large scale error handle.
>>>>
>>> I know the EH flow.
>>>
>>> Problem here is the way parallel SCSI operates. Remember, parallel SCSI is a _bus_, and there can be only one command at a time on the bus.
>>> So if one command on the bus misfires and you have to start EH you have to stop all I/O on the bus to ensure that your EH command is the only one active on the bus.
>>>
>>
>> Thank you for you explanation, it's clear to me now.
>>
>>> For modern HBAs we sure can device other ways and means of error recovery, but I can't really see how we would do that on legacy HBAs.
>>>
>>
>> How about define a new return value of scsi_host_template's eh_timed_out callback which indicate this timeout
>> is totally handled by LLDs. Like following:
>>
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -359,6 +359,8 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>> set_host_byte(scmd, DID_TIME_OUT);
>> scsi_eh_scmd_add(scmd);
>> }
>> + } else if (rtn == EH_HANDLED_BY_DRIVERS) {
>> + return BLK_EH_DONE;
>> }
>>
>> Or scsi_host_template's eh_timed_out should not do this, we can define another callback?
>> In the LLDs's timeout handler callback, apply single LUN reset first flow as previous mail metioned.
>>
>
> You probably want to add a scsi_host_template field or new callouts
> because some drivers only preallocate enough resources for one TMF
> at a time so we have to do it on a driver by driver basis.
>
> For driver's setting/implementing it then to escalate you can do something
> like:
>
> 1. Block the queue for just the LU with scsi_internal_device_block.
> 2. You can then call the new callout or old one if we just added some
> new field.
> 3. If device/lun reset fails, then block the target. Then you can do
> the host.
>

Thanks a lot for your guide. I want to wait hannes's change.

> We could share code with scsi_ioctl_reset as well. Drivers that support
> TMFs via that ioctl already expect queuecommand to be possibly in the
> middle of a run and IO not yet timed out. For example, the code to
> block a queue and reset the device could be used for the new EH and
> SG_SCSI_RESET_DEVICE handling.
>
>
>
>> Anyway, what we need is a way to reduce the time of setting host to SHOST_RECOVERY.
>>
>>>> Here is a brief flow:
>>>>
>>>> abort command
>>>>     ||
>>>>     || failed
>>>>     ||
>>>>     \/
>>>> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command  number)
>>>>     ||
>>>>     || failed  (according to our statistic, 90% reset LUN would succeed)
>>>>     ||
>>>>     \/
>>>
>>> Interesting. This does not match up with my experience, where 99% of the errors were due to a command timeout.
>>>
>>> So which errors do you see here? What are the causes?
>>
>> These error statistic are from our consumers' environment,they told me about 90% timeout triggered errors can be
>> handled by reset LUN.
>>
>
> Is this with specific drivers, transport or targets?
> .
>

Yes, it's based on specific hardware. Details about the statistic seems secret which can not send out.

2022-04-06 15:31:05

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 2022/4/4 13:28, Hannes Reinecke wrote:
> On 4/3/22 19:17, Mike Christie wrote:
>> On 4/3/22 12:14 PM, Mike Christie wrote:
>>> We could share code with scsi_ioctl_reset as well. Drivers that support
>>> TMFs via that ioctl already expect queuecommand to be possibly in the
>>> middle of a run and IO not yet timed out. For example, the code to
>>> block a queue and reset the device could be used for the new EH and
>>> SG_SCSI_RESET_DEVICE handling.
>>>
>>
>> Hannes or others,
>>
>> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
>> supported and more only used for controlled testing?
>
> That's actually a problem in scsi_ioctl_reset(); it really should wait for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls into the various reset functions.
>
> But really, I'd rather get my EH rework in before we're start discussing modifying EH behaviour.
> Let me repost it ...
>
> Cheers,
>
> Hannes

Hi hannes:

According to the statistic, following scenario would cause an abort failed can be handled by LUN reset:
1. The task execute of disk's FW is abnormal;
2. Intermittent bit errors or intermittent disconnection;
3. FW do not response IO;

Following scenario can not be handled by LUN reset:
1. Disk HW issue, LUN reset can not be handled;
2. DDR UNC in disk, can not fix, the only way is to power off then power on
3. FW of disk is out of service, can not fix, the only way is to power off then power on

And the statistic shows most command abort failed can be handled by LUN reset.

So we plan to design a lightweight timeout handle flow as following:

if disable lightweight EH(default)
scsi_times_out ====================================> origin EH flow
||
|| if enable lightweight EH
||
\/
do not using current timeout flow, and branch to another flow which perform following steps:

abort command
||
|| failed
||
\/
stop single LUN's I/O (need to wait LUN's failed command number equal to busy command number)
||
|| failed (according to our statistic, 90% reset LUN would succeed)
||
\/
reset single LUN
||
|| if host with multi LUNs timeout
|| failed =====================================> perform Host reset
|| ||
|| || failed
|| ||
|| <=================================================//
||
\/
offline disk

Since it's a lightweight EH, we prefer offline disk once reset LUN failed.
These changes would not affect origin EH flow. The advantage of this design is it would not affect
other LUNs of same host.

2023-10-12 14:51:19

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On 4/6/22 11:40, Wenchao Hao wrote:
> On 2022/4/4 13:28, Hannes Reinecke wrote:
>> On 4/3/22 19:17, Mike Christie wrote:
>>> On 4/3/22 12:14 PM, Mike Christie wrote:
>>>> We could share code with scsi_ioctl_reset as well. Drivers that support
>>>> TMFs via that ioctl already expect queuecommand to be possibly in the
>>>> middle of a run and IO not yet timed out. For example, the code to
>>>> block a queue and reset the device could be used for the new EH and
>>>> SG_SCSI_RESET_DEVICE handling.
>>>>
>>>
>>> Hannes or others,
>>>
>>> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
>>> supported and more only used for controlled testing?
>>
>> That's actually a problem in scsi_ioctl_reset(); it really should wait
>> for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls
>> into the various reset functions.
>>
>> But really, I'd rather get my EH rework in before we're start discussing
>> modifying EH behaviour.
>> Let me repost it ...
>>
>
> Would you take fast EH(such as single LUN reset) into consideration, maybe
> a second but lightweight EH? It means a lot.
>
> Or give a way drivers can branch out the general timeout and EH handle logic?

(Re-reading the thread:)

If it's just about device reset I guess we can implement an asynchronous
version. Based on my EH rework we could / should do:

Have a 'eh_cmd_q' list per 'struct scsi_device' and 'struct
scsi_target'. So Instead of always moving a failed command to the
'eh_cmq_q' list of the host, move it onto the list of the next higher
level (eg a failed abort would move it to the eh_cmq_q of 'struct
scsi_device', a failed device reset would move it to the eh_cmq_q of
'struct scsi_target' etc).
That would actually make the code in SCSI EH easier to read as we
could do away with constantly moving and splitting the per-host
eh_cmq_q list.

And then, as a second step, implement a new eh callback for
asynchronous SCSI device aborts. That callback would need to
stop I/O to the device first, send the TMF, and either
restart the device upon successful completion or splice
the list of failed commands onto the target and call
the normal escalation with skipping eh_device_reset().

Hmm?

Cheers,

Hannes

2023-10-12 16:10:21

by Wenchao Hao

[permalink] [raw]
Subject: Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

On Thu, Oct 12, 2023 at 10:51 PM Hannes Reinecke <[email protected]> wrote:
>
> On 4/6/22 11:40, Wenchao Hao wrote:
> > On 2022/4/4 13:28, Hannes Reinecke wrote:
> >> On 4/3/22 19:17, Mike Christie wrote:
> >>> On 4/3/22 12:14 PM, Mike Christie wrote:
> >>>> We could share code with scsi_ioctl_reset as well. Drivers that support
> >>>> TMFs via that ioctl already expect queuecommand to be possibly in the
> >>>> middle of a run and IO not yet timed out. For example, the code to
> >>>> block a queue and reset the device could be used for the new EH and
> >>>> SG_SCSI_RESET_DEVICE handling.
> >>>>
> >>>
> >>> Hannes or others,
> >>>
> >>> How do parallel SCSI drivers support scsi_ioctl_reset? Is is not fully
> >>> supported and more only used for controlled testing?
> >>
> >> That's actually a problem in scsi_ioctl_reset(); it really should wait
> >> for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls
> >> into the various reset functions.
> >>
> >> But really, I'd rather get my EH rework in before we're start discussing
> >> modifying EH behaviour.
> >> Let me repost it ...
> >>
> >
> > Would you take fast EH(such as single LUN reset) into consideration, maybe
> > a second but lightweight EH? It means a lot.
> >
> > Or give a way drivers can branch out the general timeout and EH handle logic?
>
> (Re-reading the thread:)
>
> If it's just about device reset I guess we can implement an asynchronous
> version. Based on my EH rework we could / should do:
>
> Have a 'eh_cmd_q' list per 'struct scsi_device' and 'struct
> scsi_target'. So Instead of always moving a failed command to the
> 'eh_cmq_q' list of the host, move it onto the list of the next higher
> level (eg a failed abort would move it to the eh_cmq_q of 'struct
> scsi_device', a failed device reset would move it to the eh_cmq_q of
> 'struct scsi_target' etc).
> That would actually make the code in SCSI EH easier to read as we
> could do away with constantly moving and splitting the per-host
> eh_cmq_q list.
>
> And then, as a second step, implement a new eh callback for
> asynchronous SCSI device aborts. That callback would need to
> stop I/O to the device first, send the TMF, and either
> restart the device upon successful completion or splice
> the list of failed commands onto the target and call
> the normal escalation with skipping eh_device_reset().
>

Yes, the RFC patch I sent before is based on this idea, and the details
of implementation may be different from what you described:

Here are how I did:

Three key operations are abstracted for scsi_device and scsi_target:

- mark device/target recovery: called in command dispatch path to stop I/O
- adding error command: called after abort failed to add error command
to error list
- waking up error handling : called in scsi_device_unbusy() to wake up
error handling work

Add struct scsi_device_eh and scsi_target_eh that encapsulate 3
callbacks for the above three key operations above, and invokes these 3
callbacks in the process mentioned above.

For details, please refer to the patch I posted before:
https://lore.kernel.org/linux-scsi/[email protected]/

The following two patches implement each of the three previously defined
callback functions, using kernel work to implement asynchrony handle.
https://lore.kernel.org/linux-scsi/[email protected]/
https://lore.kernel.org/linux-scsi/[email protected]/

For example, define following struct for error handling of scsi_device:

struct scsi_lun_eh {
spinlock_t eh_lock;
unsigned int eh_num;
struct list_head eh_cmd_q;
struct scsi_device *sdev;
struct work_struct eh_handle_work;
unsigned int fallback:1; /* If fallback to further */
/* recovery on failure */
};

The processing logic after awakening is as follows:

sdev_eh_work()
{
try device reset
if device reset succeed(including TUR)
mark error command of this device handled
else if fallback flag is false
finish commands and mark this device offline
fallback to target/host recovery
}

A flag fallback is defined here to determine whether to continue the advanced
reset after the device reset failed, because some drivers actually only
define the callback of the device reset, it is meaningless to continue
advanced reset for such drivers.

Note: the version I posted has bugs in adding commands which would be fixed
in the next version.

Looking for your response, thanks.

> Hmm?
>
> Cheers,
>
> Hannes
>