LinuxLists.cc - [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

2022-12-16 10:46:38

Subject: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

If an ATA device has fallen off, call sas_ata_device_link_abort() directly to
mark all outstanding QCs as failed and kick off EH immediately. This avoids
having to wait for block layer timeouts.

Signed-off-by: Xingui Yang <[email protected]>
---
Changes to v1:
- Use dev_is_sata() to check ATA device type
drivers/scsi/libsas/sas_discover.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index d5bc1314c341..a12b65eb4a2a 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port *port)

 void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
 {
+	if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
+		sas_ata_device_link_abort(dev, false);
+
 	if (!test_bit(SAS_DEV_DESTROY, &dev->state) &&
 	    !list_empty(&dev->disco_list_node)) {
 		/* this rphy never saw sas_rphy_add */
--
2.17.1

2022-12-19 02:41:18

by Jason Yan

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/16 18:03, Xingui Yang wrote:
> If the ATA device fell off, call sas_ata_device_link_abort() directly and
> mark all outstanding QCs as failed and kick-off EH Immediately. This avoids
> having to wait for block layer timeouts.
>
> Signed-off-by: Xingui Yang <[email protected]>
> ---
> Changes to v1:
> - Use dev_is_sata() to check ATA device type
> drivers/scsi/libsas/sas_discover.c | 3 +++
> 1 file changed, 3 insertions(+)

Looks good,
Reviewed-by: Jason Yan <[email protected]>

2022-12-19 09:37:33

by John Garry

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 16/12/2022 10:03, Xingui Yang wrote:
> If the ATA device fell off, call sas_ata_device_link_abort() directly and
> mark all outstanding QCs as failed and kick-off EH Immediately. This avoids
> having to wait for block layer timeouts.
>
> Signed-off-by: Xingui Yang <[email protected]>
> ---
> Changes to v1:
> - Use dev_is_sata() to check ATA device type
> drivers/scsi/libsas/sas_discover.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> index d5bc1314c341..a12b65eb4a2a 100644
> --- a/drivers/scsi/libsas/sas_discover.c
> +++ b/drivers/scsi/libsas/sas_discover.c
> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port *port)
>
> void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
> {
> + if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
> + sas_ata_device_link_abort(dev, false);

Firstly, I think that there is a bug in the sas_ata_device_link_abort() ->
ata_link_abort() code in that the host lock is not grabbed, as the
comment in ata_port_abort() mentions. Having said that, libsas already
had some dodgy host locking usage - specifically dropping the lock
for the queuing path (that's something else to be fixed up ... I think
that is due to the queue command CB calling task_done() in some cases), but
I still think that sas_ata_device_link_abort() should be fixed (to grab
the host lock).
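
To illustrate the locking point, here is a minimal user-space sketch, not kernel code. Every name in it (toy_lock, toy_port, toy_link_abort(), toy_device_link_abort()) is a hypothetical stand-in. toy_link_abort() models a helper whose contract is "caller holds the lock", and toy_device_link_abort() models the suggested shape of the fix, where the wrapper takes the lock itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only; none of these are the kernel's real types. */
struct toy_lock { bool held; };

struct toy_qc { bool active; bool failed; };

struct toy_port {
    struct toy_lock lock;   /* plays the role of the port (host) lock */
    struct toy_qc qc[4];    /* outstanding queued commands */
    bool eh_scheduled;      /* set once error handling has been kicked off */
};

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);       /* non-recursive: taking it twice is a bug */
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/*
 * Models a helper documented as "caller must hold the lock".
 * Returns the number of commands marked failed, or -1 if the
 * contract was violated.
 */
static int toy_link_abort(struct toy_port *ap)
{
    int n = 0;

    if (!ap->lock.held)
        return -1;

    for (size_t i = 0; i < sizeof(ap->qc) / sizeof(ap->qc[0]); i++) {
        if (ap->qc[i].active) {
            ap->qc[i].active = false;
            ap->qc[i].failed = true;
            n++;
        }
    }
    ap->eh_scheduled = true;
    return n;
}

/* Wrapper that grabs the lock itself before calling the helper. */
static int toy_device_link_abort(struct toy_port *ap)
{
    int n;

    toy_lock_acquire(&ap->lock);
    n = toy_link_abort(ap);
    toy_lock_release(&ap->lock);
    return n;
}
```

In this model an unlocked call to toy_link_abort() is reported as -1. The real code gives no such return value, which is one reason to put the locking inside the wrapper rather than rely on every caller.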

Secondly, this just seems like a half solution to the age-old problem -
that is, EH eventually kicking in only after 30 seconds when a disk is
removed with active IO. I say half solution as SAS disks still have this
issue for libsas. Can we instead push to try to solve both of them now?

There was a broad previous discussion on this:
https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/Ykqg0kr0F*[email protected]/__;JQ!!ACWV5N9M2RV99hQ!MwAZFXXIwuP0lv-kuUIJ0ekUiGBWlTBhU3oQjyOf_yuP1rHDJb8UKMzJjndXNQ-W1PQGJXzgc0bQUsHh4NGh21EOc50$

From that discussion, Hannes was doing some related prep work series,
but I don't think it got completed.

Thanks,
John

> +
> if (!test_bit(SAS_DEV_DESTROY, &dev->state) &&
> !list_empty(&dev->disco_list_node)) {
> /* this rphy never saw sas_rphy_add */

2022-12-19 14:04:48

by Xingui Yang

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/19 17:23, John Garry wrote:
> On 16/12/2022 10:03, Xingui Yang wrote:
>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>> mark all outstanding QCs as failed and kick-off EH Immediately. This
>> avoids
>> having to wait for block layer timeouts.
>>
>> Signed-off-by: Xingui Yang <[email protected]>
>> ---
>> Changes to v1:
>> - Use dev_is_sata() to check ATA device type
>> drivers/scsi/libsas/sas_discover.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c
>> b/drivers/scsi/libsas/sas_discover.c
>> index d5bc1314c341..a12b65eb4a2a 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port
>> *port)
>> void sas_unregister_dev(struct asd_sas_port *port, struct
>> domain_device *dev)
>> {
>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>> +        sas_ata_device_link_abort(dev, false);
>
Hi, John
> Firstly, I think that there is a bug in sas_ata_device_link_abort() ->
> ata_link_abort() code in that the host lock in not grabbed, as the
> comment in ata_port_abort() mentions. Having said that, libsas had
> already some dodgy host locking usage - specifically dropping the lock
> for the queuing path (that's something else to be fixed up ... I think
> that is due to queue command CB calling task_done() in some cases), but
> I still think that sas_ata_device_link_abort() should be fixed (to grab
> the host lock).
OK, I very much agree with you on this; I had wondered before whether we
needed to grab the lock.
>
> Secondly, this just seems like a half solution to the age-old problem -
> that is, EH eventually kicking in only after 30 seconds when a disk is
> removed with active IO. I say half solution as SAS disks still have this
> issue for libsas. Can we instead push to try to solve both of them now?

Jason said you would be of the opinion that this is "a half solution".
libsas currently has no interface to mark all outstanding commands as
failed for SAS disks, and SAS disks support resuming I/O transmission
after intermittent disconnections, so I want to optimize SATA disks first.
If we want to achieve a complete solution, perhaps we need to define
such an interface in libsas and have the LLDD implement it. My current
idea is to call sas_abort_task() for all outstanding commands in the LLDD.
Would you approve of this?

Thanks,
Xingui
>
> There was a broad previous discussion on this:
> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/Ykqg0kr0F*[email protected]/__;JQ!!ACWV5N9M2RV99hQ!MwAZFXXIwuP0lv-kuUIJ0ekUiGBWlTBhU3oQjyOf_yuP1rHDJb8UKMzJjndXNQ-W1PQGJXzgc0bQUsHh4NGh21EOc50$
>
>
> From that discussion, Hannes was doing some related prep work series,
> but I don't think it got completed.
>
> Thanks,
> John
>
>> +
>>       if (!test_bit(SAS_DEV_DESTROY, &dev->state) &&
>>           !list_empty(&dev->disco_list_node)) {
>>           /* this rphy never saw sas_rphy_add */
>
> .

2022-12-19 15:07:53

by John Garry

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 19/12/2022 12:59, yangxingui wrote:
>> Firstly, I think that there is a bug in sas_ata_device_link_abort() ->
>> ata_link_abort() code in that the host lock in not grabbed, as the
>> comment in ata_port_abort() mentions. Having said that, libsas had
>> already some dodgy host locking usage - specifically dropping the lock
>> for the queuing path (that's something else to be fixed up ... I think
>> that is due to queue command CB calling task_done() in some cases),
>> but I still think that sas_ata_device_link_abort() should be fixed (to
>> grab the host lock).
> ok, I agree with you very much for this, I had doubts about whether we
> needed to grab lock before.

ok, I hope that you can fix this up separately.

>>
>> Secondly, this just seems like a half solution to the age-old problem
>> - that is, EH eventually kicking in only after 30 seconds when a disk
>> is removed with active IO. I say half solution as SAS disks still have
>> this issue for libsas. Can we instead push to try to solve both of
>> them now?
>
> Jason said you must have such an opinion "a half solution". As libsas
> does not have any interface to mark all outstanding commands as failed
> for SAS disk currently and SAS disk support I/O resumable transmission
> after intermittent disconnections

I don't know what you mean by "resumable transmission after intermittent
disconnections".

> , so I want to optimize sata disk first.
> If we want to achieve a complete solution, perhaps we need to define
> such an interface in libsas and implement it by lldd. My current idea is
> to call sas_abort_task() for all outstanding commands in lldd. I wonder
> if you approve of this?

Are you sure you mean sas_abort_task()? That is for the LLDD to issue an
abort TMF. I assume that you mean sas_task_abort(). If so, I am not too
keen on the idea of libsas calling into the LLDD to inform of such an
event. Note that maybe a tagset iter function could be used by libsas to
abort each active IO, but I don't like libsas messing with such a thing;
in addition, there may be some conflict between libsas aborting the IO
and the IO completing with error in the LLDD.

Please note that I need to refresh my memory on this whole EH topic...

Thanks,
John

2022-12-19 15:58:31

by Jason Yan

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

2022-12-19 16:06:52

by John Garry

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 19/12/2022 15:28, Jason Yan wrote:
>>> + if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>>> + sas_ata_device_link_abort(dev, false);
>>
>> Firstly, I think that there is a bug in sas_ata_device_link_abort() ->
>> ata_link_abort() code in that the host lock in not grabbed, as the
>> comment in ata_port_abort() mentions. Having said that, libsas had
>> already some dodgy host locking usage - specifically dropping the lock
>> for the queuing path (that's something else to be fixed up ... I think
>
> Taking big locks in queuing path is not a good idea. This will bring
> down performance.

But ata_qc_issue() is expected to be called with the host lock
grabbed (and kept held).

I think that the reason libsas drops the lock is that some LLDD
queuecommand CBs call task_done() in some error paths. If we kept the
lock held, then we could have a deadlock, for example:

sas_ata_qc_issue (has lock) -> lldd_execute_task() =
pm8001_queue_command() -> task_done() = sas_ata_task_done() -> grab host
lock => deadlock.
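
The call chain above can be modeled in user space. This is only a sketch under stated assumptions: toy_lock, toy_port, toy_task_done(), toy_issue_sync() and toy_issue_deferred() are hypothetical names, and the lock is a plain flag, so a second acquire is counted as a would-be deadlock instead of actually hanging. Variant A completes the task while the issuer still holds the lock. Variant B defers the completion until the lock has been dropped, which is one possible way out (in the kernel a workqueue could play that role); it is not something libsas is known to do.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock; a real spinlock would spin forever instead. */
struct toy_lock { bool held; };

static bool toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void toy_unlock(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

struct toy_port {
    struct toy_lock lock;
    bool completion_pending;  /* set when a completion was deferred */
    int completed;            /* completions that ran successfully */
    int would_deadlock;       /* completions that hit an already-held lock */
};

/* Plays the role of the completion callback, which needs the lock. */
static void toy_task_done(struct toy_port *ap)
{
    if (!toy_trylock(&ap->lock)) {
        ap->would_deadlock++;
        return;
    }
    ap->completed++;
    toy_unlock(&ap->lock);
}

/* Variant A: the error path completes the task while the lock is held. */
static void toy_issue_sync(struct toy_port *ap)
{
    (void)toy_trylock(&ap->lock);   /* issuer holds the lock */
    toy_task_done(ap);              /* error path inside "execute task" */
    toy_unlock(&ap->lock);
}

/* Variant B: the error path only records the completion; it runs later. */
static void toy_issue_deferred(struct toy_port *ap)
{
    (void)toy_trylock(&ap->lock);
    ap->completion_pending = true;  /* error path: defer instead of calling */
    toy_unlock(&ap->lock);

    if (ap->completion_pending) {   /* deferred work, lock no longer held */
        ap->completion_pending = false;
        toy_task_done(ap);
    }
}
```

Running variant A on a fresh toy_port leaves would_deadlock at 1 and completed at 0; variant B leaves would_deadlock at 0 and completed at 1.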

Thanks,
John

2022-12-19 23:32:59

by Damien Le Moal

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 12/20/22 00:28, Jason Yan wrote:
> On 2022/12/19 17:23, John Garry wrote:
>> On 16/12/2022 10:03, Xingui Yang wrote:
>>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>>> mark all outstanding QCs as failed and kick-off EH Immediately. This
>>> avoids
>>> having to wait for block layer timeouts.
>>>
>>> Signed-off-by: Xingui Yang <[email protected]>
>>> ---
>>> Changes to v1:
>>> - Use dev_is_sata() to check ATA device type
>>> drivers/scsi/libsas/sas_discover.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_discover.c
>>> b/drivers/scsi/libsas/sas_discover.c
>>> index d5bc1314c341..a12b65eb4a2a 100644
>>> --- a/drivers/scsi/libsas/sas_discover.c
>>> +++ b/drivers/scsi/libsas/sas_discover.c
>>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port
>>> *port)
>>> void sas_unregister_dev(struct asd_sas_port *port, struct
>>> domain_device *dev)
>>> {
>>> + if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>>> + sas_ata_device_link_abort(dev, false);
>>
>> Firstly, I think that there is a bug in sas_ata_device_link_abort() ->
>> ata_link_abort() code in that the host lock in not grabbed, as the
>> comment in ata_port_abort() mentions. Having said that, libsas had
>> already some dodgy host locking usage - specifically dropping the lock
>> for the queuing path (that's something else to be fixed up ... I think
>
> Taking big locks in queuing path is not a good idea. This will bring
> down performance.

With HDDs? You will not see any difference (and SATA SSDs are no longer
enough of a thing that we should worry too much; NVMe took over). And that
"big lock" in libata is really an integral part of the design. To remove
it, you would need to rewrite libata entirely...

>
>
>> that is due to queue command CB calling task_done() in some cases), but
>> I still think that sas_ata_device_link_abort() should be fixed (to grab
>> the host lock).
>
> For sas_ata_device_link_abort(), it should grab ap->lock.

Which is what libata code comments (mistakenly, in many places) always
refer to as the host lock.

>
> Thanks,
> Jason

--
Damien Le Moal
Western Digital Research

2022-12-19 23:47:11

by Damien Le Moal

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 12/20/22 00:55, John Garry wrote:
> On 19/12/2022 15:28, Jason Yan wrote:
>>>> + if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>>>> + sas_ata_device_link_abort(dev, false);
>>>
>>> Firstly, I think that there is a bug in sas_ata_device_link_abort() ->
>>> ata_link_abort() code in that the host lock in not grabbed, as the
>>> comment in ata_port_abort() mentions. Having said that, libsas had
>>> already some dodgy host locking usage - specifically dropping the lock
>>> for the queuing path (that's something else to be fixed up ... I think
>>
>> Taking big locks in queuing path is not a good idea. This will bring
>> down performance.
>
> But it is expected that ata_qc_issue() should be called with that the
> host lock grabbed (and keep it).
>
> I think that the reason libsas drops the lock is because some LLDD
> queuecommand CBs calls task_done() in some error paths. If we kept the
> lock held, then we could have a deadlock, for example:
>
> sas_ata_qc_issue (has lock) -> lldd_execute_task() =
> pm8001_queue_command() -> task_done() = sas_ata_task_done() -> grab host
> lock => deadlock.

That should be easily solvable using a workqueue for doing task_done(), no?

>
> Thanks,
> John

--
Damien Le Moal
Western Digital Research

2022-12-20 02:47:57

by Xingui Yang

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/19 22:53, John Garry wrote:
> On 19/12/2022 12:59, yangxingui wrote:
>>> Firstly, I think that there is a bug in sas_ata_device_link_abort()
>>> -> ata_link_abort() code in that the host lock in not grabbed, as the
>>> comment in ata_port_abort() mentions. Having said that, libsas had
>>> already some dodgy host locking usage - specifically dropping the
>>> lock for the queuing path (that's something else to be fixed up ... I
>>> think that is due to queue command CB calling task_done() in some
>>> cases), but I still think that sas_ata_device_link_abort() should be
>>> fixed (to grab the host lock).
>> ok, I agree with you very much for this, I had doubts about whether we
>> needed to grab lock before.
>
> ok, I hope that you can fix this up separately.
>
>>>
>>> Secondly, this just seems like a half solution to the age-old problem
>>> - that is, EH eventually kicking in only after 30 seconds when a disk
>>> is removed with active IO. I say half solution as SAS disks still
>>> have this issue for libsas. Can we instead push to try to solve both
>>> of them now?
>>
>> Jason said you must have such an opinion "a half solution". As libsas
>> does not have any interface to mark all outstanding commands as failed
>> for SAS disk currently and SAS disk support I/O resumable transmission
>> after intermittent disconnections
>
> I don't know what you mean by "resumable transmission after intermittent
> disconnections".
I mean that if a SAS disk is plugged back in within 2 seconds of being
unplugged, with power still supplied, the SAS disk can continue responding
to the active I/O. For example: the disk's phy comes up within 2 seconds
of phy down.
>
>> , so I want to optimize sata disk first.
>> If we want to achieve a complete solution, perhaps we need to define
>> such an interface in libsas and implement it by lldd. My current idea
>> is to call sas_abort_task() for all outstanding commands in lldd. I
>> wonder if you approve of this?
>
> Are you sure you mean sas_abort_task()? That is for the LLDD to issue an
> abort TMF. I assume that you mean sas_task_abort(). If so, I am not too

Yes, I mean sas_task_abort(), the two function names are confusing to
me. ^_^
> keen on the idea of libsas calling into the LLDD to inform of such an
> event. Note that maybe a tagset iter function could be used by libsas to
> abort each active IO, but I don't like libsas messing with such a thing;
> in addition, there may be some conflict between libsas aborting the IO
> and the IO completing with error in the LLDD.

I agree with you. Since we have a ready-made interface for marking all
active I/O as failed for SATA disks, it may be easier to optimize SATA
disks first. If we don't implement similar interfaces in libsas or the
LLDD, what would you suggest instead?

Thanks,
Xingui
>
> Please note that I need to refresh my memory on this whole EH topic...
>
> Thanks,
> John
>
> .

2022-12-20 02:48:11

by Jason Yan

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/19 17:23, John Garry wrote:
> On 16/12/2022 10:03, Xingui Yang wrote:
>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>> mark all outstanding QCs as failed and kick-off EH Immediately. This
>> avoids
>> having to wait for block layer timeouts.
>>
>> Signed-off-by: Xingui Yang <[email protected]>
>> ---
>> Changes to v1:
>> - Use dev_is_sata() to check ATA device type
>> drivers/scsi/libsas/sas_discover.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c
>> b/drivers/scsi/libsas/sas_discover.c
>> index d5bc1314c341..a12b65eb4a2a 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port
>> *port)
>> void sas_unregister_dev(struct asd_sas_port *port, struct
>> domain_device *dev)
>> {
>> + if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>> + sas_ata_device_link_abort(dev, false);
>
> Firstly, I think that there is a bug in sas_ata_device_link_abort() ->
> ata_link_abort() code in that the host lock in not grabbed, as the
> comment in ata_port_abort() mentions. Having said that, libsas had
> already some dodgy host locking usage - specifically dropping the lock
> for the queuing path (that's something else to be fixed up ... I think
> that is due to queue command CB calling task_done() in some cases), but
> I still think that sas_ata_device_link_abort() should be fixed (to grab
> the host lock).
>
> Secondly, this just seems like a half solution to the age-old problem -
> that is, EH eventually kicking in only after 30 seconds when a disk is
> removed with active IO. I say half solution as SAS disks still have this
> issue for libsas. Can we instead push to try to solve both of them now?
>
> There was a broad previous discussion on this:
> https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/Ykqg0kr0F*[email protected]/__;JQ!!ACWV5N9M2RV99hQ!MwAZFXXIwuP0lv-kuUIJ0ekUiGBWlTBhU3oQjyOf_yuP1rHDJb8UKMzJjndXNQ-W1PQGJXzgc0bQUsHh4NGh21EOc50$
>
>
> From that discussion, Hannes was doing some related prep work series,
> but I don't think it got completed.

That discussion is not exactly the same as our issue. It focused on
whether one device's error handling could avoid suspending the other
devices' IO dispatching on the same host. That is, something like
parallelizing the error handling for different devices.

However, what we are trying to resolve here is shortening the timeout
handling of an unplugged device. The SCSI middle layer doesn't know the
device is gone and keeps waiting for the IO until the timeout kicks in and
starts the error handling. This leaves applications stuck for a
significantly long time. But libsas does know: because it receives the
phy down event, it knows that the device will not come back and that
there is no need to wait for the timeout.
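
The difference between the two paths can be sketched with a small user-space model. All names here (toy_cmd, toy_check_timeouts(), toy_device_gone()) are hypothetical, time is a plain integer in seconds, and the 30-second value is taken from the figure quoted earlier in the thread. It only illustrates the idea of failing outstanding commands on a device-gone event instead of waiting for each timer to expire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TIMEOUT_SECS 30   /* the "30 seconds" mentioned in the thread */

struct toy_cmd {
    bool outstanding;   /* still waiting for a response */
    int issued_at;      /* time the command was issued, in seconds */
    int failed_at;      /* time the command was failed, -1 if not yet */
};

/* Timeout path: a command is failed only once its own timer expires. */
static int toy_check_timeouts(struct toy_cmd *cmds, size_t n, int now)
{
    int failed = 0;

    assert(cmds != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].outstanding &&
            now - cmds[i].issued_at >= TOY_TIMEOUT_SECS) {
            cmds[i].outstanding = false;
            cmds[i].failed_at = now;
            failed++;
        }
    }
    return failed;
}

/* Device-gone path: fail everything outstanding right away. */
static int toy_device_gone(struct toy_cmd *cmds, size_t n, int now)
{
    int failed = 0;

    assert(cmds != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].outstanding) {
            cmds[i].outstanding = false;
            cmds[i].failed_at = now;
            failed++;
        }
    }
    return failed;
}
```

With two commands issued at t=0 and t=5 and the device removed at t=10, toy_device_gone() fails both at t=10, whereas the timeout path alone would not fail the first one until t=30.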

It's true that this is a half solution. I'd like to have a complete
solution too. So we will try to solve both of them.

Thanks,
Jason

2022-12-20 08:58:09

by John Garry

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 19/12/2022 23:00, Damien Le Moal wrote:
>> But it is expected that ata_qc_issue() should be called with that the
>> host lock grabbed (and keep it).
>>
>> I think that the reason libsas drops the lock is because some LLDD
>> queuecommand CBs calls task_done() in some error paths. If we kept the
>> lock held, then we could have a deadlock, for example:
>>
>> sas_ata_qc_issue (has lock) -> lldd_execute_task() =
>> pm8001_queue_command() -> task_done() = sas_ata_task_done() -> grab host
>> lock => deadlock.
> That should be easily solvable using a workqueue for doing task_done(), no ?
>

I don't see why we cannot just always return an error code directly from
the lldd_execute_task CB - we end up calling scsi_done() directly
then. But I am suspicious as to why it is not already done this way.

Looking at the code history, this fiddling with the ap->lock actually
looks related to commit 312d3e56119a4bc5c36a96818f87f650c069ddc2
("[SCSI] libsas: remove ata_port.lock management duties from lldds"). I
will check that further.

Thanks,
John

2022-12-20 10:53:55

by Jason Yan

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/19 22:53, John Garry wrote:
> Are you sure you mean sas_abort_task()? That is for the LLDD to issue an
> abort TMF. I assume that you mean sas_task_abort(). If so, I am not too
> keen on the idea of libsas calling into the LLDD to inform of such an
> event. Note that maybe a tagset iter function could be used by libsas to
> abort each active IO, but I don't like libsas messing with such a thing;
> in addition, there may be some conflict between libsas aborting the IO
> and the IO completing with error in the LLDD.

Iterating over the tagset in libsas is odd.

The question is: shall we implement the aborting from the driver side,
as sas_ata_device_link_abort() does? Or shall we implement the
aborting from the upper side (SCSI middle layer or block layer), such as
triggering the block layer timeout handler immediately after we find the
device is gone?

Thanks,
Jason

2022-12-21 09:49:41

by John Garry

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 20/12/2022 09:49, Jason Yan wrote:
>
> Itering tagset in libsas is odd.

Itering with block layer APIs is just a method to deal with each active
IO. However, libsas should not be aborting IO directly. It may provide
helper routines, but the LLDD should be dealing with aborting IO.

>
> The question is, shall we implement the aborting from the driver side,
> such as what sas_ata_device_link_abort() do. Or shall we implement the
> aborting from the upper side(scsi middle layer or block layer), such as
> trigger block layer time out handler immediately after we found device
> is gone?

As mentioned, aborting each IO should be the job of the LLDD. However,
just making the IO time out will lead to EH kicking in earlier, and EH
will do the usual per-IO handling in sas_eh_handle_sas_errors() that would
happen when the IO times out normally - so what are we really gaining
here? Just that EH kicks in earlier. But we still have the problem of all
other per-host IO being blocked while EH is active.
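
The remaining problem, that every device on the host is held up while EH runs, can be shown with a very small user-space model. The names toy_host, toy_dev, toy_dispatch(), toy_eh_begin() and toy_eh_end() are hypothetical, and the model assumes a single per-host "EH active" flag gating all dispatch; it is a sketch of the effect, not of the SCSI midlayer's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_host { bool eh_active; };

struct toy_dev {
    struct toy_host *host;  /* devices share one host */
    bool gone;              /* device was unplugged */
    int dispatched;         /* commands successfully dispatched */
};

static void toy_eh_begin(struct toy_host *h) { h->eh_active = true; }
static void toy_eh_end(struct toy_host *h)   { h->eh_active = false; }

/*
 * Dispatch is refused while the host is in error handling, even for
 * a device that is perfectly healthy.
 */
static bool toy_dispatch(struct toy_dev *d)
{
    assert(d->host != NULL);
    if (d->host->eh_active || d->gone)
        return false;
    d->dispatched++;
    return true;
}
```

Starting EH earlier shortens the wait for the removed device, but in this model a healthy device on the same host still cannot dispatch until toy_eh_end() is called.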

Thanks,
John

2022-12-21 10:18:48

by Xingui Yang

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/19 22:53, John Garry wrote:
> On 19/12/2022 12:59, yangxingui wrote:
>>> Firstly, I think that there is a bug in sas_ata_device_link_abort()
>>> -> ata_link_abort() code in that the host lock in not grabbed, as the
>>> comment in ata_port_abort() mentions. Having said that, libsas had
>>> already some dodgy host locking usage - specifically dropping the
>>> lock for the queuing path (that's something else to be fixed up ... I
>>> think that is due to queue command CB calling task_done() in some
>>> cases), but I still think that sas_ata_device_link_abort() should be
>>> fixed (to grab the host lock).
>> ok, I agree with you very much for this, I had doubts about whether we
>> needed to grab lock before.
>
> ok, I hope that you can fix this up separately.
>
>>>
>>> Secondly, this just seems like a half solution to the age-old problem
>>> - that is, EH eventually kicking in only after 30 seconds when a disk
>>> is removed with active IO. I say half solution as SAS disks still
>>> have this issue for libsas. Can we instead push to try to solve both
>>> of them now?
>>
>> Jason said you must have such an opinion "a half solution". As libsas
>> does not have any interface to mark all outstanding commands as failed
>> for SAS disk currently and SAS disk support I/O resumable transmission
>> after intermittent disconnections
>
> I don't know what you mean by "resumable transmission after intermittent
> disconnections".
>
>> , so I want to optimize sata disk first.
>> If we want to achieve a complete solution, perhaps we need to define
>> such an interface in libsas and implement it by lldd. My current idea
>> is to call sas_abort_task() for all outstanding commands in lldd. I
>> wonder if you approve of this?
>
> Are you sure you mean sas_abort_task()? That is for the LLDD to issue an
> abort TMF. I assume that you mean sas_task_abort(). If so, I am not too
> keen on the idea of libsas calling into the LLDD to inform of such an
> event.
I've implemented this solution. Verification seems OK for both SAS and
SATA devices. I'll send an updated version; could you please have a look?

Thanks,
Xingui
> Note that maybe a tagset iter function could be used by libsas to
> abort each active IO, but I don't like libsas messing with such a thing;
> in addition, there may be some conflict between libsas aborting the IO
> and the IO completing with error in the LLDD.
>
> Please note that I need to refresh my memory on this whole EH topic...
>
> Thanks,
> John
>
> .

2022-12-21 10:56:02

by Jason Yan

Subject: Re: [PATCH V2] scsi: libsas: Directly kick-off EH when ATA device fell off

On 2022/12/21 17:40, John Garry wrote:
> On 20/12/2022 09:49, Jason Yan wrote:
>>
>> Itering tagset in libsas is odd.
>
> Itering with block layer APIs is just a method to deal with each active
> IO. However, libsas should not be aborting IO directly. It may provide
> helper routines, but the LLDD should be dealing with aborting IO.
>
> >
> > The question is, shall we implement the aborting from the driver side,
> > such as what sas_ata_device_link_abort() do. Or shall we implement the
> > aborting from the upper side(scsi middle layer or block layer), such as
> > trigger block layer time out handler immediately after we found device
> > is gone?
>
> As mentioned, aborting each IO should be the job of the LLDD. However,
> just making the IO timeout will lead to EH kicking in earlier, and EH
> will do usual per-IO handling in sas_eh_handle_sas_errors() that would
> happen when the IO timesout normally - so what are we really gaining
> here? Just EH kicks in earlier. But we still have the problem of all
> other per-host IO being blocked while EH is active.

This is not the same issue, as I replied yesterday:
https://lkml.org/lkml/2022/12/19/1034

Thanks,
Jason