LinuxLists.cc - [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link

2022-08-17 15:32:57

Subject: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
to call to initiate a link abort.

This will mark all outstanding QCs as failed and kick-off EH.

Note:
The ATA_EH_RESET flag is set for following reasons:
- For hisi_sas, SATA device resources during error handling will only be
released during reset for ATA EH.
ATA EH could decide during autopsy that EH would not be required, so
ensure that it happens (by setting the flag).
- Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
ensure necessary recovery commands are sent (so again we require flag
ATA_EH_RESET to be set as an insurance policy).

Suggested-by: Damien Le Moal <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
include/scsi/sas_ata.h | 5 +++++
2 files changed, 16 insertions(+)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index d35c9296f738..9daae64be37e 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
ata_port_wait_eh(ap);
}

+void sas_ata_device_link_abort(struct domain_device *device)
+{
+ struct ata_port *ap = device->sata_dev.ap;
+ struct ata_link *link = &ap->link;
+
+ link->eh_info.err_mask |= AC_ERR_DEV;
+ link->eh_info.action |= ATA_EH_RESET;
+ ata_link_abort(link);
+}
+EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
+
int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
{
struct sas_tmf_task tmf_task = {};
diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
index a1df4f9d57a3..cad0b33064a5 100644
--- a/include/scsi/sas_ata.h
+++ b/include/scsi/sas_ata.h
@@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
void sas_suspend_sata(struct asd_sas_port *port);
void sas_resume_sata(struct asd_sas_port *port);
void sas_ata_end_eh(struct ata_port *ap);
+void sas_ata_device_link_abort(struct domain_device *dev);
int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
int force_phy_id);
int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
@@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
{
}

+static inline void sas_ata_device_link_abort(struct domain_device *dev)
+{
+}
+
static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
int force_phy_id)
{
--
2.35.3

2022-08-17 16:28:45

by Damien Le Moal

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

On 2022/08/17 7:52, John Garry wrote:
> Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
> to call to initiate a link abort.
>
> This will mark all outstanding QCs as failed and kick-off EH.
>
> Note:
> The ATA_EH_RESET flag is set for following reasons:
> - For hisi_sas, SATA device resources during error handling will only be
> released during reset for ATA EH.
> ATA EH could decide during autopsy that EH would not be required, so
> ensure that it happens (by setting the flag).
> - Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
> ensure necessary recovery commands are sent (so again we require flag
> ATA_EH_RESET to be set as an insurance policy).
>
> Suggested-by: Damien Le Moal <[email protected]>
> Signed-off-by: John Garry <[email protected]>
> ---
> drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
> include/scsi/sas_ata.h | 5 +++++
> 2 files changed, 16 insertions(+)
>
> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
> index d35c9296f738..9daae64be37e 100644
> --- a/drivers/scsi/libsas/sas_ata.c
> +++ b/drivers/scsi/libsas/sas_ata.c
> @@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
> ata_port_wait_eh(ap);
> }
>
> +void sas_ata_device_link_abort(struct domain_device *device)
> +{
> + struct ata_port *ap = device->sata_dev.ap;
> + struct ata_link *link = &ap->link;
> +
> + link->eh_info.err_mask |= AC_ERR_DEV;
> + link->eh_info.action |= ATA_EH_RESET;

I am still not convinced that we should set this here. ata_eh_link_autopsy() and
ata_eh_analyze_serror() are supposed to set the action based on the error. Can't
you reuse the link autopsy function ?

> + ata_link_abort(link);
> +}
> +EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
> +
> int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
> {
> struct sas_tmf_task tmf_task = {};
> diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
> index a1df4f9d57a3..cad0b33064a5 100644
> --- a/include/scsi/sas_ata.h
> +++ b/include/scsi/sas_ata.h
> @@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
> void sas_suspend_sata(struct asd_sas_port *port);
> void sas_resume_sata(struct asd_sas_port *port);
> void sas_ata_end_eh(struct ata_port *ap);
> +void sas_ata_device_link_abort(struct domain_device *dev);
> int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
> int force_phy_id);
> int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
> @@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
> {
> }
>
> +static inline void sas_ata_device_link_abort(struct domain_device *dev)
> +{
> +}
> +
> static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
> int force_phy_id)
> {

--
Damien Le Moal
Western Digital Research

2022-08-17 17:41:52

by Damien Le Moal

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

On 2022/08/17 9:54, John Garry wrote:
> On 17/08/2022 17:04, Damien Le Moal wrote:
>> On 2022/08/17 7:52, John Garry wrote:
>>> Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
>>> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
>>> to call to initiate a link abort.
>>>
>>> This will mark all outstanding QCs as failed and kick-off EH.
>>>
>>> Note:
>>> The ATA_EH_RESET flag is set for following reasons:
>>> - For hisi_sas, SATA device resources during error handling will only be
>>> released during reset for ATA EH.
>>> ATA EH could decide during autopsy that EH would not be required, so
>>> ensure that it happens (by setting the flag).
>>> - Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
>>> ensure necessary recovery commands are sent (so again we require flag
>>> ATA_EH_RESET to be set as an insurance policy).
>>>
>>> Suggested-by: Damien Le Moal <[email protected]>
>>> Signed-off-by: John Garry <[email protected]>
>>> ---
>>> drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
>>> include/scsi/sas_ata.h | 5 +++++
>>> 2 files changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>>> index d35c9296f738..9daae64be37e 100644
>>> --- a/drivers/scsi/libsas/sas_ata.c
>>> +++ b/drivers/scsi/libsas/sas_ata.c
>>> @@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
>>> ata_port_wait_eh(ap);
>>> }
>>>
>>> +void sas_ata_device_link_abort(struct domain_device *device)
>>> +{
>>> + struct ata_port *ap = device->sata_dev.ap;
>>> + struct ata_link *link = &ap->link;
>>> +
>>> + link->eh_info.err_mask |= AC_ERR_DEV;
>>> + link->eh_info.action |= ATA_EH_RESET;
>>
>
> Hi Damien,
>
> Warning: I'm going to write a bit of an essay here... :)
>
>> I am still not convinced that we should set this here. ata_eh_link_autopsy() and
>> ata_eh_analyze_serror() are supposed to set the action based on the error. Can't
>> you reuse the link autopsy function ?
>
> The link autopsy code works ok, but it may not decide to do the reset
> which we may rely on.
>
> For hisi_sas, consider this log from an NCQ error:
>
> 174.385322] ata10.00: failed command: READ FPDMA QUEUED
> [ 174.385336] ata10.00: cmd 60/08:00:11:27:00/00:00:00:00:00/40 tag 0
> ncq dma 4096 in
> res 41/40:08:11:27:00/00:00:00:00:00/00 Emask 0x409 (media error) <F>
> [ 174.385339] ata10.00: status: { DRDY ERR }
> [ 174.385343] ata10.00: error: { UNC }
> [ 174.385641] ata10.00: configured for UDMA/133
> [ 174.385697] sd 6:0:1:0: [sdb] tag#701 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [ 174.385710] sd 6:0:1:0: [sdb] tag#701 Sense Key : Medium Error [current]
> [ 174.385726] sd 6:0:1:0: [sdb] tag#701 Add. Sense: Unrecovered read
> error - auto reallocate failed
> [ 174.385740] sd 6:0:1:0: [sdb] tag#701 CDB: Read(10) 28 00 00 00 27 11
> 00 00 08 00
> [ 174.385745] print_req_error: I/O error, dev sdb, sector 10001
> [ 174.385827] ata10: EH complete
>
> The ATA autopsy has a found medium error and decided that reset is not
> required - this is similar in that regard to the "unaligned write to a
> sequential write required zone on SMR" error you mentioned from your
> test previously. The problem in this is that for hisi_sas we depend on
> disk reset to release driver resources associated with ATA QCs. That is
> because it is only after reset that we can guarantee that no IO
> associated with the disk will complete in HW and it is safe to release
> the resources.

If you had an error, then you already are guaranteed that you will not see any
completion at all since the SATA drive is in error mode already. But I see the
point here. The HBA internal qc resources need to be cleared and that seems to
be done only with a device reset big hammer.

> But pm8001 seems different here with regards releasing resources. I find
> that when EH kicks in from NCQ error and libsas tries to abort missing
> commands, the pm8001_abort_task() -> sas_execute_internal_abort_single()
> causes the original IO to complete as aborted - this is good, as then we
> may release the resources there. hisi_sas has no such feature.
>
> But the pm8001 manual and current driver indicate that the
> OPC_INB_SATA_ABORT command should be sent after read log ext when
> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()

You lost me: ata_eh_recover() will call ata_eh_reset() only if the ATA_EH_RESET
action flag is set. So are you saying that even though it is not needed, you
still need to set ATA_EH_RESET for pm8001 ?

> As I mentioned before, I saw nowhere better to call
> pm8001_send_abort_all() for this. I would rather not do it in
> ata_eh_reset() -> pm8001_I_T_nexus_reset()

We could add a new op ->eh_link_autopsy which we can call if defined after the
call to ata_eh_analyze_ncq_error() in ata_eh_link_autopsy(). With that, you can
set ATA_EH_RESET for the hisi driver and only do pm8001_send_abort_all() for
pm8001 (that will be done after the read log 10h).

>
> How about this modified approach:
> - Continue to set ATA_EH_RESET in sas_ata_device_link_abort()
> - pm8001_I_T_nexus_reset() will only call pm8001_send_abort_all() when
> the driver is in NCQ error state and not do a hard reset in that case.
> - But I am not sure if that works as the autopsy for NCQ error may have
> decided that a hardreset was really required. Hmmm..

See the above. the new op may decided a reset is needed (hisi case) and even if
the standard autopsy does not make that decision, the flag is set and
ata_eh_recover() will reset the device. For the pm8001 case, no reset set with
the new op, only pm8001_send_abort_all(). And even if ata_eh_link_autopsy()
decides that a reset is needed after calling the new op, that would still be OK
I think.

--
Damien Le Moal
Western Digital Research

2022-08-17 17:54:17

by John Garry

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

On 17/08/2022 17:04, Damien Le Moal wrote:
> On 2022/08/17 7:52, John Garry wrote:
>> Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
>> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
>> to call to initiate a link abort.
>>
>> This will mark all outstanding QCs as failed and kick-off EH.
>>
>> Note:
>> The ATA_EH_RESET flag is set for following reasons:
>> - For hisi_sas, SATA device resources during error handling will only be
>> released during reset for ATA EH.
>> ATA EH could decide during autopsy that EH would not be required, so
>> ensure that it happens (by setting the flag).
>> - Similar to hisi_sas, pm8001 NCQ error handling requires a hardreset to
>> ensure necessary recovery commands are sent (so again we require flag
>> ATA_EH_RESET to be set as an insurance policy).
>>
>> Suggested-by: Damien Le Moal <[email protected]>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> drivers/scsi/libsas/sas_ata.c | 11 +++++++++++
>> include/scsi/sas_ata.h | 5 +++++
>> 2 files changed, 16 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>> index d35c9296f738..9daae64be37e 100644
>> --- a/drivers/scsi/libsas/sas_ata.c
>> +++ b/drivers/scsi/libsas/sas_ata.c
>> @@ -861,6 +861,17 @@ void sas_ata_wait_eh(struct domain_device *dev)
>> ata_port_wait_eh(ap);
>> }
>>
>> +void sas_ata_device_link_abort(struct domain_device *device)
>> +{
>> + struct ata_port *ap = device->sata_dev.ap;
>> + struct ata_link *link = &ap->link;
>> +
>> + link->eh_info.err_mask |= AC_ERR_DEV;
>> + link->eh_info.action |= ATA_EH_RESET;
>

Hi Damien,

Warning: I'm going to write a bit of an essay here... :)

> I am still not convinced that we should set this here. ata_eh_link_autopsy() and
> ata_eh_analyze_serror() are supposed to set the action based on the error. Can't
> you reuse the link autopsy function ?

The link autopsy code works ok, but it may not decide to do the reset
which we may rely on.

For hisi_sas, consider this log from an NCQ error:

174.385322] ata10.00: failed command: READ FPDMA QUEUED
[ 174.385336] ata10.00: cmd 60/08:00:11:27:00/00:00:00:00:00/40 tag 0
ncq dma 4096 in
res 41/40:08:11:27:00/00:00:00:00:00/00 Emask 0x409 (media error) <F>
[ 174.385339] ata10.00: status: { DRDY ERR }
[ 174.385343] ata10.00: error: { UNC }
[ 174.385641] ata10.00: configured for UDMA/133
[ 174.385697] sd 6:0:1:0: [sdb] tag#701 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 174.385710] sd 6:0:1:0: [sdb] tag#701 Sense Key : Medium Error [current]
[ 174.385726] sd 6:0:1:0: [sdb] tag#701 Add. Sense: Unrecovered read
error - auto reallocate failed
[ 174.385740] sd 6:0:1:0: [sdb] tag#701 CDB: Read(10) 28 00 00 00 27 11
00 00 08 00
[ 174.385745] print_req_error: I/O error, dev sdb, sector 10001
[ 174.385827] ata10: EH complete

The ATA autopsy has a found medium error and decided that reset is not
required - this is similar in that regard to the "unaligned write to a
sequential write required zone on SMR" error you mentioned from your
test previously. The problem in this is that for hisi_sas we depend on
disk reset to release driver resources associated with ATA QCs. That is
because it is only after reset that we can guarantee that no IO
associated with the disk will complete in HW and it is safe to release
the resources.

But pm8001 seems different here with regards releasing resources. I find
that when EH kicks in from NCQ error and libsas tries to abort missing
commands, the pm8001_abort_task() -> sas_execute_internal_abort_single()
causes the original IO to complete as aborted - this is good, as then we
may release the resources there. hisi_sas has no such feature.

But the pm8001 manual and current driver indicate that the
OPC_INB_SATA_ABORT command should be sent after read log ext when
handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()

As I mentioned before, I saw nowhere better to call
pm8001_send_abort_all() for this. I would rather not do it in
ata_eh_reset() -> pm8001_I_T_nexus_reset()

How about this modified approach:
- Continue to set ATA_EH_RESET in sas_ata_device_link_abort()
- pm8001_I_T_nexus_reset() will only call pm8001_send_abort_all() when
the driver is in NCQ error state and not do a hard reset in that case.
- But I am not sure if that works as the autopsy for NCQ error may have
decided that a hardreset was really required. Hmmm..

Thanks,
John

>
>> + ata_link_abort(link);
>> +}
>> +EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
>> +
>> int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
>> {
>> struct sas_tmf_task tmf_task = {};
>> diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
>> index a1df4f9d57a3..cad0b33064a5 100644
>> --- a/include/scsi/sas_ata.h
>> +++ b/include/scsi/sas_ata.h
>> @@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
>> void sas_suspend_sata(struct asd_sas_port *port);
>> void sas_resume_sata(struct asd_sas_port *port);
>> void sas_ata_end_eh(struct ata_port *ap);
>> +void sas_ata_device_link_abort(struct domain_device *dev);
>> int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
>> int force_phy_id);
>> int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
>> @@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
>> {
>> }
>>
>> +static inline void sas_ata_device_link_abort(struct domain_device *dev)
>> +{
>> +}
>> +
>> static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
>> int force_phy_id)
>> {
>
>

2022-08-18 12:45:33

by John Garry

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

On 17/08/2022 18:14, Damien Le Moal wrote:
>> The ATA autopsy has a found medium error and decided that reset is not
>> required - this is similar in that regard to the "unaligned write to a
>> sequential write required zone on SMR" error you mentioned from your
>> test previously. The problem in this is that for hisi_sas we depend on
>> disk reset to release driver resources associated with ATA QCs. That is
>> because it is only after reset that we can guarantee that no IO
>> associated with the disk will complete in HW and it is safe to release
>> the resources.
> If you had an error, then you already are guaranteed that you will not see any
> completion at all since the SATA drive is in error mode already. But I see the
> point here. The HBA internal qc resources need to be cleared and that seems to
> be done only with a device reset big hammer.

Yeah, unfortunately

>
>> But pm8001 seems different here with regards releasing resources. I find
>> that when EH kicks in from NCQ error and libsas tries to abort missing
>> commands, the pm8001_abort_task() -> sas_execute_internal_abort_single()
>> causes the original IO to complete as aborted - this is good, as then we
>> may release the resources there. hisi_sas has no such feature.
>>
>> But the pm8001 manual and current driver indicate that the
>> OPC_INB_SATA_ABORT command should be sent after read log ext when
>> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
>> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()
> You lost me: ata_eh_recover() will call ata_eh_reset() only if the ATA_EH_RESET
> action flag is set. So are you saying that even though it is not needed, you
> still need to set ATA_EH_RESET for pm8001 ?

As below, it was the only location I found suitable to call
pm8001_send_abort_all().

However I am not really sure it is required now. For pm8001 NCQ error
handling we require 2x steps:
- read log ext
- Send OPC_INB_SATA_ABORT - we do this in pm8001_send_abort_all()

pm8001_send_abort_all() sends OPC_INB_SATA_ABORT in "device abort all"
mode, meaning any IO in the HBA is aborted for the device. But we are
also earlier in EH sending OPC_INB_SATA_ABORT for individual IOs in
sas_eh_handle_sas_errors() -> sas_scsi_find_task() ->
pm8001_abort_task() -> sas_execute_internal_abort_single() -> ...
send_abort_task()

So I don't think that the pm8001_send_abort_all() call has any effect,
as we're already aborting any outstanding IO earlier.

Admittedly the order of the 2x steps is different, but
OPC_INB_SATA_ABORT does not send any protocol message to the disk, so
would not affect anything subsequently read with read log ext.

Having said all that, it may be wise to still send
pm8001_send_abort_all()...

>
>> As I mentioned before, I saw nowhere better to call
>> pm8001_send_abort_all() for this. I would rather not do it in
>> ata_eh_reset() -> pm8001_I_T_nexus_reset()
> We could add a new op ->eh_link_autopsy which we can call if defined after the
> call to ata_eh_analyze_ncq_error() in ata_eh_link_autopsy(). With that, you can
> set ATA_EH_RESET for the hisi driver and only do pm8001_send_abort_all() for
> pm8001 (that will be done after the read log 10h).

hmmmm.... seems unfortunate if we need to add a new op just for this.

If we supported ata_port_operations.softreset CB for libsas, then it
seems a good location to issue pm8001_send_abort_all(). However, ATA EH
always prefers hardreset over softreset if both supported - do you see
any scope to change this so that we could use softreset?

>
>> How about this modified approach:
>> - Continue to set ATA_EH_RESET in sas_ata_device_link_abort()
>> - pm8001_I_T_nexus_reset() will only call pm8001_send_abort_all() when
>> the driver is in NCQ error state and not do a hard reset in that case.
>> - But I am not sure if that works as the autopsy for NCQ error may have
>> decided that a hardreset was really required. Hmmm..
> See the above. the new op may decided a reset is needed (hisi case) and even if
> the standard autopsy does not make that decision, the flag is set and
> ata_eh_recover() will reset the device. For the pm8001 case, no reset set with
> the new op, only pm8001_send_abort_all(). And even if ata_eh_link_autopsy()
> decides that a reset is needed after calling the new op, that would still be OK
> I think.

I just think that for hisi_sas a reset is always required unfortunately
- either an ATA softreset or hardreset. For pm8001, as you say, we can
let autopsy decide whether the big hammer (hard) reset is required.

Thanks,
John

2022-09-02 16:33:22

by John Garry

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

Hi Damien,

>>>
>>> But the pm8001 manual and current driver indicate that the
>>> OPC_INB_SATA_ABORT command should be sent after read log ext when
>>> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
>>> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()
>> You lost me: ata_eh_recover() will call ata_eh_reset() only if the
>> ATA_EH_RESET
>> action flag is set. So are you saying that even though it is not
>> needed, you
>> still need to set ATA_EH_RESET for pm8001 ?
>
> As below, it was the only location I found suitable to call
> pm8001_send_abort_all().
>
> However I am not really sure it is required now. For pm8001 NCQ error
> handling we require 2x steps:
> - read log ext
> - Send OPC_INB_SATA_ABORT - we do this in pm8001_send_abort_all()
>
> pm8001_send_abort_all() sends OPC_INB_SATA_ABORT in "device abort all"
> mode, meaning any IO in the HBA is aborted for the device. But we are
> also earlier in EH sending OPC_INB_SATA_ABORT for individual IOs in
> sas_eh_handle_sas_errors() -> sas_scsi_find_task() ->
> pm8001_abort_task() -> sas_execute_internal_abort_single() -> ...
> send_abort_task()
>
> So I don't think that the pm8001_send_abort_all() call has any effect,
> as we're already aborting any outstanding IO earlier.
>
> Admittedly the order of the 2x steps is different, but
> OPC_INB_SATA_ABORT does not send any protocol message to the disk, so
> would not affect anything subsequently read with read log ext.
>
> Having said all that, it may be wise to still send
> pm8001_send_abort_all()..

Have you had a chance to consider all this yet?

I am now starting to think that it is not necessary to call
pm8001_send_abort_all(), and instead rely only on
sas_eh_handle_sas_errors() -> sas_scsi_find_task() ->
pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... ->
send_abort_task() to abort each outstanding IO. Then we would not have
the issue of forcing the reset in $subject (to lead to calling
pm8001_send_abort_all()).

This idea could simply be tested by stubbing pm8001_send_abort_all()
(and dropping the |= ATA_EH_RESET in $subject).

Thanks,
John

2022-09-05 23:35:45

by Damien Le Moal

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] scsi: libsas: Add sas_ata_device_link_abort()

On 9/3/22 01:19, John Garry wrote:
> Hi Damien,
>
>>>>
>>>> But the pm8001 manual and current driver indicate that the
>>>> OPC_INB_SATA_ABORT command should be sent after read log ext when
>>>> handling NCQ error, regardless of an autopsy. I send OPC_INB_SATA_ABORT
>>>> in ata_eh_reset() -> pm8001_I_T_nexus_reset() -> pm8001_send_abort_all()
>>> You lost me: ata_eh_recover() will call ata_eh_reset() only if the
>>> ATA_EH_RESET
>>> action flag is set. So are you saying that even though it is not
>>> needed, you
>>> still need to set ATA_EH_RESET for pm8001 ?
>>
>> As below, it was the only location I found suitable to call
>> pm8001_send_abort_all().
>>
>> However I am not really sure it is required now. For pm8001 NCQ error
>> handling we require 2x steps:
>> - read log ext
>> - Send OPC_INB_SATA_ABORT - we do this in pm8001_send_abort_all()
>>
>> pm8001_send_abort_all() sends OPC_INB_SATA_ABORT in "device abort all"
>> mode, meaning any IO in the HBA is aborted for the device. But we are
>> also earlier in EH sending OPC_INB_SATA_ABORT for individual IOs in
>> sas_eh_handle_sas_errors() -> sas_scsi_find_task() ->
>> pm8001_abort_task() -> sas_execute_internal_abort_single() -> ...
>> send_abort_task()
>>
>> So I don't think that the pm8001_send_abort_all() call has any effect,
>> as we're already aborting any outstanding IO earlier.
>>
>> Admittedly the order of the 2x steps is different, but
>> OPC_INB_SATA_ABORT does not send any protocol message to the disk, so
>> would not affect anything subsequently read with read log ext.
>>
>> Having said all that, it may be wise to still send
>> pm8001_send_abort_all()..
>
> Have you had a chance to consider all this yet?
>
> I am now starting to think that it is not necessary to call
> pm8001_send_abort_all(), and instead rely only on
> sas_eh_handle_sas_errors() -> sas_scsi_find_task() ->
> pm8001_abort_task() -> sas_execute_internal_abort_single() -> ... ->
> send_abort_task() to abort each outstanding IO. Then we would not have
> the issue of forcing the reset in $subject (to lead to calling
> pm8001_send_abort_all()).
>
> This idea could simply be tested by stubbing pm8001_send_abort_all()
> (and dropping the |= ATA_EH_RESET in $subject).

Send a re-spinned patch and I will test :)

>
> Thanks,
> John
>

--
Damien Le Moal
Western Digital Research