2023-06-23 18:38:10

by Lorenz Brun

[permalink] [raw]
Subject: [PATCH] ata: libata-scsi: fix bogus SCSI sense after abort

Since commit 058e55e120ca which fixed that commands without valid
error/status codes did not result in any sense error, the returned sense
errors were completely bogus as ata_to_sense_error did not have valid
inputs in the first place.

For example the following ATA error

exception Emask 0x10 SAct 0x20c000 SErr 0x280100 action 0x6 frozen
irq_stat 0x08000000, interface fatal error
SError: { UnrecovData 10B8B BadCRC }
failed command: READ FPDMA QUEUED
cmd 60/e0:70:20:0a:00/00:00:00:00:00/40 tag 14 ncq dma 114688 in
res 40/00:ac:20:5e:50/00:00:5d:01:00/40 Emask 0x10 (ATA bus error)
status: { DRDY }

got turned into the following nonsensical SCSI error

FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Sense Key : Illegal Request [current]
Add. Sense: Unaligned write command
CDB: Read(16) 88 00 00 00 00 00 00 00 0a 20 00 00 00 e0 00 00

This has nothing to do with an unaligned write command, but is due to an
ATA EH-triggered abort. But ata_to_sense_error only knows about
status and error, both of which aren't even valid here as the command
has been aborted.

Add an additional section to ata_gen_ata_sense which handles
errors not coming from the device first, before calling into
ata_to_sense_error.

According to the SAT-5 spec a reset should cause a Unit Attention event,
which the SCSI subsystem should handle to retry its commands but I
am not sure how much of that infra is present in Linux's SCSI layer, so
this is a simpler solution.

Signed-off-by: Lorenz Brun <[email protected]>
---
drivers/ata/libata-scsi.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 551077cea4e4..61c6a4e8123a 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -13,6 +13,7 @@
* - http://www.t13.org/
*/

+#include "scsi/scsi_proto.h"
#include <linux/compat.h>
#include <linux/slab.h>
#include <linux/kernel.h>
@@ -1013,6 +1014,21 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
ata_scsi_set_sense(dev, cmd, NOT_READY, 0x04, 0x21);
return;
}
+ if (qc->err_mask & (AC_ERR_HSM | AC_ERR_ATA_BUS | AC_ERR_HOST_BUS |
+ AC_ERR_SYSTEM | AC_ERR_OTHER)) {
+ /* Command aborted because of some issue with the ATA subsystem
+ * Should technically cause unit attention, but this is better
+ * than nothing, which results in nonsensical errors.
+ * POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
+ */
+ ata_scsi_set_sense(dev, cmd, ABORTED_COMMAND, 0x29, 0x00);
+ return;
+ }
+ if (qc->err_mask & AC_ERR_TIMEOUT) {
+ /* COMMAND TIMEOUT DURING PROCESSING */
+ ata_scsi_set_sense(dev, cmd, ABORTED_COMMAND, 0x2e, 0x02);
+ return;
+ }
/* Use ata_to_sense_error() to map status register bits
* onto sense key, asc & ascq.
*/
--
2.40.1



2023-06-26 08:05:34

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] ata: libata-scsi: fix bogus SCSI sense after abort

On 6/24/23 03:19, Lorenz Brun wrote:
> Since commit 058e55e120ca which fixed that commands without valid
> error/status codes did not result in any sense error, the returned sense
> errors were completely bogus as ata_to_sense_error did not have valid
> inputs in the first place.
>
> For example the following ATA error
>
> exception Emask 0x10 SAct 0x20c000 SErr 0x280100 action 0x6 frozen
> irq_stat 0x08000000, interface fatal error
> SError: { UnrecovData 10B8B BadCRC }
> failed command: READ FPDMA QUEUED
> cmd 60/e0:70:20:0a:00/00:00:00:00:00/40 tag 14 ncq dma 114688 in
> res 40/00:ac:20:5e:50/00:00:5d:01:00/40 Emask 0x10 (ATA bus error)
> status: { DRDY }
>
> got turned into the following nonsensical SCSI error
>
> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> Sense Key : Illegal Request [current]
> Add. Sense: Unaligned write command
> CDB: Read(16) 88 00 00 00 00 00 00 00 0a 20 00 00 00 e0 00 00
>
> This has nothing to do with an unaligned write command, but is due to an
> ATA EH-triggered abort. But ata_to_sense_error only knows about
> status and error, both of which aren't even valid here as the command
> has been aborted.
>
> Add an additional section to ata_gen_ata_sense which handles
> errors not coming from the device first, before calling into
> ata_to_sense_error.
>
> According to the SAT-5 spec a reset should cause a Unit Attention event,
> which the SCSI subsystem should handle to retry its commands but I
> am not sure how much of that infra is present in Linux's SCSI layer, so
> this is a simpler solution.
>
> Signed-off-by: Lorenz Brun <[email protected]>
> ---
> drivers/ata/libata-scsi.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 551077cea4e4..61c6a4e8123a 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -13,6 +13,7 @@
> * - http://www.t13.org/
> */
>
> +#include "scsi/scsi_proto.h"
> #include <linux/compat.h>
> #include <linux/slab.h>
> #include <linux/kernel.h>
> @@ -1013,6 +1014,21 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
> ata_scsi_set_sense(dev, cmd, NOT_READY, 0x04, 0x21);
> return;
> }
> + if (qc->err_mask & (AC_ERR_HSM | AC_ERR_ATA_BUS | AC_ERR_HOST_BUS |
> + AC_ERR_SYSTEM | AC_ERR_OTHER)) {

Did you check SATA IO specs and/or AHCI to see if that says anything about these
? And I wonder if we should check if we have something in tf->status and
tf->error...

> + /* Command aborted because of some issue with the ATA subsystem
> + * Should technically cause unit attention, but this is better
> + * than nothing, which results in nonsensical errors.
> + * POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
> + */

Multi-line comment style: start with a "/*" line please. The phrasing of the
comment is not very clear. Maybe something like:

/*
* If the command aborted because of some issue with the
* adapter or link, report a POWER ON, RESET, OR BUS DEVICE
* RESET OCCURRED error.
*/

Did you check that all of the above error flags lead to a drive reset ? The
issue I have with this is that the drive reset is triggered by libata EH after
it got these bad errors but the sense data you use here normally indicate that
the reset was initiated by the adapter or the drive. Not sure this is ideal.

> + ata_scsi_set_sense(dev, cmd, ABORTED_COMMAND, 0x29, 0x00);
> + return;
> + }
> + if (qc->err_mask & AC_ERR_TIMEOUT) {
> + /* COMMAND TIMEOUT DURING PROCESSING */
> + ata_scsi_set_sense(dev, cmd, ABORTED_COMMAND, 0x2e, 0x02);
> + return;
> + }
> /* Use ata_to_sense_error() to map status register bits
> * onto sense key, asc & ascq.
> */

--
Damien Le Moal
Western Digital Research


2023-06-26 08:10:53

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] ata: libata-scsi: fix bogus SCSI sense after abort

On 6/26/23 16:46, Hannes Reinecke wrote:
> On 6/26/23 09:29, Damien Le Moal wrote:
>> On 6/24/23 03:19, Lorenz Brun wrote:
>>> Since commit 058e55e120ca which fixed that commands without valid
>>> error/status codes did not result in any sense error, the returned sense
>>> errors were completely bogus as ata_to_sense_error did not have valid
>>> inputs in the first place.
>>>
>>> For example the following ATA error
>>>
>>> exception Emask 0x10 SAct 0x20c000 SErr 0x280100 action 0x6 frozen
>>> irq_stat 0x08000000, interface fatal error
>>> SError: { UnrecovData 10B8B BadCRC }
>>> failed command: READ FPDMA QUEUED
>>> cmd 60/e0:70:20:0a:00/00:00:00:00:00/40 tag 14 ncq dma 114688 in
>>> res 40/00:ac:20:5e:50/00:00:5d:01:00/40 Emask 0x10 (ATA bus error)
>>> status: { DRDY }
>>>
>>> got turned into the following nonsensical SCSI error
>>>
>>> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>> Sense Key : Illegal Request [current]
>>> Add. Sense: Unaligned write command
>>> CDB: Read(16) 88 00 00 00 00 00 00 00 0a 20 00 00 00 e0 00 00
>>>
>>> This has nothing to do with an unaligned write command, but is due to an
>>> ATA EH-triggered abort. But ata_to_sense_error only knows about
>>> status and error, both of which aren't even valid here as the command
>>> has been aborted.
>>>
>>> Add an additional section to ata_gen_ata_sense which handles
>>> errors not coming from the device first, before calling into
>>> ata_to_sense_error.
>>>
>>> According to the SAT-5 spec a reset should cause a Unit Attention event,
>>> which the SCSI subsystem should handle to retry its commands but I
>>> am not sure how much of that infra is present in Linux's SCSI layer, so
>>> this is a simpler solution.
>>>
>>> Signed-off-by: Lorenz Brun <[email protected]>
>>> ---
>>> drivers/ata/libata-scsi.c | 16 ++++++++++++++++
>>> 1 file changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>>> index 551077cea4e4..61c6a4e8123a 100644
>>> --- a/drivers/ata/libata-scsi.c
>>> +++ b/drivers/ata/libata-scsi.c
>>> @@ -13,6 +13,7 @@
>>> * - http://www.t13.org/
>>> */
>>>
>>> +#include "scsi/scsi_proto.h"
>>> #include <linux/compat.h>
>>> #include <linux/slab.h>
>>> #include <linux/kernel.h>
>>> @@ -1013,6 +1014,21 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
>>> ata_scsi_set_sense(dev, cmd, NOT_READY, 0x04, 0x21);
>>> return;
>>> }
>>> + if (qc->err_mask & (AC_ERR_HSM | AC_ERR_ATA_BUS | AC_ERR_HOST_BUS |
>>> + AC_ERR_SYSTEM | AC_ERR_OTHER)) {
>>
>> Did you check SATA IO specs and/or AHCI to see if that says anything about these
>> ? And I wonder if we should check if we have something in tf->status and
>> tf->error...
>>
> We really should. The above combination of error masks seems pretty
> arbitrary, as actually you do _not_ want to check for there error mask,
> but rather for the fact that the sense code is bogus.
> So shouldn't we rather test for that one directly?
>
>>> + /* Command aborted because of some issue with the ATA subsystem
>>> + * Should technically cause unit attention, but this is better
>>> + * than nothing, which results in nonsensical errors.
>>> + * POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
>>> + */
>>
>> Multi-line comment style: start with a "/*" line please. The phrasing of the
>> comment is not very clear. Maybe something like:
>>
>> /*
>> * If the command aborted because of some issue with the
>> * adapter or link, report a POWER ON, RESET, OR BUS DEVICE
>> * RESET OCCURRED error.
>> */
>>
>> Did you check that all of the above error flags lead to a drive reset ? The
>> issue I have with this is that the drive reset is triggered by libata EH after
>> it got these bad errors but the sense data you use here normally indicate that
>> the reset was initiated by the adapter or the drive. Not sure this is ideal.
>>
> Yes. We should rather return a DID_RESET status instead of any made up
> sense code for which we don't have a good justification, and which might
> get invalidated by future SATL versions.

Need to check what SAT says about all this as well... Will do some spec reading.
The default sense code for when we cannot figure out proper sense based on
status and error is ABORTED COMMAND/0/0. That seems good enough to me. The issue
may be that we do not reach that default case when we have bogus status/error.
The test:

if (qc->err_mask ||
tf->status & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {

in ata_gen_ata_sense() is sketchy... Why is that qc->err_mask there ? It does
not seem needed at all. If removed, then we would fall back to aborted command
default.

>
> Cheers,
>
> Hannes

--
Damien Le Moal
Western Digital Research


2023-06-26 08:14:03

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] ata: libata-scsi: fix bogus SCSI sense after abort

On 6/26/23 09:29, Damien Le Moal wrote:
> On 6/24/23 03:19, Lorenz Brun wrote:
>> Since commit 058e55e120ca which fixed that commands without valid
>> error/status codes did not result in any sense error, the returned sense
>> errors were completely bogus as ata_to_sense_error did not have valid
>> inputs in the first place.
>>
>> For example the following ATA error
>>
>> exception Emask 0x10 SAct 0x20c000 SErr 0x280100 action 0x6 frozen
>> irq_stat 0x08000000, interface fatal error
>> SError: { UnrecovData 10B8B BadCRC }
>> failed command: READ FPDMA QUEUED
>> cmd 60/e0:70:20:0a:00/00:00:00:00:00/40 tag 14 ncq dma 114688 in
>> res 40/00:ac:20:5e:50/00:00:5d:01:00/40 Emask 0x10 (ATA bus error)
>> status: { DRDY }
>>
>> got turned into the following nonsensical SCSI error
>>
>> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>> Sense Key : Illegal Request [current]
>> Add. Sense: Unaligned write command
>> CDB: Read(16) 88 00 00 00 00 00 00 00 0a 20 00 00 00 e0 00 00
>>
>> This has nothing to do with an unaligned write command, but is due to an
>> ATA EH-triggered abort. But ata_to_sense_error only knows about
>> status and error, both of which aren't even valid here as the command
>> has been aborted.
>>
>> Add an additional section to ata_gen_ata_sense which handles
>> errors not coming from the device first, before calling into
>> ata_to_sense_error.
>>
>> According to the SAT-5 spec a reset should cause a Unit Attention event,
>> which the SCSI subsystem should handle to retry its commands but I
>> am not sure how much of that infra is present in Linux's SCSI layer, so
>> this is a simpler solution.
>>
>> Signed-off-by: Lorenz Brun <[email protected]>
>> ---
>> drivers/ata/libata-scsi.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index 551077cea4e4..61c6a4e8123a 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -13,6 +13,7 @@
>> * - http://www.t13.org/
>> */
>>
>> +#include "scsi/scsi_proto.h"
>> #include <linux/compat.h>
>> #include <linux/slab.h>
>> #include <linux/kernel.h>
>> @@ -1013,6 +1014,21 @@ static void ata_gen_ata_sense(struct ata_queued_cmd *qc)
>> ata_scsi_set_sense(dev, cmd, NOT_READY, 0x04, 0x21);
>> return;
>> }
>> + if (qc->err_mask & (AC_ERR_HSM | AC_ERR_ATA_BUS | AC_ERR_HOST_BUS |
>> + AC_ERR_SYSTEM | AC_ERR_OTHER)) {
>
> Did you check SATA IO specs and/or AHCI to see if that says anything about these
> ? And I wonder if we should check if we have something in tf->status and
> tf->error...
>
We really should. The above combination of error masks seems pretty
arbitrary, as actually you do _not_ want to check for there error mask,
but rather for the fact that the sense code is bogus.
So shouldn't we rather test for that one directly?

>> + /* Command aborted because of some issue with the ATA subsystem
>> + * Should technically cause unit attention, but this is better
>> + * than nothing, which results in nonsensical errors.
>> + * POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
>> + */
>
> Multi-line comment style: start with a "/*" line please. The phrasing of the
> comment is not very clear. Maybe something like:
>
> /*
> * If the command aborted because of some issue with the
> * adapter or link, report a POWER ON, RESET, OR BUS DEVICE
> * RESET OCCURRED error.
> */
>
> Did you check that all of the above error flags lead to a drive reset ? The
> issue I have with this is that the drive reset is triggered by libata EH after
> it got these bad errors but the sense data you use here normally indicate that
> the reset was initiated by the adapter or the drive. Not sure this is ideal.
>
Yes. We should rather return a DID_RESET status instead of any made up
sense code for which we don't have a good justification, and which might
get invalidated by future SATL versions.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


2023-06-29 19:24:44

by Lorenz Brun

[permalink] [raw]
Subject: Re: [PATCH] ata: libata-scsi: fix bogus SCSI sense after abort


Am Mo, 26. Jun 2023 um 16:51:47 +09:00:00 schrieb Damien Le Moal
<[email protected]>:
> On 6/26/23 16:46, Hannes Reinecke wrote:
>> On 6/26/23 09:29, Damien Le Moal wrote:
>>> On 6/24/23 03:19, Lorenz Brun wrote:
>>>> Since commit 058e55e120ca which fixed that commands without valid
>>>> error/status codes did not result in any sense error, the
>>>> returned sense
>>>> errors were completely bogus as ata_to_sense_error did not have
>>>> valid
>>>> inputs in the first place.
>>>>
>>>> For example the following ATA error
>>>>
>>>> exception Emask 0x10 SAct 0x20c000 SErr 0x280100 action 0x6 frozen
>>>> irq_stat 0x08000000, interface fatal error
>>>> SError: { UnrecovData 10B8B BadCRC }
>>>> failed command: READ FPDMA QUEUED
>>>> cmd 60/e0:70:20:0a:00/00:00:00:00:00/40 tag 14 ncq dma 114688 in
>>>> res 40/00:ac:20:5e:50/00:00:5d:01:00/40 Emask 0x10 (ATA bus error)
>>>> status: { DRDY }
>>>>
>>>> got turned into the following nonsensical SCSI error
>>>>
>>>> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>>> Sense Key : Illegal Request [current]
>>>> Add. Sense: Unaligned write command
>>>> CDB: Read(16) 88 00 00 00 00 00 00 00 0a 20 00 00 00 e0 00 00
>>>>
>>>> This has nothing to do with an unaligned write command, but is
>>>> due to an
>>>> ATA EH-triggered abort. But ata_to_sense_error only knows about
>>>> status and error, both of which aren't even valid here as the
>>>> command
>>>> has been aborted.
>>>>
>>>> Add an additional section to ata_gen_ata_sense which handles
>>>> errors not coming from the device first, before calling into
>>>> ata_to_sense_error.
>>>>
>>>> According to the SAT-5 spec a reset should cause a Unit Attention
>>>> event,
>>>> which the SCSI subsystem should handle to retry its commands but I
>>>> am not sure how much of that infra is present in Linux's SCSI
>>>> layer, so
>>>> this is a simpler solution.
>>>>
>>>> Signed-off-by: Lorenz Brun <[email protected]>
>>>> ---
>>>> drivers/ata/libata-scsi.c | 16 ++++++++++++++++
>>>> 1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>>>> index 551077cea4e4..61c6a4e8123a 100644
>>>> --- a/drivers/ata/libata-scsi.c
>>>> +++ b/drivers/ata/libata-scsi.c
>>>> @@ -13,6 +13,7 @@
>>>> * - http://www.t13.org/
>>>> */
>>>>
>>>> +#include "scsi/scsi_proto.h"
>>>> #include <linux/compat.h>
>>>> #include <linux/slab.h>
>>>> #include <linux/kernel.h>
>>>> @@ -1013,6 +1014,21 @@ static void ata_gen_ata_sense(struct
>>>> ata_queued_cmd *qc)
>>>> ata_scsi_set_sense(dev, cmd, NOT_READY, 0x04, 0x21);
>>>> return;
>>>> }
>>>> + if (qc->err_mask & (AC_ERR_HSM | AC_ERR_ATA_BUS |
>>>> AC_ERR_HOST_BUS |
>>>> + AC_ERR_SYSTEM | AC_ERR_OTHER)) {
>>>
>>> Did you check SATA IO specs and/or AHCI to see if that says
>>> anything about these
>>> ? And I wonder if we should check if we have something in
>>> tf->status and
>>> tf->error...
>>>
>> We really should. The above combination of error masks seems pretty
>> arbitrary, as actually you do _not_ want to check for there error
>> mask,
>> but rather for the fact that the sense code is bogus.
>> So shouldn't we rather test for that one directly?
SCISI / ATA Translation (SAT)-5 has the following to say about it:

A SATL that detects a link reset sequence for a Serial ATA device or
initiates any reset of an ATA device shall
establish a unit attention condition on behalf of the emulated logical
unit corresponding to the ATA device with
the sense key set to UNIT ATTENTION and the additional sense code set
to POWER ON, RESET, OR BUS
DEVICE RESET OCCURRED for the SCSI initiator port associated with each
I_T nexus. The method a SATL
uses to detect a link reset sequence on the SATA link is vendor
specific.

I am however not sufficiently familiar with SCSI and especially the
SCSI midlayer in Linux to know how this Unit Attention causes the
requests to be aborted.

As a simpler fix I checked which err_maks cause a reset and in that
case report this error. I'd be happier with the proper SAT-5 solution,
but it looks like it would be quite a significant change.

>>
>>>> + /* Command aborted because of some issue with the ATA subsystem
>>>> + * Should technically cause unit attention, but this is better
>>>> + * than nothing, which results in nonsensical errors.
>>>> + * POWER ON, RESET, OR BUS DEVICE RESET OCCURRED
>>>> + */
>>>
>>> Multi-line comment style: start with a "/*" line please. The
>>> phrasing of the
>>> comment is not very clear. Maybe something like:
>>>
>>> /*
>>> * If the command aborted because of some issue with the
>>> * adapter or link, report a POWER ON, RESET, OR BUS DEVICE
>>> * RESET OCCURRED error.
>>> */
>>>
>>> Did you check that all of the above error flags lead to a drive
>>> reset ? The
>>> issue I have with this is that the drive reset is triggered by
>>> libata EH after
>>> it got these bad errors but the sense data you use here normally
>>> indicate that
>>> the reset was initiated by the adapter or the drive. Not sure this
>>> is ideal.
>>>
>> Yes. We should rather return a DID_RESET status instead of any made
>> up
>> sense code for which we don't have a good justification, and which
>> might
>> get invalidated by future SATL versions.
>
> Need to check what SAT says about all this as well... Will do some
> spec reading.
> The default sense code for when we cannot figure out proper sense
> based on
> status and error is ABORTED COMMAND/0/0. That seems good enough to
> me. The issue
> may be that we do not reach that default case when we have bogus
> status/error.
> The test:
>
> if (qc->err_mask ||
> tf->status & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {
>
> in ata_gen_ata_sense() is sketchy... Why is that qc->err_mask there ?
> It does
> not seem needed at all. If removed, then we would fall back to
> aborted command
> default.
From reading SAT-5 I think most of this function is bogus, but I don't
know enough about the pieces involved to come up with the "correct" fix.
>
Regards,
Lorenz




2023-07-03 09:27:16

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] ata: libata-scsi: fix bogus SCSI sense after abort

On 6/30/23 04:11, Lorenz Brun wrote:
>
> Am Mo, 26. Jun 2023 um 16:51:47 +09:00:00 schrieb Damien Le Moal
> <[email protected]>:
>> On 6/26/23 16:46, Hannes Reinecke wrote:
>>> On 6/26/23 09:29, Damien Le Moal wrote:
>>>> On 6/24/23 03:19, Lorenz Brun wrote:
>>>>> Since commit 058e55e120ca which fixed that commands without valid
>>>>> error/status codes did not result in any sense error, the
>>>>> returned sense
>>>>> errors were completely bogus as ata_to_sense_error did not have
>>>>> valid
>>>>> inputs in the first place.
>>>>>
>>>>> For example the following ATA error
>>>>>
>>>>> exception Emask 0x10 SAct 0x20c000 SErr 0x280100 action 0x6 frozen
>>>>> irq_stat 0x08000000, interface fatal error
>>>>> SError: { UnrecovData 10B8B BadCRC }
>>>>> failed command: READ FPDMA QUEUED
>>>>> cmd 60/e0:70:20:0a:00/00:00:00:00:00/40 tag 14 ncq dma 114688 in
>>>>> res 40/00:ac:20:5e:50/00:00:5d:01:00/40 Emask 0x10 (ATA bus error)
>>>>> status: { DRDY }

If the command error field is non-zero, there should be a line printed here
"error: { ... }". See the end of the function ata_eh_link_report(). Given that
you do not seem to be getting this message, I assume it is 0. However, your
status field only has ATA_DRDY set (0x40). With that, the function
ata_to_sense_error() hits the second entry of stat_table, which is:

{0x40, ILLEGAL_REQUEST, 0x21, 0x04},
// Device ready, unaligned write command

(This table comments are extremely confusing and need cleanup)

Hence the sense data error that you see. And if you check SAT section 11.6 (ATA
Fixed error translation), you will see that this case is not defined in that
table. In fact, only entry 4 is defined:

{0x04, RECOVERED_ERROR, 0x11, 0x00},
// Recovered ECC error Medium error, recovered

The sense_table table cover the cases for the error field bits, separated from
the status field bits, which is unlike what SAT defines. Not sure where these
definitions come from, or if they are accumulated "black magic" from decades of
dealing with weird ATA devices and adapters. Need to dig on that.

[...]

>>>> Did you check SATA IO specs and/or AHCI to see if that says
>>>> anything about these
>>>> ? And I wonder if we should check if we have something in
>>>> tf->status and
>>>> tf->error...
>>>>
>>> We really should. The above combination of error masks seems pretty
>>> arbitrary, as actually you do _not_ want to check for there error
>>> mask,
>>> but rather for the fact that the sense code is bogus.
>>> So shouldn't we rather test for that one directly?
> SCISI / ATA Translation (SAT)-5 has the following to say about it:
>
> A SATL that detects a link reset sequence for a Serial ATA device or
> initiates any reset of an ATA device shall
> establish a unit attention condition on behalf of the emulated logical
> unit corresponding to the ATA device with
> the sense key set to UNIT ATTENTION and the additional sense code set
> to POWER ON, RESET, OR BUS
> DEVICE RESET OCCURRED for the SCSI initiator port associated with each
> I_T nexus. The method a SATL
> uses to detect a link reset sequence on the SATA link is vendor
> specific.
>
> I am however not sufficiently familiar with SCSI and especially the
> SCSI midlayer in Linux to know how this Unit Attention causes the
> requests to be aborted.

Yes, found that text as well. But my thiniking is that strictly speaking, the
reset is not initiated by the translation code. It is libata-eh that decide that
the bad CRC error you are getting is serious enough to warrant a reset. The
point remain that for that command, the adpater returned an error field not
having any of the ATA_ICRC | ATA_UNC | ATA_AMNF | ATA_IDNF | ATA_ABORTED bits
set, and status field only having ATA_DRDY set. Which is weird. I would expect
ATA_ICRC to be set in error and ATA_ERR to be set in status.

> As a simpler fix I checked which err_maks cause a reset and in that
> case report this error. I'd be happier with the proper SAT-5 solution,
> but it looks like it would be quite a significant change.

I have a setup that returns:

[ 1531.680502] ata4: SError: { Proto TrStaTrns UnrecFIS }

due to bad data transmission over the wire. But for these failures, I get:

[ 1531.728008] ata4.00: status: { DRDY SENSE ERR }
[ 1531.737741] ata4.00: error: { ICRC ABRT }

Which gives me a hit on the sense_table entry:

/* ICRC|ABRT */ /* NB: ICRC & !ABRT is BBD */
{0x84, ABORTED_COMMAND, 0x47, 0x00},

SO an "internal target failure". Which makes more sense than the unaligned write
error. Not sure why your setup is not more verbose with error and status. What
adapter are you using ? AHCI ? If yes, which one ? (lspci)

Need to dig further in SATA-IO to see if I can find more hints about how to
handle this case.

[...]

>> Need to check what SAT says about all this as well... Will do some
>> spec reading.
>> The default sense code for when we cannot figure out proper sense
>> based on
>> status and error is ABORTED COMMAND/0/0. That seems good enough to
>> me. The issue
>> may be that we do not reach that default case when we have bogus
>> status/error.
>> The test:
>>
>> if (qc->err_mask ||
>> tf->status & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {
>>
>> in ata_gen_ata_sense() is sketchy... Why is that qc->err_mask there ?
>> It does
>> not seem needed at all. If removed, then we would fall back to
>> aborted command
>> default.
> From reading SAT-5 I think most of this function is bogus, but I don't
> know enough about the pieces involved to come up with the "correct" fix.

I would not put it as "bogus". It is certainly hard to map this to the specs,
but I suspect a lot of that is historical reason. Need to dig in git history,
but most of this may actually predate git :)

Let me dig further to see if I can come up with something. Can you reliably
trigger these errors for testing any patch ?

--
Damien Le Moal
Western Digital Research