2022-07-22 11:31:23

by John Garry

[permalink] [raw]
Subject: [PATCH 0/6] libsas and drivers: NCQ error handling

As reported in [0], the pm8001 driver NCQ error handling more or less
duplicates what libata does in link error handling, as follows:
- abort all commands
- do autopsy with read log ext 10 command
- reset the target to recover

Indeed for the hisi_sas driver we want to add similar handling for NCQ
errors.

This series add a new libsas API - sas_ata_link_abort() - to handle host
NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
mentioned in the pm8001 changeover patch, I would prefer a better place to
locate the SATA ABORT command (rather that nexus reset callback).

I would appreciate some testing of the pm8001 change as the read log ext10
command mostly hangs on my arm64 machine - these arm64 hangs are a known
issue.

Finally with these changes we can make the libsas task alloc/free APIs
private, which they should always have been.

Based on v5.19-rc6

[0] https://lore.kernel.org/linux-scsi/[email protected]/

John Garry (5):
scsi: pm8001: Modify task abort handling for SATA task
scsi: libsas: Add sas_ata_link_abort()
scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors
scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private

Xingui Yang (1):
scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw

drivers/scsi/hisi_sas/hisi_sas_main.c | 5 +-
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++-
drivers/scsi/libsas/sas_ata.c | 10 ++
drivers/scsi/libsas/sas_init.c | 3 -
drivers/scsi/libsas/sas_internal.h | 4 +
drivers/scsi/pm8001/pm8001_hwi.c | 194 +++++++------------------
drivers/scsi/pm8001/pm8001_sas.c | 13 ++
drivers/scsi/pm8001/pm8001_sas.h | 8 +-
drivers/scsi/pm8001/pm80xx_hwi.c | 177 ++--------------------
include/scsi/libsas.h | 4 -
include/scsi/sas_ata.h | 5 +
11 files changed, 132 insertions(+), 313 deletions(-)

--
2.35.3


2022-07-22 11:31:30

by John Garry

[permalink] [raw]
Subject: [PATCH 1/6] scsi: pm8001: Modify task abort handling for SATA task

When we try to abort a SATA task, the CCB of the task which we are trying
to avoid may still complete. In this case, we should not touch the task
associated with that CCB as we can race with libsas freeing the last later
in sas_eh_handle_sas_errors() -> sas_eh_finish_cmd() for when
TASK_IS_ABORTED is returned from sas_scsi_find_task()

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/pm8001/pm8001_hwi.c | 15 +++++++++++++--
drivers/scsi/pm8001/pm8001_sas.c | 8 ++++++++
drivers/scsi/pm8001/pm80xx_hwi.c | 14 ++++++++++----
3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index f7466a895d3b..6a88255cb3c3 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -2291,7 +2291,9 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha, void *piomb)
if (t->dev && (t->dev->lldd_dev))
pm8001_dev = t->dev->lldd_dev;
} else {
- pm8001_dbg(pm8001_ha, FAIL, "task null\n");
+ pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+ ccb->ccb_tag);
+ pm8001_ccb_free(pm8001_ha, ccb);
return;
}

@@ -2671,8 +2673,17 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha, void *piomb)
pm8001_dev = ccb->device;
if (event)
pm8001_dbg(pm8001_ha, FAIL, "sata IO status 0x%x\n", event);
- if (unlikely(!t || !t->lldd_task || !t->dev))
+
+ if (unlikely(!t)) {
+ pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+ ccb->ccb_tag);
+ pm8001_ccb_free(pm8001_ha, ccb);
return;
+ }
+
+ if (unlikely(!t->lldd_task || !t->dev))
+ return;
+
ts = &t->task_status;
pm8001_dbg(pm8001_ha, DEVIO,
"port_id:0x%x, device_id:0x%x, tag:0x%x, event:0x%x\n",
diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index e8ffe56228a9..a2ccb3484f90 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -980,6 +980,7 @@ int pm8001_query_task(struct sas_task *task)
/* mandatory SAM-3, still need free task/ccb info, abort the specified task */
int pm8001_abort_task(struct sas_task *task)
{
+ struct pm8001_ccb_info *ccb = task->lldd_task;
unsigned long flags;
u32 tag;
struct domain_device *dev ;
@@ -1110,6 +1111,13 @@ int pm8001_abort_task(struct sas_task *task)
pm8001_dev, DS_OPERATIONAL);
wait_for_completion(&completion);
} else {
+ /*
+ * Ensure that if we see a completion for the ccb
+ * associated with the task we are trying to abort then
+ * we should not touch the sas_task as it may race with
+ * libsas freeing it when return here.
+ */
+ ccb->task = NULL;
ret = sas_execute_internal_abort_single(dev, tag, 0, NULL);
}
rc = TMF_RESP_FUNC_COMPLETE;
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index 01c5e8ff4cc5..d55f8b1c236d 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -2390,7 +2390,9 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha,
if (t->dev && (t->dev->lldd_dev))
pm8001_dev = t->dev->lldd_dev;
} else {
- pm8001_dbg(pm8001_ha, FAIL, "task null\n");
+ pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+ ccb->ccb_tag);
+ pm8001_ccb_free(pm8001_ha, ccb);
return;
}

@@ -2807,12 +2809,16 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha,
ccb = &pm8001_ha->ccb_info[tag];
t = ccb->task;
pm8001_dev = ccb->device;
-
- if (unlikely(!t || !t->lldd_task || !t->dev)) {
- pm8001_dbg(pm8001_ha, FAIL, "task or dev null\n");
+ if (unlikely(!t)) {
+ pm8001_dbg(pm8001_ha, FAIL, "task null, freeing CCB tag %d\n",
+ ccb->ccb_tag);
+ pm8001_ccb_free(pm8001_ha, ccb);
return;
}

+ if (unlikely(!t->lldd_task || !t->dev))
+ return;
+
ts = &t->task_status;
pm8001_dbg(pm8001_ha, IOERR, "port_id:0x%x, tag:0x%x, event:0x%x\n",
port_id, tag, event);
--
2.35.3

2022-07-22 11:42:07

by John Garry

[permalink] [raw]
Subject: [PATCH 6/6] scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private

We have no users outside libsas any longer, so make sas_alloc_task(),
sas_alloc_slow_task(), and sas_free_task() private.

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/libsas/sas_init.c | 3 ---
drivers/scsi/libsas/sas_internal.h | 4 ++++
include/scsi/libsas.h | 4 ----
3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index dc35f0f8eae3..fad4fb731bf3 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -35,7 +35,6 @@ struct sas_task *sas_alloc_task(gfp_t flags)

return task;
}
-EXPORT_SYMBOL_GPL(sas_alloc_task);

struct sas_task *sas_alloc_slow_task(gfp_t flags)
{
@@ -56,7 +55,6 @@ struct sas_task *sas_alloc_slow_task(gfp_t flags)

return task;
}
-EXPORT_SYMBOL_GPL(sas_alloc_slow_task);

void sas_free_task(struct sas_task *task)
{
@@ -65,7 +63,6 @@ void sas_free_task(struct sas_task *task)
kmem_cache_free(sas_task_cache, task);
}
}
-EXPORT_SYMBOL_GPL(sas_free_task);

/*------------ SAS addr hash -----------*/
void sas_hash_addr(u8 *hashed, const u8 *sas_addr)
diff --git a/drivers/scsi/libsas/sas_internal.h b/drivers/scsi/libsas/sas_internal.h
index 13d0ffaada93..19856e107ebd 100644
--- a/drivers/scsi/libsas/sas_internal.h
+++ b/drivers/scsi/libsas/sas_internal.h
@@ -52,6 +52,10 @@ void sas_unregister_phys(struct sas_ha_struct *sas_ha);
struct asd_sas_event *sas_alloc_event(struct asd_sas_phy *phy, gfp_t gfp_flags);
void sas_free_event(struct asd_sas_event *event);

+struct sas_task *sas_alloc_task(gfp_t flags);
+struct sas_task *sas_alloc_slow_task(gfp_t flags);
+void sas_free_task(struct sas_task *task);
+
int sas_register_ports(struct sas_ha_struct *sas_ha);
void sas_unregister_ports(struct sas_ha_struct *sas_ha);

diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index ff04eb6d250b..1c45575d1082 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -639,10 +639,6 @@ struct sas_task_slow {
#define SAS_TASK_STATE_ABORTED 4
#define SAS_TASK_NEED_DEV_RESET 8

-extern struct sas_task *sas_alloc_task(gfp_t flags);
-extern struct sas_task *sas_alloc_slow_task(gfp_t flags);
-extern void sas_free_task(struct sas_task *task);
-
static inline bool sas_is_internal_abort(struct sas_task *task)
{
return task->task_proto == SAS_PROTOCOL_INTERNAL_ABORT;
--
2.35.3

2022-07-22 11:42:19

by John Garry

[permalink] [raw]
Subject: [PATCH 2/6] scsi: libsas: Add sas_ata_link_abort()

Similar to how AHCI handles NCQ errors in ahci_error_intr() ->
ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs
to call to initiate a link abort.

This will mark all outstanding QCs as failed and kick-off EH.

Suggested-by: Damien Le Moal <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/libsas/sas_ata.c | 10 ++++++++++
include/scsi/sas_ata.h | 5 +++++
2 files changed, 15 insertions(+)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index d35c9296f738..5a68197861bf 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -861,6 +861,16 @@ void sas_ata_wait_eh(struct domain_device *dev)
ata_port_wait_eh(ap);
}

+void sas_ata_link_abort(struct domain_device *device)
+{
+ struct ata_port *ap = device->sata_dev.ap;
+ struct ata_link *link = &ap->link;
+
+ link->eh_info.err_mask |= AC_ERR_DEV;
+ ata_link_abort(link);
+}
+EXPORT_SYMBOL_GPL(sas_ata_link_abort);
+
int sas_execute_ata_cmd(struct domain_device *device, u8 *fis, int force_phy_id)
{
struct sas_tmf_task tmf_task = {};
diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h
index a1df4f9d57a3..e53aa21b590d 100644
--- a/include/scsi/sas_ata.h
+++ b/include/scsi/sas_ata.h
@@ -32,6 +32,7 @@ void sas_probe_sata(struct asd_sas_port *port);
void sas_suspend_sata(struct asd_sas_port *port);
void sas_resume_sata(struct asd_sas_port *port);
void sas_ata_end_eh(struct ata_port *ap);
+void sas_ata_link_abort(struct domain_device *dev);
int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
int force_phy_id);
int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline);
@@ -87,6 +88,10 @@ static inline void sas_ata_end_eh(struct ata_port *ap)
{
}

+static inline void sas_ata_link_abort(struct domain_device *dev)
+{
+}
+
static inline int sas_execute_ata_cmd(struct domain_device *device, u8 *fis,
int force_phy_id)
{
--
2.35.3

2022-07-22 11:42:34

by John Garry

[permalink] [raw]
Subject: [PATCH 5/6] scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw

From: Xingui Yang <[email protected]>

When CQ header dw3 SATA_DISK_ERR is set it means this SATA disk is in error
state and the current IPTT is invalid. An invalid IPTT does not correspond
to any slot.

In this scenario, new I/Os that delivered to disk will be rejected by the,
controller and all I/Os remained on the disk should be aborted, which we
add here with the sas_ata_link_abort() call.

Signed-off-by: Xingui Yang <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index eb86afb21aab..7be3ea23f170 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -423,6 +423,8 @@
#define CMPLT_HDR_DEV_ID_OFF 16
#define CMPLT_HDR_DEV_ID_MSK (0xffff << CMPLT_HDR_DEV_ID_OFF)
/* dw3 */
+#define CMPLT_HDR_SATA_DISK_ERR_OFF 16
+#define CMPLT_HDR_SATA_DISK_ERR_MSK (0x1 << CMPLT_HDR_SATA_DISK_ERR_OFF)
#define CMPLT_HDR_IO_IN_TARGET_OFF 17
#define CMPLT_HDR_IO_IN_TARGET_MSK (0x1 << CMPLT_HDR_IO_IN_TARGET_OFF)

@@ -2379,14 +2381,30 @@ static irqreturn_t cq_thread_v3_hw(int irq_no, void *p)
while (rd_point != wr_point) {
struct hisi_sas_complete_v3_hdr *complete_hdr;
struct device *dev = hisi_hba->dev;
- u32 dw1;
+ u32 dw0, dw1, dw3;
int iptt;

complete_hdr = &complete_queue[rd_point];
+ dw0 = le32_to_cpu(complete_hdr->dw0);
dw1 = le32_to_cpu(complete_hdr->dw1);
+ dw3 = le32_to_cpu(complete_hdr->dw3);

iptt = dw1 & CMPLT_HDR_IPTT_MSK;
- if (likely(iptt < HISI_SAS_COMMAND_ENTRIES_V3_HW)) {
+ if (unlikely((dw0 & CMPLT_HDR_CMPLT_MSK) == 0x3) &&
+ (dw3 & CMPLT_HDR_SATA_DISK_ERR_MSK)) {
+ int device_id = (dw1 & CMPLT_HDR_DEV_ID_MSK) >>
+ CMPLT_HDR_DEV_ID_OFF;
+ struct hisi_sas_itct *itct =
+ &hisi_hba->itct[device_id];
+ struct hisi_sas_device *sas_dev =
+ &hisi_hba->devices[device_id];
+ struct domain_device *device = sas_dev->sas_device;
+
+ dev_err(dev, "erroneous completion disk err dev id=%d sas_addr=0x%llx CQ hdr: 0x%x 0x%x 0x%x 0x%x\n",
+ device_id, itct->sas_addr, dw0, dw1,
+ complete_hdr->act, dw3);
+ sas_ata_link_abort(device);
+ } else if (likely(iptt < HISI_SAS_COMMAND_ENTRIES_V3_HW)) {
slot = &hisi_hba->slot_info[iptt];
slot->cmplt_queue_slot = rd_point;
slot->cmplt_queue = queue;
--
2.35.3

2022-07-22 11:43:34

by John Garry

[permalink] [raw]
Subject: [PATCH 4/6] scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()

It is not necessary to issue an ATA softreset in hisi_sas_abort_task().
This is because in the libsas EH scheme we will defer to ATA EH which will
deal with any device reset during recovery.

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/hisi_sas/hisi_sas_main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 764e859d0106..00d73090e6d1 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -1601,13 +1601,16 @@ static int hisi_sas_abort_task(struct sas_task *task)
} else if (task->task_proto & SAS_PROTOCOL_SATA ||
task->task_proto & SAS_PROTOCOL_STP) {
if (task->dev->dev_type == SAS_SATA_DEV) {
+ struct hisi_sas_slot *slot = task->lldd_task;
+
+ slot->task = NULL;
rc = hisi_sas_internal_task_abort_dev(sas_dev, false);
if (rc < 0) {
dev_err(dev, "abort task: internal abort failed\n");
goto out;
}
hisi_sas_dereg_device(hisi_hba, device);
- rc = hisi_sas_softreset_ata_disk(device);
+ rc = TMF_RESP_FUNC_COMPLETE;
}
} else if (task->lldd_task && task->task_proto & SAS_PROTOCOL_SMP) {
/* SMP */
--
2.35.3

2022-07-22 11:43:50

by John Garry

[permalink] [raw]
Subject: [PATCH RFT 3/6] scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors

In commit c6b9ef5779c3 ("[SCSI] pm80xx: NCQ error handling changes") the
driver had support added to handle NCQ errors but much of what is done
in this handling is duplicated from the libata EH.

In that named commit we handle in 2x main steps:
a. Issue read log ext10 to examine and clear the errors
b. Issue SATA_ABORT command

Indeed, in libata EH, we do similar to above:
a. ata_do_eh() -> ata_eh_autopsy() -> ata_eh_link_autopsy() ->#
ata_eh_analyze_ncq_error() -> ata_eh_read_log_10h()
b. ata_do_eh() -> ata_eh_recover() which will issue a device soft reset
or hard reset

Since there is so much duplication, use sas_ata_link_abort() which will
abort all pending IOs and kick off ATA EH which will do the steps, above.

The pm8001_send_abort_all() call to issue the SATA_ABORT command is added
in pm8001_I_T_nexus_reset() [which is called from ata_eh_recover() to
issue the device reset] as there is currently no better place to locate it.
In future if we add ata_port_operations.softreset callback implementation
for libsas then that may be better.

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/pm8001/pm8001_hwi.c | 179 ++++++++-----------------------
drivers/scsi/pm8001/pm8001_sas.c | 5 +
drivers/scsi/pm8001/pm8001_sas.h | 8 +-
drivers/scsi/pm8001/pm80xx_hwi.c | 163 ++--------------------------
4 files changed, 58 insertions(+), 297 deletions(-)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index 6a88255cb3c3..bb5c0a746fff 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -1720,7 +1720,15 @@ void pm8001_work_fn(struct work_struct *work)
pm8001_free_dev(pm8001_dev);
}
}
- } break;
+ }
+ break;
+ case IO_XFER_ERROR_ABORTED_NCQ_MODE:
+ {
+ pm8001_dev->id |= NCQ_ERR_FLAG;
+ dev = pm8001_dev->sas_device;
+ sas_ata_link_abort(dev);
+ }
+ break;
}
kfree(pw);
}
@@ -1744,108 +1752,47 @@ int pm8001_handle_event(struct pm8001_hba_info *pm8001_ha, void *data,
return ret;
}

-static void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha,
- struct pm8001_device *pm8001_ha_dev)
+void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha,
+ struct pm8001_device *pm8001_dev)
{
+ DECLARE_COMPLETION_ONSTACK(completion);
struct pm8001_ccb_info *ccb;
- struct sas_task *task;
struct task_abort_req task_abort;
u32 opc = OPC_INB_SATA_ABORT;
+ struct domain_device *dev = pm8001_dev->sas_device;
int ret;

- pm8001_ha_dev->id |= NCQ_ABORT_ALL_FLAG;
- pm8001_ha_dev->id &= ~NCQ_READ_LOG_FLAG;
-
- task = sas_alloc_slow_task(GFP_ATOMIC);
- if (!task) {
- pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task\n");
- return;
- }
-
- task->task_done = pm8001_task_done;
-
- ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
- if (!ccb) {
- sas_free_task(task);
+ ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_dev, NULL);
+ if (!ccb)
return;
- }

memset(&task_abort, 0, sizeof(task_abort));
task_abort.abort_all = cpu_to_le32(1);
- task_abort.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
+ task_abort.device_id = cpu_to_le32(pm8001_dev->device_id);
task_abort.tag = cpu_to_le32(ccb->ccb_tag);
-
+ pm8001_dev->sata_abort_all_completion = &completion;
+ pm8001_dev->id |= NCQ_ERR_ABORT_ALL_FLAG;
ret = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &task_abort,
sizeof(task_abort), 0);
if (ret) {
- sas_free_task(task);
+ pm8001_dev->sata_abort_all_completion = NULL;
pm8001_ccb_free(pm8001_ha, ccb);
}
-}
-
-static void pm8001_send_read_log(struct pm8001_hba_info *pm8001_ha,
- struct pm8001_device *pm8001_ha_dev)
-{
- struct sata_start_req sata_cmd;
- int res;
- struct pm8001_ccb_info *ccb;
- struct sas_task *task = NULL;
- struct host_to_dev_fis fis;
- struct domain_device *dev;
- u32 opc = OPC_INB_SATA_HOST_OPSTART;
-
- task = sas_alloc_slow_task(GFP_ATOMIC);
- if (!task) {
- pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task !!!\n");
- return;
- }
- task->task_done = pm8001_task_done;

/*
- * Allocate domain device by ourselves as libsas is not going to
- * provide any.
+ * It's always wise to use a timeout when dealing with HW.
+ * The value is abritary, same as PM8001_TASK_TIMEOUT at time of
+ * writing.
*/
- dev = kzalloc(sizeof(struct domain_device), GFP_ATOMIC);
- if (!dev) {
- sas_free_task(task);
- pm8001_dbg(pm8001_ha, FAIL,
- "Domain device cannot be allocated\n");
- return;
- }
- task->dev = dev;
- task->dev->lldd_dev = pm8001_ha_dev;
-
- ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
- if (!ccb) {
- sas_free_task(task);
- kfree(dev);
- return;
- }
-
- pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG;
- pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG;
-
- /* construct read log FIS */
- memset(&fis, 0, sizeof(struct host_to_dev_fis));
- fis.fis_type = 0x27;
- fis.flags = 0x80;
- fis.command = ATA_CMD_READ_LOG_EXT;
- fis.lbal = 0x10;
- fis.sector_count = 0x1;
-
- memset(&sata_cmd, 0, sizeof(sata_cmd));
- sata_cmd.tag = cpu_to_le32(ccb->ccb_tag);
- sata_cmd.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
- sata_cmd.ncqtag_atap_dir_m = cpu_to_le32((0x1 << 7) | (0x5 << 9));
- memcpy(&sata_cmd.sata_fis, &fis, sizeof(struct host_to_dev_fis));
-
- res = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &sata_cmd,
- sizeof(sata_cmd), 0);
- if (res) {
- sas_free_task(task);
- pm8001_ccb_free(pm8001_ha, ccb);
- kfree(dev);
- }
+ ret = wait_for_completion_timeout(&completion, 20 * HZ);
+ if (!ret)
+ pm8001_dbg(pm8001_ha, FAIL, "Failed to send SATA ABORT ALL:%016llx\n",
+ SAS_ADDR(dev->sas_addr));
+
+ pm8001_dev->sata_abort_all_completion = NULL;
+ /* Clear the flag regardless - not much else which we can do */
+ pm8001_dev->id &= ~NCQ_ERR_FLAG;
+ pm8001_dev->id &= ~NCQ_ERR_ABORT_ALL_FLAG;
}

/**
@@ -2297,8 +2244,7 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha, void *piomb)
return;
}

- if ((pm8001_dev && !(pm8001_dev->id & NCQ_READ_LOG_FLAG))
- && unlikely(!t || !t->lldd_task || !t->dev)) {
+ if (pm8001_dev && unlikely(!t || !t->lldd_task || !t->dev)) {
pm8001_dbg(pm8001_ha, FAIL, "task or dev null\n");
return;
}
@@ -2356,15 +2302,6 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha, void *piomb)
if (param == 0) {
ts->resp = SAS_TASK_COMPLETE;
ts->stat = SAS_SAM_STAT_GOOD;
- /* check if response is for SEND READ LOG */
- if (pm8001_dev &&
- (pm8001_dev->id & NCQ_READ_LOG_FLAG)) {
- pm8001_send_abort_all(pm8001_ha, pm8001_dev);
- /* Free the tag */
- pm8001_tag_free(pm8001_ha, tag);
- sas_free_task(t);
- return;
- }
} else {
u8 len;
ts->resp = SAS_TASK_COMPLETE;
@@ -2662,9 +2599,10 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha, void *piomb)
if (event == IO_XFER_ERROR_ABORTED_NCQ_MODE) {
/* find device using device id */
pm8001_dev = pm8001_find_dev(pm8001_ha, dev_id);
- /* send read log extension */
if (pm8001_dev)
- pm8001_send_read_log(pm8001_ha, pm8001_dev);
+ pm8001_handle_event(pm8001_ha,
+ pm8001_dev,
+ IO_XFER_ERROR_ABORTED_NCQ_MODE);
return;
}

@@ -3628,6 +3566,12 @@ int pm8001_mpi_task_abort_resp(struct pm8001_hba_info *pm8001_ha, void *piomb)
t = ccb->task;
pm8001_dev = ccb->device; /* retrieve device */

+ if (pm8001_dev->id & NCQ_ERR_ABORT_ALL_FLAG) {
+ if (pm8001_dev->sata_abort_all_completion)
+ complete(pm8001_dev->sata_abort_all_completion);
+ return 0;
+ }
+
if (!t) {
pm8001_dbg(pm8001_ha, FAIL, " TASK NULL. RETURNING !!!\n");
return -1;
@@ -3654,12 +3598,7 @@ int pm8001_mpi_task_abort_resp(struct pm8001_hba_info *pm8001_ha, void *piomb)
pm8001_ccb_task_free(pm8001_ha, ccb);
mb();

- if (pm8001_dev->id & NCQ_ABORT_ALL_FLAG) {
- sas_free_task(t);
- pm8001_dev->id &= ~NCQ_ABORT_ALL_FLAG;
- } else {
- t->task_done(t);
- }
+ t->task_done(t);

return 0;
}
@@ -4211,7 +4150,6 @@ static int pm8001_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
u64 phys_addr;
u32 ATAP = 0x0;
u32 dir;
- unsigned long flags;
u32 opc = OPC_INB_SATA_HOST_OPSTART;

memset(&sata_cmd, 0, sizeof(sata_cmd));
@@ -4266,39 +4204,6 @@ static int pm8001_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
sata_cmd.esgl = 0;
}

- /* Check for read log for failed drive and return */
- if (sata_cmd.sata_fis.command == 0x2f) {
- if (((pm8001_ha_dev->id & NCQ_READ_LOG_FLAG) ||
- (pm8001_ha_dev->id & NCQ_ABORT_ALL_FLAG) ||
- (pm8001_ha_dev->id & NCQ_2ND_RLE_FLAG))) {
- struct task_status_struct *ts;
-
- pm8001_ha_dev->id &= 0xDFFFFFFF;
- ts = &task->task_status;
-
- spin_lock_irqsave(&task->task_state_lock, flags);
- ts->resp = SAS_TASK_COMPLETE;
- ts->stat = SAS_SAM_STAT_GOOD;
- task->task_state_flags &= ~SAS_TASK_STATE_PENDING;
- task->task_state_flags |= SAS_TASK_STATE_DONE;
- if (unlikely((task->task_state_flags &
- SAS_TASK_STATE_ABORTED))) {
- spin_unlock_irqrestore(&task->task_state_lock,
- flags);
- pm8001_dbg(pm8001_ha, FAIL,
- "task 0x%p resp 0x%x stat 0x%x but aborted by upper layer\n",
- task, ts->resp,
- ts->stat);
- pm8001_ccb_task_free(pm8001_ha, ccb);
- } else {
- spin_unlock_irqrestore(&task->task_state_lock,
- flags);
- pm8001_ccb_task_free_done(pm8001_ha, ccb);
- return 0;
- }
- }
- }
-
return pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &sata_cmd,
sizeof(sata_cmd), 0);
}
diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index a2ccb3484f90..1e0c0e6e0277 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -818,6 +818,11 @@ int pm8001_I_T_nexus_reset(struct domain_device *dev)

pm8001_dev = dev->lldd_dev;
pm8001_ha = pm8001_find_ha_by_dev(dev);
+
+ /* If in NCQ ABORT MODE then we need to issue a SATA_ABORT */
+ if (pm8001_dev->id & NCQ_ERR_FLAG)
+ pm8001_send_abort_all(pm8001_ha, pm8001_dev);
+
phy = sas_get_local_phy(dev);

if (dev_is_sata(dev)) {
diff --git a/drivers/scsi/pm8001/pm8001_sas.h b/drivers/scsi/pm8001/pm8001_sas.h
index 060ab680a7ed..b3eb4610336c 100644
--- a/drivers/scsi/pm8001/pm8001_sas.h
+++ b/drivers/scsi/pm8001/pm8001_sas.h
@@ -263,6 +263,7 @@ struct pm8001_device {
u32 id;
struct completion *dcompletion;
struct completion *setds_completion;
+ struct completion *sata_abort_all_completion;
u32 device_id;
atomic_t running_req;
};
@@ -576,9 +577,8 @@ struct pm8001_fw_image_header {
#define FLASH_UPDATE_DNLD_NOT_SUPPORTED 0x10
#define FLASH_UPDATE_DISABLED 0x11

-#define NCQ_READ_LOG_FLAG 0x80000000
-#define NCQ_ABORT_ALL_FLAG 0x40000000
-#define NCQ_2ND_RLE_FLAG 0x20000000
+#define NCQ_ERR_FLAG 0x20000000
+#define NCQ_ERR_ABORT_ALL_FLAG 0x40000000

/* Device states */
#define DS_OPERATIONAL 0x01
@@ -658,6 +658,8 @@ void pm8001_open_reject_retry(
int pm8001_mem_alloc(struct pci_dev *pdev, void **virt_addr,
dma_addr_t *pphys_addr, u32 *pphys_addr_hi, u32 *pphys_addr_lo,
u32 mem_size, u32 align);
+void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha,
+ struct pm8001_device *pm8001_ha_dev);

void pm8001_chip_iounmap(struct pm8001_hba_info *pm8001_ha);
int pm8001_mpi_build_cmd(struct pm8001_hba_info *pm8001_ha,
diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index d55f8b1c236d..f0c27c152d0d 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -1772,113 +1772,6 @@ pm80xx_chip_interrupt_disable(struct pm8001_hba_info *pm8001_ha, u8 vec)
pm80xx_chip_intx_interrupt_disable(pm8001_ha);
}

-static void pm80xx_send_abort_all(struct pm8001_hba_info *pm8001_ha,
- struct pm8001_device *pm8001_ha_dev)
-{
- struct pm8001_ccb_info *ccb;
- struct sas_task *task;
- struct task_abort_req task_abort;
- u32 opc = OPC_INB_SATA_ABORT;
- int ret;
-
- pm8001_ha_dev->id |= NCQ_ABORT_ALL_FLAG;
- pm8001_ha_dev->id &= ~NCQ_READ_LOG_FLAG;
-
- task = sas_alloc_slow_task(GFP_ATOMIC);
- if (!task) {
- pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task\n");
- return;
- }
- task->task_done = pm8001_task_done;
-
- ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
- if (!ccb) {
- sas_free_task(task);
- return;
- }
-
- memset(&task_abort, 0, sizeof(task_abort));
- task_abort.abort_all = cpu_to_le32(1);
- task_abort.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
- task_abort.tag = cpu_to_le32(ccb->ccb_tag);
-
- ret = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &task_abort,
- sizeof(task_abort), 0);
- pm8001_dbg(pm8001_ha, FAIL, "Executing abort task end\n");
- if (ret) {
- sas_free_task(task);
- pm8001_ccb_free(pm8001_ha, ccb);
- }
-}
-
-static void pm80xx_send_read_log(struct pm8001_hba_info *pm8001_ha,
- struct pm8001_device *pm8001_ha_dev)
-{
- struct sata_start_req sata_cmd;
- int res;
- struct pm8001_ccb_info *ccb;
- struct sas_task *task = NULL;
- struct host_to_dev_fis fis;
- struct domain_device *dev;
- u32 opc = OPC_INB_SATA_HOST_OPSTART;
-
- task = sas_alloc_slow_task(GFP_ATOMIC);
- if (!task) {
- pm8001_dbg(pm8001_ha, FAIL, "cannot allocate task !!!\n");
- return;
- }
- task->task_done = pm8001_task_done;
-
- /*
- * Allocate domain device by ourselves as libsas is not going to
- * provide any.
- */
- dev = kzalloc(sizeof(struct domain_device), GFP_ATOMIC);
- if (!dev) {
- sas_free_task(task);
- pm8001_dbg(pm8001_ha, FAIL,
- "Domain device cannot be allocated\n");
- return;
- }
-
- task->dev = dev;
- task->dev->lldd_dev = pm8001_ha_dev;
-
- ccb = pm8001_ccb_alloc(pm8001_ha, pm8001_ha_dev, task);
- if (!ccb) {
- sas_free_task(task);
- kfree(dev);
- return;
- }
-
- pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG;
- pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG;
-
- memset(&sata_cmd, 0, sizeof(sata_cmd));
-
- /* construct read log FIS */
- memset(&fis, 0, sizeof(struct host_to_dev_fis));
- fis.fis_type = 0x27;
- fis.flags = 0x80;
- fis.command = ATA_CMD_READ_LOG_EXT;
- fis.lbal = 0x10;
- fis.sector_count = 0x1;
-
- sata_cmd.tag = cpu_to_le32(ccb->ccb_tag);
- sata_cmd.device_id = cpu_to_le32(pm8001_ha_dev->device_id);
- sata_cmd.ncqtag_atap_dir_m_dad = cpu_to_le32(((0x1 << 7) | (0x5 << 9)));
- memcpy(&sata_cmd.sata_fis, &fis, sizeof(struct host_to_dev_fis));
-
- res = pm8001_mpi_build_cmd(pm8001_ha, 0, opc, &sata_cmd,
- sizeof(sata_cmd), 0);
- pm8001_dbg(pm8001_ha, FAIL, "Executing read log end\n");
- if (res) {
- sas_free_task(task);
- pm8001_ccb_free(pm8001_ha, ccb);
- kfree(dev);
- }
-}
-
/**
* mpi_ssp_completion - process the event that FW response to the SSP request.
* @pm8001_ha: our hba card information
@@ -2396,11 +2289,9 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha,
return;
}

- if ((pm8001_dev && !(pm8001_dev->id & NCQ_READ_LOG_FLAG))
- && unlikely(!t || !t->lldd_task || !t->dev)) {
- pm8001_dbg(pm8001_ha, FAIL, "task or dev null\n");
+
+ if (pm8001_dev && unlikely(!t->lldd_task || !t->dev))
return;
- }

ts = &t->task_status;

@@ -2457,15 +2348,6 @@ mpi_sata_completion(struct pm8001_hba_info *pm8001_ha,
if (param == 0) {
ts->resp = SAS_TASK_COMPLETE;
ts->stat = SAS_SAM_STAT_GOOD;
- /* check if response is for SEND READ LOG */
- if (pm8001_dev &&
- (pm8001_dev->id & NCQ_READ_LOG_FLAG)) {
- pm80xx_send_abort_all(pm8001_ha, pm8001_dev);
- /* Free the tag */
- pm8001_tag_free(pm8001_ha, tag);
- sas_free_task(t);
- return;
- }
} else {
u8 len;
ts->resp = SAS_TASK_COMPLETE;
@@ -2800,9 +2682,11 @@ static void mpi_sata_event(struct pm8001_hba_info *pm8001_ha,
if (event == IO_XFER_ERROR_ABORTED_NCQ_MODE) {
/* find device using device id */
pm8001_dev = pm8001_find_dev(pm8001_ha, dev_id);
- /* send read log extension */
+ /* send read log extension by aborting the link - libata does what we want */
if (pm8001_dev)
- pm80xx_send_read_log(pm8001_ha, pm8001_dev);
+ pm8001_handle_event(pm8001_ha,
+ pm8001_dev,
+ IO_XFER_ERROR_ABORTED_NCQ_MODE);
return;
}

@@ -4525,7 +4409,6 @@ static int pm80xx_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
u32 end_addr_high, end_addr_low;
u32 ATAP = 0x0;
u32 dir;
- unsigned long flags;
u32 opc = OPC_INB_SATA_HOST_OPSTART;
memset(&sata_cmd, 0, sizeof(sata_cmd));
cpu_id = smp_processor_id();
@@ -4704,40 +4587,6 @@ static int pm80xx_chip_sata_req(struct pm8001_hba_info *pm8001_ha,
(task->ata_task.atapi_packet[15] << 24)));
}

- /* Check for read log for failed drive and return */
- if (sata_cmd.sata_fis.command == 0x2f) {
- if (pm8001_ha_dev && ((pm8001_ha_dev->id & NCQ_READ_LOG_FLAG) ||
- (pm8001_ha_dev->id & NCQ_ABORT_ALL_FLAG) ||
- (pm8001_ha_dev->id & NCQ_2ND_RLE_FLAG))) {
- struct task_status_struct *ts;
-
- pm8001_ha_dev->id &= 0xDFFFFFFF;
- ts = &task->task_status;
-
- spin_lock_irqsave(&task->task_state_lock, flags);
- ts->resp = SAS_TASK_COMPLETE;
- ts->stat = SAS_SAM_STAT_GOOD;
- task->task_state_flags &= ~SAS_TASK_STATE_PENDING;
- task->task_state_flags |= SAS_TASK_STATE_DONE;
- if (unlikely((task->task_state_flags &
- SAS_TASK_STATE_ABORTED))) {
- spin_unlock_irqrestore(&task->task_state_lock,
- flags);
- pm8001_dbg(pm8001_ha, FAIL,
- "task 0x%p resp 0x%x stat 0x%x but aborted by upper layer\n",
- task, ts->resp,
- ts->stat);
- pm8001_ccb_task_free(pm8001_ha, ccb);
- return 0;
- } else {
- spin_unlock_irqrestore(&task->task_state_lock,
- flags);
- pm8001_ccb_task_free_done(pm8001_ha, ccb);
- atomic_dec(&pm8001_ha_dev->running_req);
- return 0;
- }
- }
- }
trace_pm80xx_request_issue(pm8001_ha->id,
ccb->device ? ccb->device->attached_phy : PM8001_MAX_PHYS,
ccb->ccb_tag, opc,
--
2.35.3

2022-08-11 19:13:46

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On 2022/08/11 11:54, Damien Le Moal wrote:
> On 2022/07/22 4:24, John Garry wrote:
>> As reported in [0], the pm8001 driver NCQ error handling more or less
>> duplicates what libata does in link error handling, as follows:
>> - abort all commands
>> - do autopsy with read log ext 10 command
>> - reset the target to recover
>>
>> Indeed for the hisi_sas driver we want to add similar handling for NCQ
>> errors.
>>
>> This series add a new libsas API - sas_ata_link_abort() - to handle host
>> NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
>> mentioned in the pm8001 changeover patch, I would prefer a better place to
>> locate the SATA ABORT command (rather that nexus reset callback).
>>
>> I would appreciate some testing of the pm8001 change as the read log ext10
>> command mostly hangs on my arm64 machine - these arm64 hangs are a known
>> issue.
>
> I applied this series on top of the current Linus tree and ran some tests: a
> bunch of fio runs and also ran libzbc test suites on a SATA SMR drive as that
> generates many command failures. No problems detected, the tests all pass.
> FYI, messages for failed commands look like this:
>
> pm80xx0:: mpi_sata_event 2685: SATA EVENT 0x23
> sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> sas: sas_scsi_find_task: aborting task 0x00000000ba62a907
> pm80xx0:: mpi_sata_completion 2292: task null, freeing CCB tag 2
> sas: sas_scsi_find_task: task 0x00000000ba62a907 is aborted
> sas: sas_eh_handle_sas_errors: task 0x00000000ba62a907 is aborted
> ata21.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
> ata21.00: failed command: WRITE FPDMA QUEUED
> ata21.00: cmd 61/02:00:ff:ff:ea/00:00:02:00:00/40 tag 29 ncq dma 8192 out
> res 43/04:02:ff:ff:ea/00:00:02:00:00/00 Emask 0x400 (NCQ error) <F>
> ata21.00: status: { DRDY SENSE ERR }
> ata21.00: error: { ABRT }
> ata21.00: configured for UDMA/133
> ata21: EH complete
> sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
>
> Seems all good to me.

Forgot to mention: tested with pm80xx driver.

>
>>
>> Finally with these changes we can make the libsas task alloc/free APIs
>> private, which they should always have been.
>>
>> Based on v5.19-rc6
>>
>> [0] https://lore.kernel.org/linux-scsi/[email protected]/
>>
>> John Garry (5):
>> scsi: pm8001: Modify task abort handling for SATA task
>> scsi: libsas: Add sas_ata_link_abort()
>> scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors
>> scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
>> scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
>>
>> Xingui Yang (1):
>> scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
>>
>> drivers/scsi/hisi_sas/hisi_sas_main.c | 5 +-
>> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++-
>> drivers/scsi/libsas/sas_ata.c | 10 ++
>> drivers/scsi/libsas/sas_init.c | 3 -
>> drivers/scsi/libsas/sas_internal.h | 4 +
>> drivers/scsi/pm8001/pm8001_hwi.c | 194 +++++++------------------
>> drivers/scsi/pm8001/pm8001_sas.c | 13 ++
>> drivers/scsi/pm8001/pm8001_sas.h | 8 +-
>> drivers/scsi/pm8001/pm80xx_hwi.c | 177 ++--------------------
>> include/scsi/libsas.h | 4 -
>> include/scsi/sas_ata.h | 5 +
>> 11 files changed, 132 insertions(+), 313 deletions(-)
>>
>
>


--
Damien Le Moal
Western Digital Research

2022-08-11 19:27:44

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On 2022/07/22 4:24, John Garry wrote:
> As reported in [0], the pm8001 driver NCQ error handling more or less
> duplicates what libata does in link error handling, as follows:
> - abort all commands
> - do autopsy with read log ext 10 command
> - reset the target to recover
>
> Indeed for the hisi_sas driver we want to add similar handling for NCQ
> errors.
>
> This series add a new libsas API - sas_ata_link_abort() - to handle host
> NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
> mentioned in the pm8001 changeover patch, I would prefer a better place to
> locate the SATA ABORT command (rather that nexus reset callback).
>
> I would appreciate some testing of the pm8001 change as the read log ext10
> command mostly hangs on my arm64 machine - these arm64 hangs are a known
> issue.

I applied this series on top of the current Linus tree and ran some tests: a
bunch of fio runs and also ran libzbc test suites on a SATA SMR drive as that
generates many command failures. No problems detected, the tests all pass.
FYI, messages for failed commands look like this:

pm80xx0:: mpi_sata_event 2685: SATA EVENT 0x23
sas: Enter sas_scsi_recover_host busy: 1 failed: 1
sas: sas_scsi_find_task: aborting task 0x00000000ba62a907
pm80xx0:: mpi_sata_completion 2292: task null, freeing CCB tag 2
sas: sas_scsi_find_task: task 0x00000000ba62a907 is aborted
sas: sas_eh_handle_sas_errors: task 0x00000000ba62a907 is aborted
ata21.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
ata21.00: failed command: WRITE FPDMA QUEUED
ata21.00: cmd 61/02:00:ff:ff:ea/00:00:02:00:00/40 tag 29 ncq dma 8192 out
res 43/04:02:ff:ff:ea/00:00:02:00:00/00 Emask 0x400 (NCQ error) <F>
ata21.00: status: { DRDY SENSE ERR }
ata21.00: error: { ABRT }
ata21.00: configured for UDMA/133
ata21: EH complete
sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

Seems all good to me.

>
> Finally with these changes we can make the libsas task alloc/free APIs
> private, which they should always have been.
>
> Based on v5.19-rc6
>
> [0] https://lore.kernel.org/linux-scsi/[email protected]/
>
> John Garry (5):
> scsi: pm8001: Modify task abort handling for SATA task
> scsi: libsas: Add sas_ata_link_abort()
> scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors
> scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
> scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
>
> Xingui Yang (1):
> scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
>
> drivers/scsi/hisi_sas/hisi_sas_main.c | 5 +-
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++-
> drivers/scsi/libsas/sas_ata.c | 10 ++
> drivers/scsi/libsas/sas_init.c | 3 -
> drivers/scsi/libsas/sas_internal.h | 4 +
> drivers/scsi/pm8001/pm8001_hwi.c | 194 +++++++------------------
> drivers/scsi/pm8001/pm8001_sas.c | 13 ++
> drivers/scsi/pm8001/pm8001_sas.h | 8 +-
> drivers/scsi/pm8001/pm80xx_hwi.c | 177 ++--------------------
> include/scsi/libsas.h | 4 -
> include/scsi/sas_ata.h | 5 +
> 11 files changed, 132 insertions(+), 313 deletions(-)
>


--
Damien Le Moal
Western Digital Research

2022-08-12 05:04:41

by Jinpu Wang

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On Fri, Jul 22, 2022 at 1:30 PM John Garry <[email protected]> wrote:
>
> As reported in [0], the pm8001 driver NCQ error handling more or less
> duplicates what libata does in link error handling, as follows:
> - abort all commands
> - do autopsy with read log ext 10 command
> - reset the target to recover
>
> Indeed for the hisi_sas driver we want to add similar handling for NCQ
> errors.
>
> This series add a new libsas API - sas_ata_link_abort() - to handle host
> NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
> mentioned in the pm8001 changeover patch, I would prefer a better place to
> locate the SATA ABORT command (rather that nexus reset callback).
>
> I would appreciate some testing of the pm8001 change as the read log ext10
> command mostly hangs on my arm64 machine - these arm64 hangs are a known
> issue.
>
> Finally with these changes we can make the libsas task alloc/free APIs
> private, which they should always have been.
>
> Based on v5.19-rc6
>
> [0] https://lore.kernel.org/linux-scsi/[email protected]/
>
> John Garry (5):
> scsi: pm8001: Modify task abort handling for SATA task
> scsi: libsas: Add sas_ata_link_abort()
> scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors
> scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
> scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
>
> Xingui Yang (1):
> scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
>
> drivers/scsi/hisi_sas/hisi_sas_main.c | 5 +-
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++-
> drivers/scsi/libsas/sas_ata.c | 10 ++
> drivers/scsi/libsas/sas_init.c | 3 -
> drivers/scsi/libsas/sas_internal.h | 4 +
> drivers/scsi/pm8001/pm8001_hwi.c | 194 +++++++------------------
> drivers/scsi/pm8001/pm8001_sas.c | 13 ++
> drivers/scsi/pm8001/pm8001_sas.h | 8 +-
> drivers/scsi/pm8001/pm80xx_hwi.c | 177 ++--------------------
> include/scsi/libsas.h | 4 -
> include/scsi/sas_ata.h | 5 +
> 11 files changed, 132 insertions(+), 313 deletions(-)

Thank! John and Damien,
for pm80xx.
Acked-by: Jack Wang <[email protected]>
> --
> 2.35.3
>

2022-08-12 08:17:32

by John Garry

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On 11/08/2022 19:54, Damien Le Moal wrote:
> On 2022/07/22 4:24, John Garry wrote:
>> As reported in [0], the pm8001 driver NCQ error handling more or less
>> duplicates what libata does in link error handling, as follows:
>> - abort all commands
>> - do autopsy with read log ext 10 command
>> - reset the target to recover
>>
>> Indeed for the hisi_sas driver we want to add similar handling for NCQ
>> errors.
>>
>> This series add a new libsas API - sas_ata_link_abort() - to handle host
>> NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
>> mentioned in the pm8001 changeover patch, I would prefer a better place to
>> locate the SATA ABORT command (rather that nexus reset callback).
>>
>> I would appreciate some testing of the pm8001 change as the read log ext10
>> command mostly hangs on my arm64 machine - these arm64 hangs are a known
>> issue.
>

Thanks for this!

> I applied this series on top of the current Linus tree and ran some tests: a
> bunch of fio runs and also ran libzbc test suites on a SATA SMR drive as that
> generates many command failures. No problems detected, the tests all pass.
> FYI, messages for failed commands look like this:
>
> pm80xx0:: mpi_sata_event 2685: SATA EVENT 0x23
> sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> sas: sas_scsi_find_task: aborting task 0x00000000ba62a907
> pm80xx0:: mpi_sata_completion 2292: task null, freeing CCB tag 2
> sas: sas_scsi_find_task: task 0x00000000ba62a907 is aborted
> sas: sas_eh_handle_sas_errors: task 0x00000000ba62a907 is aborted
> ata21.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
> ata21.00: failed command: WRITE FPDMA QUEUED
> ata21.00: cmd 61/02:00:ff:ff:ea/00:00:02:00:00/40 tag 29 ncq dma 8192 out
> res 43/04:02:ff:ff:ea/00:00:02:00:00/00 Emask 0x400 (NCQ error) <F>
> ata21.00: status: { DRDY SENSE ERR }
> ata21.00: error: { ABRT }
> ata21.00: configured for UDMA/133
> ata21: EH complete
> sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
>

For this specific test we don't seem to run a hardreset after the
autopsy, but we do seem to be getting an NCQ error. That's interesting.

We have noticed this scenario for hisi_sas NCQ error, whereby the
autopsy decided a reset is not required or useful, such as a medium
error. Anyway the pm8001 driver relies on the reset being run always for
the NCQ error. So I am thinking of tweaking sas_ata_link_abort() as follows:

void sas_ata_link_abort(struct domain_device *device)
{
struct ata_port *ap = device->sata_dev.ap;
struct ata_link *link = &ap->link;

link->eh_info.err_mask |= AC_ERR_DEV;
+ link->eh_info.action |= ATA_EH_RESET;
ata_link_abort(link);
}

This should force a reset.

Thanks,
John

> Seems all good to me.
>
>>
>> Finally with these changes we can make the libsas task alloc/free APIs
>> private, which they should always have been.
>>
>> Based on v5.19-rc6
>>
>> [0] https://lore.kernel.org/linux-scsi/[email protected]/
>>
>> John Garry (5):
>> scsi: pm8001: Modify task abort handling for SATA task
>> scsi: libsas: Add sas_ata_link_abort()
>> scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors
>> scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
>> scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
>>
>> Xingui Yang (1):
>> scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
>>
>> drivers/scsi/hisi_sas/hisi_sas_main.c | 5 +-
>> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++-
>> drivers/scsi/libsas/sas_ata.c | 10 ++
>> drivers/scsi/libsas/sas_init.c | 3 -
>> drivers/scsi/libsas/sas_internal.h | 4 +
>> drivers/scsi/pm8001/pm8001_hwi.c | 194 +++++++------------------
>> drivers/scsi/pm8001/pm8001_sas.c | 13 ++
>> drivers/scsi/pm8001/pm8001_sas.h | 8 +-
>> drivers/scsi/pm8001/pm80xx_hwi.c | 177 ++--------------------
>> include/scsi/libsas.h | 4 -
>> include/scsi/sas_ata.h | 5 +
>> 11 files changed, 132 insertions(+), 313 deletions(-)
>>
>
>

2022-08-12 16:03:26

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On 2022/08/12 1:06, John Garry wrote:
> On 11/08/2022 19:54, Damien Le Moal wrote:
>> On 2022/07/22 4:24, John Garry wrote:
>>> As reported in [0], the pm8001 driver NCQ error handling more or less
>>> duplicates what libata does in link error handling, as follows:
>>> - abort all commands
>>> - do autopsy with read log ext 10 command
>>> - reset the target to recover
>>>
>>> Indeed for the hisi_sas driver we want to add similar handling for NCQ
>>> errors.
>>>
>>> This series add a new libsas API - sas_ata_link_abort() - to handle host
>>> NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. As
>>> mentioned in the pm8001 changeover patch, I would prefer a better place to
>>> locate the SATA ABORT command (rather that nexus reset callback).
>>>
>>> I would appreciate some testing of the pm8001 change as the read log ext10
>>> command mostly hangs on my arm64 machine - these arm64 hangs are a known
>>> issue.
>>
>
> Thanks for this!
>
>> I applied this series on top of the current Linus tree and ran some tests: a
>> bunch of fio runs and also ran libzbc test suites on a SATA SMR drive as that
>> generates many command failures. No problems detected, the tests all pass.
>> FYI, messages for failed commands look like this:
>>
>> pm80xx0:: mpi_sata_event 2685: SATA EVENT 0x23
>> sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>> sas: sas_scsi_find_task: aborting task 0x00000000ba62a907
>> pm80xx0:: mpi_sata_completion 2292: task null, freeing CCB tag 2
>> sas: sas_scsi_find_task: task 0x00000000ba62a907 is aborted
>> sas: sas_eh_handle_sas_errors: task 0x00000000ba62a907 is aborted
>> ata21.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x0
>> ata21.00: failed command: WRITE FPDMA QUEUED
>> ata21.00: cmd 61/02:00:ff:ff:ea/00:00:02:00:00/40 tag 29 ncq dma 8192 out
>> res 43/04:02:ff:ff:ea/00:00:02:00:00/00 Emask 0x400 (NCQ error) <F>
>> ata21.00: status: { DRDY SENSE ERR }
>> ata21.00: error: { ABRT }
>> ata21.00: configured for UDMA/133
>> ata21: EH complete
>> sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
>>
>
> For this specific test we don't seem to run a hardreset after the
> autopsy, but we do seem to be getting an NCQ error. That's interesting.
>
> We have noticed this scenario for hisi_sas NCQ error, whereby the
> autopsy decided a reset is not required or useful, such as a medium
> error. Anyway the pm8001 driver relies on the reset being run always for
> the NCQ error. So I am thinking of tweaking sas_ata_link_abort() as follows:
>
> void sas_ata_link_abort(struct domain_device *device)
> {
> struct ata_port *ap = device->sata_dev.ap;
> struct ata_link *link = &ap->link;
>
> link->eh_info.err_mask |= AC_ERR_DEV;
> + link->eh_info.action |= ATA_EH_RESET;
> ata_link_abort(link);
> }
>
> This should force a reset.

This is an unaligned write to a sequential write required zone on SMR. So
definitely not worth a reset. Forcing hard resetting the link for such error is
an overkill. I think it is better to let ata_link_abort() -> ... -> scsi & ata
EH decide on the disposition.

Note that patch 3 did not apply cleanly to the current Linus tree. So a rebase
for the series is needed.

>
> Thanks,
> John
>
>> Seems all good to me.
>>
>>>
>>> Finally with these changes we can make the libsas task alloc/free APIs
>>> private, which they should always have been.
>>>
>>> Based on v5.19-rc6
>>>
>>> [0] https://lore.kernel.org/linux-scsi/[email protected]/
>>>
>>> John Garry (5):
>>> scsi: pm8001: Modify task abort handling for SATA task
>>> scsi: libsas: Add sas_ata_link_abort()
>>> scsi: pm8001: Use sas_ata_link_abort() to handle NCQ errors
>>> scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
>>> scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
>>>
>>> Xingui Yang (1):
>>> scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
>>>
>>> drivers/scsi/hisi_sas/hisi_sas_main.c | 5 +-
>>> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 22 ++-
>>> drivers/scsi/libsas/sas_ata.c | 10 ++
>>> drivers/scsi/libsas/sas_init.c | 3 -
>>> drivers/scsi/libsas/sas_internal.h | 4 +
>>> drivers/scsi/pm8001/pm8001_hwi.c | 194 +++++++------------------
>>> drivers/scsi/pm8001/pm8001_sas.c | 13 ++
>>> drivers/scsi/pm8001/pm8001_sas.h | 8 +-
>>> drivers/scsi/pm8001/pm80xx_hwi.c | 177 ++--------------------
>>> include/scsi/libsas.h | 4 -
>>> include/scsi/sas_ata.h | 5 +
>>> 11 files changed, 132 insertions(+), 313 deletions(-)
>>>
>>
>>
>


--
Damien Le Moal
Western Digital Research

2022-08-12 16:40:22

by John Garry

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On 12/08/2022 16:39, Damien Le Moal wrote:
>> For this specific test we don't seem to run a hardreset after the
>> autopsy, but we do seem to be getting an NCQ error. That's interesting.
>>
>> We have noticed this scenario for hisi_sas NCQ error, whereby the
>> autopsy decided a reset is not required or useful, such as a medium
>> error. Anyway the pm8001 driver relies on the reset being run always for
>> the NCQ error. So I am thinking of tweaking sas_ata_link_abort() as follows:
>>
>> void sas_ata_link_abort(struct domain_device *device)
>> {
>> struct ata_port *ap = device->sata_dev.ap;
>> struct ata_link *link = &ap->link;
>>
>> link->eh_info.err_mask |= AC_ERR_DEV;
>> + link->eh_info.action |= ATA_EH_RESET;
>> ata_link_abort(link);
>> }
>>
>> This should force a reset.
> This is an unaligned write to a sequential write required zone on SMR. So
> definitely not worth a reset. Forcing hard resetting the link for such error is
> an overkill. I think it is better to let ata_link_abort() -> ... -> scsi & ata
> EH decide on the disposition.

Do you know if this triggered the pm8001 IO_XFER_ERROR_ABORTED_NCQ_MODE
error?

If I do not set ATA_EH_RESET then I need to trust that libata will
always decide to do the reset for pm8001 IO_XFER_ERROR_ABORTED_NCQ_MODE
error. That is because it is in the reset that I send the pm8001 "abort
all" command - I could not find a better place for it.

>
> Note that patch 3 did not apply cleanly to the current Linus tree. So a rebase
> for the series is needed.
>

That might be just git am, which always seems temperamental. The patches
still apply from cherry-pick'ing for me. Anyway, I'll send a new version
next week.

Thanks,
John

2022-08-12 19:03:22

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 0/6] libsas and drivers: NCQ error handling

On 2022/08/12 9:33, John Garry wrote:
> On 12/08/2022 16:39, Damien Le Moal wrote:
>>> For this specific test we don't seem to run a hardreset after the
>>> autopsy, but we do seem to be getting an NCQ error. That's interesting.
>>>
>>> We have noticed this scenario for hisi_sas NCQ error, whereby the
>>> autopsy decided a reset is not required or useful, such as a medium
>>> error. Anyway the pm8001 driver relies on the reset being run always for
>>> the NCQ error. So I am thinking of tweaking sas_ata_link_abort() as follows:
>>>
>>> void sas_ata_link_abort(struct domain_device *device)
>>> {
>>> struct ata_port *ap = device->sata_dev.ap;
>>> struct ata_link *link = &ap->link;
>>>
>>> link->eh_info.err_mask |= AC_ERR_DEV;
>>> + link->eh_info.action |= ATA_EH_RESET;
>>> ata_link_abort(link);
>>> }
>>>
>>> This should force a reset.
>> This is an unaligned write to a sequential write required zone on SMR. So
>> definitely not worth a reset. Forcing hard resetting the link for such error is
>> an overkill. I think it is better to let ata_link_abort() -> ... -> scsi & ata
>> EH decide on the disposition.
>
> Do you know if this triggered the pm8001 IO_XFER_ERROR_ABORTED_NCQ_MODE
> error?
>
> If I do not set ATA_EH_RESET then I need to trust that libata will
> always decide to do the reset for pm8001 IO_XFER_ERROR_ABORTED_NCQ_MODE
> error. That is because it is in the reset that I send the pm8001 "abort
> all" command - I could not find a better place for it.

Not sure what error it was. Will need to add a print of it to check. Easy to do.

>
>>
>> Note that patch 3 did not apply cleanly to the current Linus tree. So a rebase
>> for the series is needed.
>>
>
> That might be just git am, which always seems temperamental. The patches
> still apply from cherry-pick'ing for me. Anyway, I'll send a new version
> next week.

Yes, it was a "bad ancestor" thing. Direct patching worked just fine.

>
> Thanks,
> John
>


--
Damien Le Moal
Western Digital Research