2020-09-08 08:18:11

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v3 0/4] qla2xxx: A couple crash fixes

The first crash we observed is due memory corruption in the srb memory
pool. Unforuntatly, I couldn't find the source of the problem but the
workaround by resetting the cleanup callbacks 'fixes' this problem
(patch #1). I think as intermeditate step this should be merged until
the real cause can be identified.

The second crash is due a race condition(?) in the firmware. The sts
entries are not updated in time which leads to this crash pattern
which several customers have reported:

#0 [c00000ffffd1bb80] scsi_dma_unmap at d00000001e4904d4 [scsi_mod]
#1 [c00000ffffd1bbe0] qla2x00_sp_compl at d0000000204803cc [qla2xxx]
#2 [c00000ffffd1bc20] qla24xx_process_response_queue at d0000000204c5810 [qla2xxx]
#3 [c00000ffffd1bd50] qla24xx_msix_rsp_q at d0000000204c8fd8 [qla2xxx]
#4 [c00000ffffd1bde0] __handle_irq_event_percpu at c000000000189510
#5 [c00000ffffd1bea0] handle_irq_event_percpu at c00000000018978c
#6 [c00000ffffd1bee0] handle_irq_event at c00000000018984c
#7 [c00000ffffd1bf10] handle_fasteoi_irq at c00000000018efc0
#8 [c00000ffffd1bf40] generic_handle_irq at c000000000187f10
#9 [c00000ffffd1bf60] __do_irq at c000000000018784
#10 [c00000ffffd1bf90] call_do_irq at c00000000002caa4
#11 [c00000ecca417a00] do_IRQ at c000000000018970
#12 [c00000ecca417a50] restore_check_irq_replay at c00000000000de98

From analyzing the crash dump it was clear that
qla24xx_mbx_iocb_entry() calls sp->done (qla2x00_sp_compl) which
crashes because the response is not a mailbox entry, it is a status
entry. Patch #4 changes the process logic for mailbox commands so that
the sp is parsed before calling the correct proccess function.


changes since v1:
- addressed review comments by Martin
- patch#1: added dummy warn function
- patch#4: added log entry

changes since v2:
- added reviewed tags by Martin
- addressed review comments by Arun
- patch#1: add srb pointer to log message
- patch#3: print calling func name in qla2x00_get_sp_from_handle()
- patch#4: dropped comment, reset HBA


Daniel Wagner (4):
qla2xxx: Warn if done() or free() are called on an already freed srb
qla2xxx: Simplify return value logic in qla2x00_get_sp_from_handle()
qla2xxx: Log calling function name in qla2x00_get_sp_from_handle()
qla2xxx: Handle incorrect entry_type entries

drivers/scsi/qla2xxx/qla_init.c | 10 +++++++++
drivers/scsi/qla2xxx/qla_inline.h | 5 +++++
drivers/scsi/qla2xxx/qla_isr.c | 47 ++++++++++++++++++++++++++++++---------
3 files changed, 51 insertions(+), 11 deletions(-)

--
2.16.4


2020-09-08 08:18:20

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v3 1/4] qla2xxx: Warn if done() or free() are called on an already freed srb

Emit a warning when ->done or ->free are called on an already freed
srb. There is a hidden use-after-free bug in the driver which corrupts
the srb memory pool which originates from the cleanup callbacks.

An extensive search didn't bring any lights on the real problem. The
initial fix was to set both pointers to NULL and try to catch invalid
accesses. But instead the memory corruption was gone and the driver
didn't crash. Since not all calling places check for NULL pointer, add
explicitly default handlers. With this we workaround the memory
corruption and add a debug help.

Reviewed-by: Martin Wilck <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/scsi/qla2xxx/qla_init.c | 10 ++++++++++
drivers/scsi/qla2xxx/qla_inline.h | 5 +++++
2 files changed, 15 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
index 57a2d76aa691..fb7d57dc4e69 100644
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -63,6 +63,16 @@ void qla2x00_sp_free(srb_t *sp)
qla2x00_rel_sp(sp);
}

+void qla2xxx_rel_done_warning(srb_t *sp, int res)
+{
+ WARN_ONCE(1, "Calling done() of an already freed srb %p object\n", sp);
+}
+
+void qla2xxx_rel_free_warning(srb_t *sp)
+{
+ WARN_ONCE(1, "Calling free() of an already freed srb %p object\n", sp);
+}
+
/* Asynchronous Login/Logout Routines -------------------------------------- */

unsigned long
diff --git a/drivers/scsi/qla2xxx/qla_inline.h b/drivers/scsi/qla2xxx/qla_inline.h
index 861dc522723c..2aa6f81f87c4 100644
--- a/drivers/scsi/qla2xxx/qla_inline.h
+++ b/drivers/scsi/qla2xxx/qla_inline.h
@@ -207,10 +207,15 @@ qla2xxx_get_qpair_sp(scsi_qla_host_t *vha, struct qla_qpair *qpair,
return sp;
}

+void qla2xxx_rel_done_warning(srb_t *sp, int res);
+void qla2xxx_rel_free_warning(srb_t *sp);
+
static inline void
qla2xxx_rel_qpair_sp(struct qla_qpair *qpair, srb_t *sp)
{
sp->qpair = NULL;
+ sp->done = qla2xxx_rel_done_warning;
+ sp->free = qla2xxx_rel_free_warning;
mempool_free(sp, qpair->srb_mempool);
QLA_QPAIR_MARK_NOT_BUSY(qpair);
}
--
2.16.4

2020-09-08 08:19:04

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH v3 4/4] qla2xxx: Handle incorrect entry_type entries

It was observed on an ISP8324 16Gb HBA with fw=8.08.203 (d0d5) in a
PowerPC64 machine that pkt->entry_type was MBX_IOCB_TYPE/0x39 with an
sp->type SRB_SCSI_CMD which is invalid and should not be possible.

Reading the entry_type from the crash dump shows the expected value of
STATUS_TYPE/0x03 but the call trace shows that qla24xx_mbx_iocb_entry()
is used.

Add a check to verify for consistency and reset the HBA if an invalid
state is reached. Obviously, this is only a workaround until the real
problem is solved.

Signed-off-by: Daniel Wagner <[email protected]>
---
drivers/scsi/qla2xxx/qla_isr.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index b0b6dd2b608d..f953564cbed8 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -3406,6 +3406,32 @@ void qla24xx_nvme_ls4_iocb(struct scsi_qla_host *vha,
sp->done(sp, comp_status);
}

+static void qla24xx_process_mbx_iocb_response(struct scsi_qla_host *vha,
+ struct rsp_que *rsp, struct sts_entry_24xx *pkt)
+{
+ struct qla_hw_data *ha = vha->hw;
+ srb_t *sp;
+ const char func[] = "MBX-IOCB2";
+
+ sp = qla2x00_get_sp_from_handle(vha, func, rsp->req, pkt);
+ if (!sp)
+ return;
+
+ if (sp->type == SRB_SCSI_CMD ||
+ sp->type == SRB_NVME_CMD ||
+ sp->type == SRB_TM_CMD) {
+ ql_log(ql_log_warn, vha, 0x509d,
+ "Inconsistent event entry type %d\n", sp->type);
+ if (IS_P3P_TYPE(ha))
+ set_bit(FCOE_CTX_RESET_NEEDED, &vha->dpc_flags);
+ else
+ set_bit(ISP_ABORT_NEEDED, &vha->dpc_flags);
+ return;
+ }
+
+ qla24xx_mbx_iocb_entry(vha, rsp->req, (struct mbx_24xx_entry *)pkt);
+}
+
/**
* qla24xx_process_response_queue() - Process response queue entries.
* @vha: SCSI driver HA context
@@ -3513,8 +3539,7 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
(struct abort_entry_24xx *)pkt);
break;
case MBX_IOCB_TYPE:
- qla24xx_mbx_iocb_entry(vha, rsp->req,
- (struct mbx_24xx_entry *)pkt);
+ qla24xx_process_mbx_iocb_response(vha, rsp, pkt);
break;
case VP_CTRL_IOCB_TYPE:
qla_ctrlvp_completed(vha, rsp->req,
--
2.16.4

2020-09-09 22:41:16

by Arun Easi

[permalink] [raw]
Subject: Re: [EXT] [PATCH v3 0/4] qla2xxx: A couple crash fixes

Looks good. Thanks Daniel.

For the series:
Reviewed-by: Arun Easi <[email protected]>

Regards,
-Arun

On Tue, 8 Sep 2020, 1:15am, Daniel Wagner wrote:

> External Email
>
> ----------------------------------------------------------------------
> The first crash we observed is due memory corruption in the srb memory
> pool. Unforuntatly, I couldn't find the source of the problem but the
> workaround by resetting the cleanup callbacks 'fixes' this problem
> (patch #1). I think as intermeditate step this should be merged until
> the real cause can be identified.
>
> The second crash is due a race condition(?) in the firmware. The sts
> entries are not updated in time which leads to this crash pattern
> which several customers have reported:
>
> #0 [c00000ffffd1bb80] scsi_dma_unmap at d00000001e4904d4 [scsi_mod]
> #1 [c00000ffffd1bbe0] qla2x00_sp_compl at d0000000204803cc [qla2xxx]
> #2 [c00000ffffd1bc20] qla24xx_process_response_queue at d0000000204c5810 [qla2xxx]
> #3 [c00000ffffd1bd50] qla24xx_msix_rsp_q at d0000000204c8fd8 [qla2xxx]
> #4 [c00000ffffd1bde0] __handle_irq_event_percpu at c000000000189510
> #5 [c00000ffffd1bea0] handle_irq_event_percpu at c00000000018978c
> #6 [c00000ffffd1bee0] handle_irq_event at c00000000018984c
> #7 [c00000ffffd1bf10] handle_fasteoi_irq at c00000000018efc0
> #8 [c00000ffffd1bf40] generic_handle_irq at c000000000187f10
> #9 [c00000ffffd1bf60] __do_irq at c000000000018784
> #10 [c00000ffffd1bf90] call_do_irq at c00000000002caa4
> #11 [c00000ecca417a00] do_IRQ at c000000000018970
> #12 [c00000ecca417a50] restore_check_irq_replay at c00000000000de98
>
> From analyzing the crash dump it was clear that
> qla24xx_mbx_iocb_entry() calls sp->done (qla2x00_sp_compl) which
> crashes because the response is not a mailbox entry, it is a status
> entry. Patch #4 changes the process logic for mailbox commands so that
> the sp is parsed before calling the correct proccess function.
>
>
> changes since v1:
> - addressed review comments by Martin
> - patch#1: added dummy warn function
> - patch#4: added log entry
>
> changes since v2:
> - added reviewed tags by Martin
> - addressed review comments by Arun
> - patch#1: add srb pointer to log message
> - patch#3: print calling func name in qla2x00_get_sp_from_handle()
> - patch#4: dropped comment, reset HBA
>
>
> Daniel Wagner (4):
> qla2xxx: Warn if done() or free() are called on an already freed srb
> qla2xxx: Simplify return value logic in qla2x00_get_sp_from_handle()
> qla2xxx: Log calling function name in qla2x00_get_sp_from_handle()
> qla2xxx: Handle incorrect entry_type entries
>
> drivers/scsi/qla2xxx/qla_init.c | 10 +++++++++
> drivers/scsi/qla2xxx/qla_inline.h | 5 +++++
> drivers/scsi/qla2xxx/qla_isr.c | 47 ++++++++++++++++++++++++++++++---------
> 3 files changed, 51 insertions(+), 11 deletions(-)
>
>

2020-09-10 02:22:06

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] qla2xxx: A couple crash fixes


Daniel,

> The first crash we observed is due memory corruption in the srb memory
> pool. Unforuntatly, I couldn't find the source of the problem but the
> workaround by resetting the cleanup callbacks 'fixes' this problem
> (patch #1). I think as intermeditate step this should be merged until
> the real cause can be identified.
>
> The second crash is due a race condition(?) in the firmware. The sts
> entries are not updated in time which leads to this crash pattern
> which several customers have reported:

Applied to 5.10/scsi-staging. Thanks!

--
Martin K. Petersen Oracle Linux Engineering

2020-09-15 20:25:36

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] qla2xxx: A couple crash fixes

On Tue, 8 Sep 2020 10:15:12 +0200, Daniel Wagner wrote:

> The first crash we observed is due memory corruption in the srb memory
> pool. Unforuntatly, I couldn't find the source of the problem but the
> workaround by resetting the cleanup callbacks 'fixes' this problem
> (patch #1). I think as intermeditate step this should be merged until
> the real cause can be identified.
>
> The second crash is due a race condition(?) in the firmware. The sts
> entries are not updated in time which leads to this crash pattern
> which several customers have reported:
>
> [...]

Applied to 5.10/scsi-queue, thanks!

[1/4] scsi: qla2xxx: Warn if done() or free() are called on an already freed srb
https://git.kernel.org/mkp/scsi/c/c0014f94218e
[2/4] scsi: qla2xxx: Simplify return value logic in qla2x00_get_sp_from_handle()
https://git.kernel.org/mkp/scsi/c/622299f16f33
[3/4] scsi: qla2xxx: Log calling function name in qla2x00_get_sp_from_handle()
https://git.kernel.org/mkp/scsi/c/7d88d5dff95f
[4/4] scsi: qla2xxx: Handle incorrect entry_type entries
https://git.kernel.org/mkp/scsi/c/31a3271ff11b

--
Martin K. Petersen Oracle Linux Engineering