Subject: [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O

This patchset addresses a couple of errors that might happen during
PCI device remove (e.g., PCI hotplug, PowerVM DLPAR), which prevent
the successful removal and re-addition of the adapter to the system,
and cause an oops and/or invalid DMA access (triggers an EEH event).

It allowed several cycles of PCI device add/remove with ongoing I/O,
to complete successfully without triggering oopses or EEH events.

Verified on v4.9-rc3.

Test-case:
---
# lspci
<...>
001d:70:00.0 Fibre Channel: QLogic Corp. ISP2532-based ...
001d:70:00.1 Fibre Channel: QLogic Corp. ISP2532-based ...
<...>

# for sd in $(find /sys/bus/pci/devices/001d:70:00.*/ \
-name 'sd*' -printf "%f\n"); do \
dd if=/dev/$sd of=/dev/null iflag=nocache & done

# echo 1 | tee /sys/bus/pci/devices/001d:70:00.*/remove
(this either works or not)

# echo 1 > /sys/bus/pci/rescan

Before:
---
<...>
EEH: Frozen PHB#1d-PE#700000 detected
qla2xxx [001d:70:00.1]-8042:2: PCI/Register disconnect, exiting.
<...>
EEH: Detected PCI bus error on PHB#29-PE#700000
<...>
(and/or)
Unable to handle kernel paging request for data at address 0x00000138
<...>
NIP [d000000004700a40] qla2xxx_queuecommand+0x80/0x3f0 [qla2xxx]
LR [d000000004700a10] qla2xxx_queuecommand+0x50/0x3f0 [qla2xxx]

(command does not return; adapter cannot be re-added)

After:
---
<...>
qla2xxx [001d:70:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2003.
<...>
qla2xxx [001d:70:00.1]-801c:2: Abort command issued nexus=2:3:0 -- 1 2003.
<...>

(command does return; adapter can be re-added correctly)


Mauricio Faria de Oliveira (2):
qla2xxx: do not queue commands when unloading
qla2xxx: fix invalid DMA access after command aborts in PCI device
remove

drivers/scsi/qla2xxx/qla_os.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

--
1.8.3.1


Subject: [PATCH 2/2] qla2xxx: fix invalid DMA access after command aborts in PCI device remove

If a command is aborted in the kernel but not in the adapter, it might be
considered complete and its DMA memory released, but it is still alive in
the adapter, which will trigger an invalid DMA access upon its completion
(in the DMA operations to deliver the command response to the driver).

On powerpc platforms with IOMMU/EEH capabilities, the problem is observed
during PCI device removal with ongoing IO requests -- which might trigger
an EEH event very often, pointing to a 'TCE Request Page Access Error'.

In that path, which is qla2x00_remove_one(), the commands are aborted in
qla2x00_abort_all_cmds(), which does not perform an abort in the adapter
as is done in qla2xxx_eh_abort() for example.

So, this patch changes qla2x00_abort_all_cmds() to abort commands in the
adapter too, with a call to qla2xxx_eh_abort(), which already implements
all the logic to submit abort requests and handle responses.

Reported-by: Naresh Bannoth <[email protected]>
Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
drivers/scsi/qla2xxx/qla_os.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index fdb135b..c50dd22 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1456,6 +1456,15 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {
sp = req->outstanding_cmds[cnt];
if (sp) {
+ /* Get a reference to the sp and drop the lock.
+ * The reference ensures this sp->done() call
+ * - and not the call in qla2xxx_eh_abort() -
+ * ends the SCSI command (with result 'res').
+ */
+ sp_get(sp);
+ spin_unlock_irqrestore(&ha->hardware_lock, flags);
+ qla2xxx_eh_abort(GET_CMD_SP(sp));
+ spin_lock_irqsave(&ha->hardware_lock, flags);
req->outstanding_cmds[cnt] = NULL;
sp->done(vha, sp, res);
}
--
1.8.3.1

Subject: [PATCH 1/2] qla2xxx: do not queue commands when unloading

When the driver is unloading, in qla2x00_remove_one(), there is a single
call/point in time to abort ongoing commands, qla2x00_abort_all_cmds(),
which is still several steps away from the call to scsi_remove_host().

If more commands continue to arrive and be processed during that interval,
when the driver is tearing down and releasing its structures, it might
potentially hit an oops due to invalid memory access:

Unable to handle kernel paging request for data at address 0x00000138
<...>
NIP [d000000004700a40] qla2xxx_queuecommand+0x80/0x3f0 [qla2xxx]
LR [d000000004700a10] qla2xxx_queuecommand+0x50/0x3f0 [qla2xxx]

So, fail commands in qla2xxx_queuecommand() if the UNLOADING bit is set.

Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
---
drivers/scsi/qla2xxx/qla_os.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index ace65db..fdb135b 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -707,6 +707,11 @@ static int qla25xx_setup_mode(struct scsi_qla_host *vha)
srb_t *sp;
int rval;

+ if (unlikely(test_bit(UNLOADING, &base_vha->dpc_flags))) {
+ cmd->result = DID_NO_CONNECT << 16;
+ goto qc24_fail_command;
+ }
+
if (ha->flags.eeh_busy) {
if (ha->flags.pci_channel_io_perm_failure) {
ql_dbg(ql_dbg_aer, vha, 0x9010,
--
1.8.3.1

2016-11-08 20:22:00

by Madhani, Himanshu

[permalink] [raw]
Subject: Re: [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O



On 11/7/16, 11:53 AM, "Mauricio Faria de Oliveira" <[email protected]> wrote:

>This patchset addresses a couple of errors that might happen during
>PCI device remove (e.g., PCI hotplug, PowerVM DLPAR), which prevent
>the successful removal and re-addition of the adapter to the system,
>and cause an oops and/or invalid DMA access (triggers an EEH event).
>
>It allowed several cycles of PCI device add/remove with ongoing I/O,
>to complete successfully without triggering oopses or EEH events.
>
>Verified on v4.9-rc3.
>
>Test-case:
>---
> # lspci
> <...>
> 001d:70:00.0 Fibre Channel: QLogic Corp. ISP2532-based ...
> 001d:70:00.1 Fibre Channel: QLogic Corp. ISP2532-based ...
> <...>
>
> # for sd in $(find /sys/bus/pci/devices/001d:70:00.*/ \
> -name 'sd*' -printf "%f\n"); do \
> dd if=/dev/$sd of=/dev/null iflag=nocache & done
>
> # echo 1 | tee /sys/bus/pci/devices/001d:70:00.*/remove
> (this either works or not)
>
> # echo 1 > /sys/bus/pci/rescan
>
>Before:
>---
> <...>
> EEH: Frozen PHB#1d-PE#700000 detected
> qla2xxx [001d:70:00.1]-8042:2: PCI/Register disconnect, exiting.
> <...>
> EEH: Detected PCI bus error on PHB#29-PE#700000
> <...>
> (and/or)
> Unable to handle kernel paging request for data at address 0x00000138
> <...>
> NIP [d000000004700a40] qla2xxx_queuecommand+0x80/0x3f0 [qla2xxx]
> LR [d000000004700a10] qla2xxx_queuecommand+0x50/0x3f0 [qla2xxx]
>
> (command does not return; adapter cannot be re-added)
>
>After:
>---
> <...>
> qla2xxx [001d:70:00.0]-801c:1: Abort command issued nexus=1:0:0 -- 1 2003.
> <...>
> qla2xxx [001d:70:00.1]-801c:2: Abort command issued nexus=2:3:0 -- 1 2003.
> <...>
>
> (command does return; adapter can be re-added correctly)
>
>
>Mauricio Faria de Oliveira (2):
> qla2xxx: do not queue commands when unloading
> qla2xxx: fix invalid DMA access after command aborts in PCI device
> remove
>
> drivers/scsi/qla2xxx/qla_os.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
>--
>1.8.3.1
>
Thanks for the patches. Series Looks Good.

Acked-by: Himanshu Madhani <[email protected]>

>

2016-11-09 00:15:34

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O

>>>>> "Mauricio" == Mauricio Faria de Oliveira <[email protected]> writes:

Mauricio> This patchset addresses a couple of errors that might happen
Mauricio> during PCI device remove (e.g., PCI hotplug, PowerVM DLPAR),
Mauricio> which prevent the successful removal and re-addition of the
Mauricio> adapter to the system, and cause an oops and/or invalid DMA
Mauricio> access (triggers an EEH event).

Applied to 4.9/scsi-fixes.

--
Martin K. Petersen Oracle Linux Engineering