LinuxLists.cc - [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

2019-02-27 16:33:05

Subject: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

When this HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is offlined.
When reading the kernel log, the cause why the device got offlined isn't reported to the user.
This situation makes difficult for admins to estimate _why_ the volume got offlined.
Reading this part of the code makes clear this is because driver received a HARDWARE_ERROR/0x3e/0x1 which is a 'logical unit failure'.

This patch is just about reporting that fact to help admins making a relationship between this event and the offlining.

Signed-off-by: Erwan Velu <[email protected]>
---
drivers/scsi/smartpqi/smartpqi_init.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..89f37d76735c 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
sshdr.sense_key == HARDWARE_ERROR &&
sshdr.asc == 0x3e &&
sshdr.ascq == 0x1) {
+ struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+ struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+ dev_err(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+ ctrl_info->scsi_host->host_no, device->bus,
+ device->target, device->lun);
pqi_take_device_offline(scmd->device, "RAID");
host_byte = DID_NO_CONNECT;
}
--
2.20.1

2019-02-28 13:28:17

by Erwan Velu

[permalink] [raw]

Subject: Re: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

Hey,

That makes me wonder why the 0x3e / 0x2 isn't handled here aka

3E/02 DZTPROMAEBKVF TIMEOUT ON LOGICAL UNIT Is it possible the
controller send to the kernel this kind of message, if so shouldn't we
handle it here ? Erwan,

Le 27/02/2019 à 17:31, Erwan Velu a écrit :

> When this HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is offlined.
> When reading the kernel log, the cause why the device got offlined isn't reported to the user.
> This situation makes difficult for admins to estimate _why_ the volume got offlined.
> Reading this part of the code makes clear this is because driver received a HARDWARE_ERROR/0x3e/0x1 which is a 'logical unit failure'.
>
> This patch is just about reporting that fact to help admins making a relationship between this event and the offlining.
>
> Signed-off-by: Erwan Velu <[email protected]>
> ---
> drivers/scsi/smartpqi/smartpqi_init.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
> index f564af8949e8..89f37d76735c 100644
> --- a/drivers/scsi/smartpqi/smartpqi_init.c
> +++ b/drivers/scsi/smartpqi/smartpqi_init.c
> @@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
> sshdr.sense_key == HARDWARE_ERROR &&
> sshdr.asc == 0x3e &&
> sshdr.ascq == 0x1) {
> + struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
> + struct pqi_scsi_dev *device = scmd->device->hostdata;
> +
> + dev_err(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
> + ctrl_info->scsi_host->host_no, device->bus,
> + device->target, device->lun);
> pqi_take_device_offline(scmd->device, "RAID");
> host_byte = DID_NO_CONNECT;
> }

2019-02-28 23:13:55

by Elliott, Robert (Servers)

[permalink] [raw]

Subject: RE: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Erwan Velu
> Sent: Wednesday, February 27, 2019 10:32 AM
> Subject: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
>
> When this HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is offlined.
> When reading the kernel log, the cause why the device got offlined isn't reported to the user.
> This situation makes difficult for admins to estimate _why_ the volume got offlined.
> Reading this part of the code makes clear this is because driver received a HARDWARE_ERROR/0x3e/0x1
> which is a 'logical unit failure'.
>
> This patch is just about reporting that fact to help admins making a relationship between this event
> and the offlining.
>
> Signed-off-by: Erwan Velu <[email protected]>
> ---
> drivers/scsi/smartpqi/smartpqi_init.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
> index f564af8949e8..89f37d76735c 100644
> --- a/drivers/scsi/smartpqi/smartpqi_init.c
> +++ b/drivers/scsi/smartpqi/smartpqi_init.c
> @@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
> sshdr.sense_key == HARDWARE_ERROR &&
> sshdr.asc == 0x3e &&
> sshdr.ascq == 0x1) {
> + struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
> + struct pqi_scsi_dev *device = scmd->device->hostdata;
> +
> + dev_err(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller
> for scsi %d:%d:%d:%d\n",
> + ctrl_info->scsi_host->host_no, device->bus,
> + device->target, device->lun);
> pqi_take_device_offline(scmd->device, "RAID");
> host_byte = DID_NO_CONNECT;
> }

Be careful printing errors per-IO; you could get thousands of them if things go bad.
The block layer print_req_error() uses printk_ratelimited(KERN_ERR) for that reason,
and the SCSI layer scsi_io_completion_action() maintains a ratelimit on its own.

The dev_err_ratelimited() macro might be a good fit here.

---
Robert Elliott, HPE Persistent Memory

2019-03-01 15:01:31

by Erwan Velu

[permalink] [raw]

Subject: Re: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

[...]
> Be careful printing errors per-IO; you could get thousands of them if things go bad.
> The block layer print_req_error() uses printk_ratelimited(KERN_ERR) for that reason,
> and the SCSI layer scsi_io_completion_action() maintains a ratelimit on its own.
>
> The dev_err_ratelimited() macro might be a good fit here.

Thanks for the tip. I updated the patch and send a V2 for that.

I adjusted also to commit message to make it more explicit.

Thanks for the review.

Erwan,

2019-03-01 15:01:36

by Erwan Velu

[permalink] [raw]

Subject: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is offlined.
When reading the kernel log, the reason why the device got offlined isn't reported to the user.
This situation makes difficult for admins to estimate the root cause of the issue they analize.

Reading this part of the code makes clear this is because driver received a HARDWARE_ERROR/0x3e/0x1 which is a 'logical unit failure'.
This patch is just about reporting the reason behind the offlining to ease the analyse.

Signed-off-by: Erwan Velu <[email protected]>
---
drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..dfc4a6813440 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
sshdr.sense_key == HARDWARE_ERROR &&
sshdr.asc == 0x3e &&
sshdr.ascq == 0x1) {
+ struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+ struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+ dev_err_ratelimited(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+ ctrl_info->scsi_host->host_no, device->bus,
+ device->target, device->lun);
pqi_take_device_offline(scmd->device, "RAID");
host_byte = DID_NO_CONNECT;
}
--
2.20.1

2019-03-01 15:33:49

by James Bottomley

[permalink] [raw]

Subject: Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'

On Fri, 2019-03-01 at 15:58 +0100, Erwan Velu wrote:
> + dev_err_ratelimited(&ctrl_info->pci_dev-
> >dev, "received 'logical unit failure' from controller for scsi
> %d:%d:%d:%d\n",
> + ctrl_info-
> >scsi_host->host_no, device->bus,
> + device-
> >target, device->lun);

Shouldn't this be a variant of sdev/scmd_printk? Otherwise it tells
you what disk in the array terms is the problem but not what device in
your actual system is affected.

James

2019-03-01 15:46:43

by Erwan Velu

[permalink] [raw]

Subject: Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'

Le 01/03/2019 à 16:26, James Bottomley a écrit :
> [...]
> Shouldn't this be a variant of sdev/scmd_printk? Otherwise it tells
> you what disk in the array terms is the problem but not what device in
> your actual system is affected.

Hey James,

My initial take on that was that pqi_take_device_offline(), which is
called just after, will print the "re-scanning " message with the same
format.

As they will be both printed in the same error context and one after the
other, I though that would make sense to represent the same information
to ease the reading like cause -> consequence.

As the message is about the LUN itself, which is reported faulty, I
though it would worth reporting the info that way.

Shall I consider printing also the disk name in addition ?

Erwan,

2019-03-01 15:58:46

by James Bottomley

[permalink] [raw]

Subject: Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'

On Fri, 2019-03-01 at 15:43 +0000, Erwan Velu wrote:
> Le 01/03/2019 à 16:26, James Bottomley a écrit :
> > [...]
> > Shouldn't this be a variant of sdev/scmd_printk? Otherwise it
> > tells
> > you what disk in the array terms is the problem but not what device
> > in
> > your actual system is affected.
>
> Hey James,
>
> My initial take on that was that pqi_take_device_offline(), which is
> called just after, will print the "re-scanning " message with the
> same
> format.
>
> As they will be both printed in the same error context and one after
> the
> other, I though that would make sense to represent the same
> information
> to ease the reading like cause -> consequence.
>
> As the message is about the LUN itself, which is reported faulty, I
> though it would worth reporting the info that way.
>
> Shall I consider printing also the disk name in addition ?

I was thinking just

if (printk_ratelimit())
scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n", ...

That will give all the necessary information

James

2019-03-01 16:02:28

by Erwan Velu

[permalink] [raw]

Subject: Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'

Le 01/03/2019 à 16:56, James Bottomley a écrit :
> [...]
> I was thinking just
>
> if (printk_ratelimit())
> scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n", ...
>
> That will give all the necessary information

I'm pretty new to this area, learning from you ;o)

I'll update the v3 this way.

Thanks for the review.

Erwan,

2019-03-01 16:09:57

by Erwan Velu

[permalink] [raw]

Subject: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is offlined.
When reading the kernel log, the reason why the device got offlined isn't reported to the user.
This situation makes difficult for admins to estimate the root cause of the issue they analize.

Reading this part of the code makes clear this is because driver received a HARDWARE_ERROR/0x3e/0x1 which is a 'logical unit failure'.
This patch is just about reporting the reason behind the offlining to ease the analyse.

Signed-off-by: Erwan Velu <[email protected]>
---
drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..adebafe56b5b 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
sshdr.sense_key == HARDWARE_ERROR &&
sshdr.asc == 0x3e &&
sshdr.ascq == 0x1) {
+ struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+ struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+ if (printk_ratelimit())
+ scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+ ctrl_info->scsi_host->host_no, device->bus, device->target, device->lun);
pqi_take_device_offline(scmd->device, "RAID");
host_byte = DID_NO_CONNECT;
}
--
2.20.1

2019-03-05 23:13:04

by Don Brace

[permalink] [raw]

Subject: RE: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Erwan Velu
Sent: Friday, March 1, 2019 10:08 AM
Cc: Erwan Velu <[email protected]>; Don Brace <[email protected]>; James E.J. Bottomley <[email protected]>; Martin K. Petersen <[email protected]>; open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi) <[email protected]>; open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi) <[email protected]>; open list <[email protected]>
Subject: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is offlined.
When reading the kernel log, the reason why the device got offlined isn't reported to the user.
This situation makes difficult for admins to estimate the root cause of the issue they analize.

Reading this part of the code makes clear this is because driver received a HARDWARE_ERROR/0x3e/0x1 which is a 'logical unit failure'.
This patch is just about reporting the reason behind the offlining to ease the analyse.

Signed-off-by: Erwan Velu <[email protected]>

Acked-by: Don Brace <[email protected]>

---
drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..adebafe56b5b 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
sshdr.sense_key == HARDWARE_ERROR &&
sshdr.asc == 0x3e &&
sshdr.ascq == 0x1) {
+ struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+ struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+ if (printk_ratelimit())
+ scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+ ctrl_info->scsi_host->host_no, device->bus, device->target, device->lun);
pqi_take_device_offline(scmd->device, "RAID");
host_byte = DID_NO_CONNECT;
}
--
2.20.1

2019-03-06 20:04:15

by Martin K. Petersen

[permalink] [raw]

Subject: Re: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

Erwan,

> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume
> is offlined. When reading the kernel log, the reason why the device
> got offlined isn't reported to the user. This situation makes
> difficult for admins to estimate the root cause of the issue they
> analize.

Applied to 5.1/scsi-queue, thanks.

--
Martin K. Petersen Oracle Linux Engineering

2019-03-11 16:38:16

by Erwan Velu

[permalink] [raw]

Subject: Re: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

2019-03-11 16:44:16

by Erwan Velu

[permalink] [raw]

Subject: Re: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

Le 06/03/2019 à 18:34, Martin K. Petersen a écrit :
> Erwan,
>
>> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume
>> is offlined. When reading the kernel log, the reason why the device
>> got offlined isn't reported to the user. This situation makes
>> difficult for admins to estimate the root cause of the issue they
>> analize.

While I was debugging this scenario, I was wondering if some other cases
were possible.

The current code is considering (sshdr.asc == 0x3e && sshdr.ascq ==
0x1), but what if ascq have a different value here ?

The specification (http://www.t10.org/lists/asc-num.htm#ASC_3E) reports
other sub-values like ASCQ==02 which means a timeout on the lun.

So, does the raid controllers supported by smartpqi can generates these
other values ? If so, how/where are they handled ?

I was considering at least, to a switch statement on sshdr.ascq with a
0x1 case on the current code and a a default one that prints at least a
message saying that a message got received but not handled.

Thanks !

Erwan,