2024-06-07 17:57:59

by TJ Adams

[permalink] [raw]
Subject: [RESEND][PATCH 0/3] small pm80xx driver fixes

These are 3 small patches to prevent a kernel crash, prevent ncq from
being turned off accidentally, and change some logs' levels. Resending
because I missed some recipients the first time.

Igor Pylypiv (2):
scsi: pm80xx: Set phy->enable_completion only when we wait for it
scsi: pm80xx: Do not issue hard reset before NCQ EH

Terrence Adams (1):
scsi: pm8001: Update log level when reading config table

drivers/scsi/pm8001/pm8001_hwi.c | 11 +++++++++++
drivers/scsi/pm8001/pm8001_sas.c | 4 +++-
drivers/scsi/pm8001/pm80xx_hwi.c | 6 +++---
3 files changed, 17 insertions(+), 4 deletions(-)

--
2.45.2.505.gda0bf45e8d-goog



2024-06-07 17:58:13

by TJ Adams

[permalink] [raw]
Subject: [PATCH 1/3] scsi: pm80xx: Set phy->enable_completion only when we wait for it

From: Igor Pylypiv <[email protected]>

pm8001_phy_control() populates the enable_completion pointer with a
stack address, sends a PHY_LINK_RESET / PHY_HARD_RESET, waits 300 ms,
and returns. The problem arises when a phy control response comes late.
After 300 ms the pm8001_phy_control() function returns and the passed
enable_completion stack address is no longer valid. Late phy control
response invokes complete() on a dangling enable_completion pointer
which leads to a kernel crash.

Signed-off-by: Igor Pylypiv <[email protected]>
Signed-off-by: Terrence Adams <[email protected]>
---
drivers/scsi/pm8001/pm8001_sas.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
index a5a31dfa4512..ee2da8e49d4c 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -166,7 +166,6 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
unsigned long flags;
pm8001_ha = sas_phy->ha->lldd_ha;
phy = &pm8001_ha->phy[phy_id];
- pm8001_ha->phy[phy_id].enable_completion = &completion;

if (PM8001_CHIP_DISP->fatal_errors(pm8001_ha)) {
/*
@@ -190,6 +189,7 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
rates->maximum_linkrate;
}
if (pm8001_ha->phy[phy_id].phy_state == PHY_LINK_DISABLE) {
+ pm8001_ha->phy[phy_id].enable_completion = &completion;
PM8001_CHIP_DISP->phy_start_req(pm8001_ha, phy_id);
wait_for_completion(&completion);
}
@@ -198,6 +198,7 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
break;
case PHY_FUNC_HARD_RESET:
if (pm8001_ha->phy[phy_id].phy_state == PHY_LINK_DISABLE) {
+ pm8001_ha->phy[phy_id].enable_completion = &completion;
PM8001_CHIP_DISP->phy_start_req(pm8001_ha, phy_id);
wait_for_completion(&completion);
}
@@ -206,6 +207,7 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
break;
case PHY_FUNC_LINK_RESET:
if (pm8001_ha->phy[phy_id].phy_state == PHY_LINK_DISABLE) {
+ pm8001_ha->phy[phy_id].enable_completion = &completion;
PM8001_CHIP_DISP->phy_start_req(pm8001_ha, phy_id);
wait_for_completion(&completion);
}
--
2.45.2.505.gda0bf45e8d-goog


2024-06-07 17:58:26

by TJ Adams

[permalink] [raw]
Subject: [PATCH 2/3] scsi: pm80xx: Do not issue hard reset before NCQ EH

From: Igor Pylypiv <[email protected]>

v6.2 commit 811be570a9a8 ("scsi: pm8001: Use sas_ata_device_link_abort()
to handle NCQ errors") removed duplicate NCQ EH from the pm80xx driver
and started relying on libata to handle the NCQ errors. The PM8006
controller has a special EH sequence that was added in v4.15 commit
869ddbdcae3b ("scsi: pm80xx: corrected SATA abort handling sequence.").
The special EH sequence issues a hard reset to a drive before libata EH
has a chance to read the NCQ log page. Libata EH gets confused by empty
NCQ log page which results in HSM violation. The failed command gets
retried a few times and each time fails with the same HSM violation.
Finally, libata decides to disable NCQ due to subsequent HSM vioaltions.

To avoid unwanted hard resets we can initiate abort all from the driver
to prevent libsas EH from calling lldd_abort_task()/pm8001_abort_task().

Signed-off-by: Igor Pylypiv <[email protected]>
Signed-off-by: Terrence Adams <[email protected]>
---
drivers/scsi/pm8001/pm8001_hwi.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c
index dec1e2d380f1..f19f76dc6e1c 100644
--- a/drivers/scsi/pm8001/pm8001_hwi.c
+++ b/drivers/scsi/pm8001/pm8001_hwi.c
@@ -1672,7 +1672,18 @@ void pm8001_work_fn(struct work_struct *work)
break;
case IO_XFER_ERROR_ABORTED_NCQ_MODE:
{
+ struct pm8001_hba_info *pm8001_ha = pw->pm8001_ha;
dev = pm8001_dev->sas_device;
+ /*
+ * pm8001_abort_task() issues a hard reset to a drive
+ * before libata EH has a chance to read the NCQ log page.
+ *
+ * Initiate abort all from the driver to prevent libsas EH
+ * from calling lldd_abort_task() / pm8001_abort_task().
+ */
+ if (pm8001_ha->chip_id == chip_8006)
+ sas_execute_internal_abort_dev(dev, 0, NULL);
+
sas_ata_device_link_abort(dev, false);
}
break;
--
2.45.2.505.gda0bf45e8d-goog


2024-06-07 17:58:40

by TJ Adams

[permalink] [raw]
Subject: [PATCH 3/3] scsi: pm8001: Update log level when reading config table

From: Terrence Adams <[email protected]>

Reading the main config table occurs as a part of initialization in
pm80xx_chip_init(). Because of this it makes more sense to have it be a
part of the INIT logging.

Signed-off-by: Terrence Adams <[email protected]>
---
drivers/scsi/pm8001/pm80xx_hwi.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index a52ae6841939..8fe886dc5e47 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -568,13 +568,13 @@ static void read_main_config_table(struct pm8001_hba_info *pm8001_ha)
pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version =
pm8001_mr32(address, MAIN_MPI_INACTIVE_FW_VERSION);

- pm8001_dbg(pm8001_ha, DEV,
+ pm8001_dbg(pm8001_ha, INIT,
"Main cfg table: sign:%x interface rev:%x fw_rev:%x\n",
pm8001_ha->main_cfg_tbl.pm80xx_tbl.signature,
pm8001_ha->main_cfg_tbl.pm80xx_tbl.interface_rev,
pm8001_ha->main_cfg_tbl.pm80xx_tbl.firmware_rev);

- pm8001_dbg(pm8001_ha, DEV,
+ pm8001_dbg(pm8001_ha, INIT,
"table offset: gst:%x iq:%x oq:%x int vec:%x phy attr:%x\n",
pm8001_ha->main_cfg_tbl.pm80xx_tbl.gst_offset,
pm8001_ha->main_cfg_tbl.pm80xx_tbl.inbound_queue_offset,
@@ -582,7 +582,7 @@ static void read_main_config_table(struct pm8001_hba_info *pm8001_ha)
pm8001_ha->main_cfg_tbl.pm80xx_tbl.int_vec_table_offset,
pm8001_ha->main_cfg_tbl.pm80xx_tbl.phy_attr_table_offset);

- pm8001_dbg(pm8001_ha, DEV,
+ pm8001_dbg(pm8001_ha, INIT,
"Main cfg table; ila rev:%x Inactive fw rev:%x\n",
pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version,
pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version);
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 09:20:36

by Niklas Cassel

[permalink] [raw]
Subject: Re: [PATCH 2/3] scsi: pm80xx: Do not issue hard reset before NCQ EH

Hello Igor, TJ,

On Fri, Jun 07, 2024 at 05:57:42PM +0000, TJ Adams wrote:
> From: Igor Pylypiv <[email protected]>
>
> v6.2 commit 811be570a9a8 ("scsi: pm8001: Use sas_ata_device_link_abort()

Do not specify kernel version (it is irrelevant), SHA1 is enough.


> to handle NCQ errors") removed duplicate NCQ EH from the pm80xx driver
> and started relying on libata to handle the NCQ errors. The PM8006
> controller has a special EH sequence that was added in v4.15 commit
> 869ddbdcae3b ("scsi: pm80xx: corrected SATA abort handling sequence.").

Do not specify kernel version (it is irrelevant), SHA1 is enough.

Since the code added in 869ddbdcae3b still exists in the pm80xx driver,
I think that you should mention the commits in chronological order.
(Right now you mention the oldest still existing code last, which seems
a bit backwards.)


> The special EH sequence issues a hard reset to a drive before libata EH
> has a chance to read the NCQ log page. Libata EH gets confused by empty
> NCQ log page which results in HSM violation. The failed command gets
> retried a few times and each time fails with the same HSM violation.
> Finally, libata decides to disable NCQ due to subsequent HSM vioaltions.

s/vioaltions/violations/

I'm not an expert in libsas EH, but I think that your commit message fails
to explain why this change actually fixes anything. You do not mention the
relationship between the code that you add pm8001_work_fn() and the
existing code in pm8001_abort_task(), and the order in which the functions
get executed.

Does calling sas_execute_internal_abort_dev() from pm8001_work_fn() ensure
that the libsas EH is never invoked? Or does it cancel the hard reset that
is part of the "special EH sequence" in pm8001_abort_task() ?

Wouldn't it be better if this was fixed in pm8001_abort_task() or similar
instead? It appears that the code you add to pm8001_work_fn() (that has a
very ugly if (pm8006)) is only there to undo or avoid the hard reset that
is done in pm8001_abort_task() (which also has a very ugly if (pm8006)).

Now we have this ugly if (pm8006) in two different functions... which
makes my "this could be solved in a nicer way" detector go off.

If this patch (as is) is really the way to go, then I think there should
be a more detailed reasoning why this change is the most sensible one.


Kind regards,
Niklas

2024-06-14 05:33:17

by Jinpu Wang

[permalink] [raw]
Subject: Re: [PATCH 1/3] scsi: pm80xx: Set phy->enable_completion only when we wait for it

On Fri, Jun 7, 2024 at 7:57 PM TJ Adams <[email protected]> wrote:
>
> From: Igor Pylypiv <[email protected]>
>
> pm8001_phy_control() populates the enable_completion pointer with a
> stack address, sends a PHY_LINK_RESET / PHY_HARD_RESET, waits 300 ms,
> and returns. The problem arises when a phy control response comes late.
> After 300 ms the pm8001_phy_control() function returns and the passed
> enable_completion stack address is no longer valid. Late phy control
> response invokes complete() on a dangling enable_completion pointer
> which leads to a kernel crash.
>
> Signed-off-by: Igor Pylypiv <[email protected]>
> Signed-off-by: Terrence Adams <[email protected]>
Acked-by: Jack Wang <[email protected]>
> ---
> drivers/scsi/pm8001/pm8001_sas.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c
> index a5a31dfa4512..ee2da8e49d4c 100644
> --- a/drivers/scsi/pm8001/pm8001_sas.c
> +++ b/drivers/scsi/pm8001/pm8001_sas.c
> @@ -166,7 +166,6 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
> unsigned long flags;
> pm8001_ha = sas_phy->ha->lldd_ha;
> phy = &pm8001_ha->phy[phy_id];
> - pm8001_ha->phy[phy_id].enable_completion = &completion;
>
> if (PM8001_CHIP_DISP->fatal_errors(pm8001_ha)) {
> /*
> @@ -190,6 +189,7 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
> rates->maximum_linkrate;
> }
> if (pm8001_ha->phy[phy_id].phy_state == PHY_LINK_DISABLE) {
> + pm8001_ha->phy[phy_id].enable_completion = &completion;
> PM8001_CHIP_DISP->phy_start_req(pm8001_ha, phy_id);
> wait_for_completion(&completion);
> }
> @@ -198,6 +198,7 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
> break;
> case PHY_FUNC_HARD_RESET:
> if (pm8001_ha->phy[phy_id].phy_state == PHY_LINK_DISABLE) {
> + pm8001_ha->phy[phy_id].enable_completion = &completion;
> PM8001_CHIP_DISP->phy_start_req(pm8001_ha, phy_id);
> wait_for_completion(&completion);
> }
> @@ -206,6 +207,7 @@ int pm8001_phy_control(struct asd_sas_phy *sas_phy, enum phy_func func,
> break;
> case PHY_FUNC_LINK_RESET:
> if (pm8001_ha->phy[phy_id].phy_state == PHY_LINK_DISABLE) {
> + pm8001_ha->phy[phy_id].enable_completion = &completion;
> PM8001_CHIP_DISP->phy_start_req(pm8001_ha, phy_id);
> wait_for_completion(&completion);
> }
> --
> 2.45.2.505.gda0bf45e8d-goog
>

2024-06-14 05:37:41

by Jinpu Wang

[permalink] [raw]
Subject: Re: [PATCH 3/3] scsi: pm8001: Update log level when reading config table

On Fri, Jun 7, 2024 at 7:57 PM TJ Adams <[email protected]> wrote:
>
> From: Terrence Adams <[email protected]>
>
> Reading the main config table occurs as a part of initialization in
> pm80xx_chip_init(). Because of this it makes more sense to have it be a
> part of the INIT logging.
>
> Signed-off-by: Terrence Adams <[email protected]>
Acked-by: Jack Wang <[email protected]>
> ---
> drivers/scsi/pm8001/pm80xx_hwi.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
> index a52ae6841939..8fe886dc5e47 100644
> --- a/drivers/scsi/pm8001/pm80xx_hwi.c
> +++ b/drivers/scsi/pm8001/pm80xx_hwi.c
> @@ -568,13 +568,13 @@ static void read_main_config_table(struct pm8001_hba_info *pm8001_ha)
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version =
> pm8001_mr32(address, MAIN_MPI_INACTIVE_FW_VERSION);
>
> - pm8001_dbg(pm8001_ha, DEV,
> + pm8001_dbg(pm8001_ha, INIT,
> "Main cfg table: sign:%x interface rev:%x fw_rev:%x\n",
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.signature,
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.interface_rev,
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.firmware_rev);
>
> - pm8001_dbg(pm8001_ha, DEV,
> + pm8001_dbg(pm8001_ha, INIT,
> "table offset: gst:%x iq:%x oq:%x int vec:%x phy attr:%x\n",
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.gst_offset,
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.inbound_queue_offset,
> @@ -582,7 +582,7 @@ static void read_main_config_table(struct pm8001_hba_info *pm8001_ha)
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.int_vec_table_offset,
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.phy_attr_table_offset);
>
> - pm8001_dbg(pm8001_ha, DEV,
> + pm8001_dbg(pm8001_ha, INIT,
> "Main cfg table; ila rev:%x Inactive fw rev:%x\n",
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version,
> pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version);
> --
> 2.45.2.505.gda0bf45e8d-goog
>

2024-06-14 21:43:02

by Igor Pylypiv

[permalink] [raw]
Subject: Re: [PATCH 2/3] scsi: pm80xx: Do not issue hard reset before NCQ EH

On Tue, Jun 11, 2024 at 10:40:48AM +0200, Niklas Cassel wrote:
> Hello Igor, TJ,
>
Hi Niklas,

Thank you for the feedback!

> On Fri, Jun 07, 2024 at 05:57:42PM +0000, TJ Adams wrote:
> > From: Igor Pylypiv <[email protected]>
> >
> > v6.2 commit 811be570a9a8 ("scsi: pm8001: Use sas_ata_device_link_abort()
>
> Do not specify kernel version (it is irrelevant), SHA1 is enough.
>
Noted.

>
> > to handle NCQ errors") removed duplicate NCQ EH from the pm80xx driver
> > and started relying on libata to handle the NCQ errors. The PM8006
> > controller has a special EH sequence that was added in v4.15 commit
> > 869ddbdcae3b ("scsi: pm80xx: corrected SATA abort handling sequence.").
>
> Do not specify kernel version (it is irrelevant), SHA1 is enough.
>
> Since the code added in 869ddbdcae3b still exists in the pm80xx driver,
> I think that you should mention the commits in chronological order.
> (Right now you mention the oldest still existing code last, which seems
> a bit backwards.)
>
Noted. I wanted to emphasise that newer commit 811be570a9a8 broke the NCQ EH
for pm8006 so I put it first. I should have added a Fixes tag to make it
clear.

>
> > The special EH sequence issues a hard reset to a drive before libata EH
> > has a chance to read the NCQ log page. Libata EH gets confused by empty
> > NCQ log page which results in HSM violation. The failed command gets
> > retried a few times and each time fails with the same HSM violation.
> > Finally, libata decides to disable NCQ due to subsequent HSM vioaltions.
>
> s/vioaltions/violations/
>
> I'm not an expert in libsas EH, but I think that your commit message fails
> to explain why this change actually fixes anything. You do not mention the
> relationship between the code that you add pm8001_work_fn() and the
> existing code in pm8001_abort_task(), and the order in which the functions
> get executed.
>
Noted, will update with more details.

> Does calling sas_execute_internal_abort_dev() from pm8001_work_fn() ensure
> that the libsas EH is never invoked? Or does it cancel the hard reset that
> is part of the "special EH sequence" in pm8001_abort_task() ?
>
It ensures that all I/Os are aborted before libsas EH kicks in. Since all
I/Os are aborted by the controller the pm8001_abort_task() will not be called.
Going to add that to commit message as well.

> Wouldn't it be better if this was fixed in pm8001_abort_task() or similar
> instead? It appears that the code you add to pm8001_work_fn() (that has a
> very ugly if (pm8006)) is only there to undo or avoid the hard reset that
> is done in pm8001_abort_task() (which also has a very ugly if (pm8006)).
>

It would definetely be better to fix this in pm8001_abort_task(), if possible.
One way would be to add a flag that will be set when NCQ error happens (when
IO_XFER_ERROR_ABORTED_NCQ_MODE event is received) and then check that flag
in pm8001_abort_task() to skip hard reset. This approach adds another type
of ugliness to the code and I'm not sure which of these ugly approaches is
less ugly.

> Now we have this ugly if (pm8006) in two different functions... which
> makes my "this could be solved in a nicer way" detector go off.
>

I would be very happy if we can drop those ugly if (pm8006) checks and have
a generic nice solution.

> If this patch (as is) is really the way to go, then I think there should
> be a more detailed reasoning why this change is the most sensible one.
>

Let me investigate this issue more to see if there is a way to drop
the ugly pm8006 checks.

Any ideas/suggestions on how to fix this nicely are very welcomed.

>
> Kind regards,
> Niklas

Thank you,
Igor