2020-05-09 04:38:39

by Luis Chamberlain

[permalink] [raw]
Subject: [PATCH 09/15] qed: use new module_firmware_crashed()

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Ariel Elior <[email protected]>
Cc: [email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/ethernet/qlogic/qed/qed_debug.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index f4eebaabb6d0..9cc6287b889b 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7854,6 +7854,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
REGDUMP_HEADER_SIZE,
&feature_size);
if (!rc) {
+ module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, PROTECTION_OVERRIDE,
cur_engine,
@@ -7869,6 +7870,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
rc = qed_dbg_fw_asserts(cdev, (u8 *)buffer + offset +
REGDUMP_HEADER_SIZE, &feature_size);
if (!rc) {
+ module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, FW_ASSERTS,
cur_engine, feature_size,
@@ -7906,6 +7908,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
REGDUMP_HEADER_SIZE, &feature_size);
if (!rc) {
+ module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, GRC_DUMP,
cur_engine,
--
2.25.1


2020-05-09 06:36:21

by Igor Russkikh

[permalink] [raw]
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()


> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
>
> Using a taint flag allows us to annotate when this happens clearly.
>
> Cc: Ariel Elior <[email protected]>
> Cc: [email protected]
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> drivers/net/ethernet/qlogic/qed/qed_debug.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> index f4eebaabb6d0..9cc6287b889b 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> @@ -7854,6 +7854,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> *buffer)
> REGDUMP_HEADER_SIZE,
> &feature_size);
> if (!rc) {
> + module_firmware_crashed();
> *(u32 *)((u8 *)buffer + offset) =
> qed_calc_regdump_header(cdev,
> PROTECTION_OVERRIDE,
> cur_engine,
> @@ -7869,6 +7870,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> *buffer)
> rc = qed_dbg_fw_asserts(cdev, (u8 *)buffer + offset +
> REGDUMP_HEADER_SIZE,
> &feature_size);
> if (!rc) {
> + module_firmware_crashed();
> *(u32 *)((u8 *)buffer + offset) =
> qed_calc_regdump_header(cdev, FW_ASSERTS,
> cur_engine,
> feature_size,
> @@ -7906,6 +7908,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> *buffer)
> rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
> REGDUMP_HEADER_SIZE, &feature_size);
> if (!rc) {
> + module_firmware_crashed();
> *(u32 *)((u8 *)buffer + offset) =
> qed_calc_regdump_header(cdev, GRC_DUMP,
> cur_engine,


Hi Luis,

qed_dbg_all_data is being used to gather debug dump from device. Failures
inside it may happen due to various reasons, but they normally do not indicate
FW failure.

So I think its not a good place to insert this call.

Its hard to find exact good place to insert it in qed.

One more thing is that AFAIU taint flag gets permanent on kernel, but for
example our device can recover itself from some FW crashes, thus it'd be
transparent for user.

Whats the logical purpose of module_firmware_crashed? Does it mean fatal
unrecoverable error on device?

Thanks,
Igor

2020-05-09 16:44:13

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()

On Sat, May 09, 2020 at 09:32:51AM +0300, Igor Russkikh wrote:
>
> > This makes use of the new module_firmware_crashed() to help
> > annotate when firmware for device drivers crash. When firmware
> > crashes devices can sometimes become unresponsive, and recovery
> > sometimes requires a driver unload / reload and in the worst cases
> > a reboot.
> >
> > Using a taint flag allows us to annotate when this happens clearly.
> >
> > Cc: Ariel Elior <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Luis Chamberlain <[email protected]>
> > ---
> > drivers/net/ethernet/qlogic/qed/qed_debug.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > index f4eebaabb6d0..9cc6287b889b 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > +++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > @@ -7854,6 +7854,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> > *buffer)
> > REGDUMP_HEADER_SIZE,
> > &feature_size);
> > if (!rc) {
> > + module_firmware_crashed();
> > *(u32 *)((u8 *)buffer + offset) =
> > qed_calc_regdump_header(cdev,
> > PROTECTION_OVERRIDE,
> > cur_engine,
> > @@ -7869,6 +7870,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> > *buffer)
> > rc = qed_dbg_fw_asserts(cdev, (u8 *)buffer + offset +
> > REGDUMP_HEADER_SIZE,
> > &feature_size);
> > if (!rc) {
> > + module_firmware_crashed();
> > *(u32 *)((u8 *)buffer + offset) =
> > qed_calc_regdump_header(cdev, FW_ASSERTS,
> > cur_engine,
> > feature_size,
> > @@ -7906,6 +7908,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> > *buffer)
> > rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
> > REGDUMP_HEADER_SIZE, &feature_size);
> > if (!rc) {
> > + module_firmware_crashed();
> > *(u32 *)((u8 *)buffer + offset) =
> > qed_calc_regdump_header(cdev, GRC_DUMP,
> > cur_engine,
>
>
> Hi Luis,
>
> qed_dbg_all_data is being used to gather debug dump from device. Failures
> inside it may happen due to various reasons, but they normally do not indicate
> FW failure.
>
> So I think its not a good place to insert this call.
> Its hard to find exact good place to insert it in qed.

Is there a way to check if what happened was indeed a fw crash?

> One more thing is that AFAIU taint flag gets permanent on kernel, but for
> example our device can recover itself from some FW crashes, thus it'd be
> transparent for user.

Similar things are *supposed* to recoverable with other device, however
this can also sometimes lead to a situation where devices are not usable
anymore, and require a full driver unload / load.

> Whats the logical purpose of module_firmware_crashed? Does it mean fatal
> unrecoverable error on device?

Its just to annotate on the module and kernel that this has happened.

I take it you may agree that, firmware crashing *often* is not good design,
and these issues should be reported to / fixed by vendors. In cases
where driver bugs are reported it is good to see if a firmware crash has
happened before, so that during analysis this is ruled out.

Luis

2020-05-12 16:26:48

by Igor Russkikh

[permalink] [raw]
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()


>> So I think its not a good place to insert this call.
>> Its hard to find exact good place to insert it in qed.
>
> Is there a way to check if what happened was indeed a fw crash?

Our driver has two firmwares (slowpath and fastpath).
For slowpath firmware the way to understand it crashed is to observe command
response timeout. This is in qed_mcp.c, around "The MFW failed to respond to
command" traceout.

For fastpath this is tricky, think you may leave the above place as the only
place to invoke module_firmware_crashed()

>
>> One more thing is that AFAIU taint flag gets permanent on kernel, but
> for
>> example our device can recover itself from some FW crashes, thus it'd be
>> transparent for user.
>
> Similar things are *supposed* to recoverable with other device, however
> this can also sometimes lead to a situation where devices are not usable
> anymore, and require a full driver unload / load.
>
>> Whats the logical purpose of module_firmware_crashed? Does it mean fatal
>> unrecoverable error on device?
>
> Its just to annotate on the module and kernel that this has happened.
>
> I take it you may agree that, firmware crashing *often* is not good
> design,
> and these issues should be reported to / fixed by vendors. In cases
> where driver bugs are reported it is good to see if a firmware crash has
> happened before, so that during analysis this is ruled out.

Probably, but still I see some misalignment here, in sense that taint is about
the kernel state, not about a hardware state indication.

devlink health could really be a much better candidate for such things.

Regards
Igor

2020-05-12 17:38:24

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()

On Tue, May 12, 2020 at 07:23:28PM +0300, Igor Russkikh wrote:
>
> >> So I think its not a good place to insert this call.
> >> Its hard to find exact good place to insert it in qed.
> >
> > Is there a way to check if what happened was indeed a fw crash?
>
> Our driver has two firmwares (slowpath and fastpath).
> For slowpath firmware the way to understand it crashed is to observe command
> response timeout. This is in qed_mcp.c, around "The MFW failed to respond to
> command" traceout.

Ok thanks.

> For fastpath this is tricky, think you may leave the above place as the only
> place to invoke module_firmware_crashed()

So do you mean like the changes below?

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index f4eebaabb6d0..95cb7da2542e 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7906,6 +7906,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
REGDUMP_HEADER_SIZE, &feature_size);
if (!rc) {
+ module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, GRC_DUMP,
cur_engine,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 280527cc0578..a818cf09dccf 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -566,6 +566,7 @@ _qed_mcp_cmd_and_union(struct qed_hwfn *p_hwfn,
DP_NOTICE(p_hwfn,
"The MFW failed to respond to command 0x%08x [param 0x%08x].\n",
p_mb_params->cmd, p_mb_params->param);
+ module_firmware_crashed();
qed_mcp_print_cpu_info(p_hwfn, p_ptt);

spin_lock_bh(&p_hwfn->mcp_info->cmd_lock);

> >> One more thing is that AFAIU taint flag gets permanent on kernel, but
> > for
> >> example our device can recover itself from some FW crashes, thus it'd be
> >> transparent for user.
> >
> > Similar things are *supposed* to recoverable with other device, however
> > this can also sometimes lead to a situation where devices are not usable
> > anymore, and require a full driver unload / load.
> >
> >> Whats the logical purpose of module_firmware_crashed? Does it mean fatal
> >> unrecoverable error on device?
> >
> > Its just to annotate on the module and kernel that this has happened.
> >
> > I take it you may agree that, firmware crashing *often* is not good
> > design,
> > and these issues should be reported to / fixed by vendors. In cases
> > where driver bugs are reported it is good to see if a firmware crash has
> > happened before, so that during analysis this is ruled out.
>
> Probably, but still I see some misalignment here, in sense that taint is about
> the kernel state, not about a hardware state indication.

The kernel carries the driver though, and the driver / subsystem can
often times act strange when this happens.

> devlink health could really be a much better candidate for such things.

That sounds fantastic, please Cc me on patches! However I still believe
we should register this event in the kernel for support purposes.

Luis

2020-05-14 14:56:58

by Igor Russkikh

[permalink] [raw]
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()

>
> So do you mean like the changes below?
>
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> index f4eebaabb6d0..95cb7da2542e 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> @@ -7906,6 +7906,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> *buffer)
> rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
> REGDUMP_HEADER_SIZE, &feature_size);
> if (!rc) {
> + module_firmware_crashed();
> *(u32 *)((u8 *)buffer + offset) =
> qed_calc_regdump_header(cdev, GRC_DUMP,
> cur_engine,

Please remove this invocation. Its not a place where FW crash is happening.


> DP_NOTICE(p_hwfn,
> "The MFW failed to respond to command 0x%08x
> [param 0x%08x].\n",
> p_mb_params->cmd, p_mb_params->param);
> + module_firmware_crashed();
> qed_mcp_print_cpu_info(p_hwfn, p_ptt);

This one is perfect, thanks!

Regards
Igor

2020-05-15 20:36:21

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()

On Thu, May 14, 2020 at 05:53:41PM +0300, Igor Russkikh wrote:
> >
> > So do you mean like the changes below?
> >
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > index f4eebaabb6d0..95cb7da2542e 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > +++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
> > @@ -7906,6 +7906,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void
> > *buffer)
> > rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
> > REGDUMP_HEADER_SIZE, &feature_size);
> > if (!rc) {
> > + module_firmware_crashed();
> > *(u32 *)((u8 *)buffer + offset) =
> > qed_calc_regdump_header(cdev, GRC_DUMP,
> > cur_engine,
>
> Please remove this invocation. Its not a place where FW crash is happening.

Will do!

>
> > DP_NOTICE(p_hwfn,
> > "The MFW failed to respond to command 0x%08x
> > [param 0x%08x].\n",
> > p_mb_params->cmd, p_mb_params->param);
> > + module_firmware_crashed();
> > qed_mcp_print_cpu_info(p_hwfn, p_ptt);
>
> This one is perfect, thanks!

Can I add your Reviewed-by provided I remove the hunk above?

Luis

2020-05-15 20:41:41

by Igor Russkikh

[permalink] [raw]
Subject: RE: [EXT] [PATCH 09/15] qed: use new module_firmware_crashed()


> >
> > This one is perfect, thanks!
>
> Can I add your Reviewed-by provided I remove the hunk above?
>
> Luis

Sure, I'm OK.

Regards,
Igor