2020-05-15 21:30:35

by Luis Chamberlain

[permalink] [raw]
Subject: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: [email protected]
Cc: [email protected]
Cc: Kalle Valo <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/wireless/ath/ath10k/pci.c | 2 ++
drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
drivers/net/wireless/ath/ath10k/snoc.c | 1 +
3 files changed, 5 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..6bd0f3b518b9 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
scnprintf(guid, sizeof(guid), "n/a");

ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+ module_firmware_crashed();
ath10k_print_driver_info(ar);
ath10k_pci_dump_registers(ar, crash_data);
ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
if (ret) {
if (ath10k_pci_has_fw_crashed(ar)) {
ath10k_warn(ar, "firmware crashed during chip reset\n");
+ module_firmware_crashed();
ath10k_pci_fw_crashed_clear(ar);
ath10k_pci_fw_crashed_dump(ar);
}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..d34ad289380f 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)

/* TODO: Add firmware crash handling */
ath10k_warn(ar, "firmware crashed\n");
+ module_firmware_crashed();

/* read counter to clear the interrupt, the debug error interrupt is
* counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
ath10k_err(ar, "firmware crashed!\n");
queue_work(ar->workqueue, &ar->restart_work);
+ module_firmware_crashed();
}
return ret;
}
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..7cfc123c345c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
scnprintf(guid, sizeof(guid), "n/a");

ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+ module_firmware_crashed();
ath10k_print_driver_info(ar);
ath10k_msa_dump_memory(ar, crash_data);
mutex_unlock(&ar->dump_mutex);
--
2.26.2


2020-05-16 04:13:59

by Rafael Aquini

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Fri, May 15, 2020 at 09:28:43PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
>
> Using a taint flag allows us to annotate when this happens clearly.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Kalle Valo <[email protected]>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> drivers/net/wireless/ath/ath10k/pci.c | 2 ++
> drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
> drivers/net/wireless/ath/ath10k/snoc.c | 1 +
> 3 files changed, 5 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 1d941d53fdc9..6bd0f3b518b9 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
> scnprintf(guid, sizeof(guid), "n/a");
>
> ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> + module_firmware_crashed();
> ath10k_print_driver_info(ar);
> ath10k_pci_dump_registers(ar, crash_data);
> ath10k_ce_dump_registers(ar, crash_data);
> @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
> if (ret) {
> if (ath10k_pci_has_fw_crashed(ar)) {
> ath10k_warn(ar, "firmware crashed during chip reset\n");
> + module_firmware_crashed();
> ath10k_pci_fw_crashed_clear(ar);
> ath10k_pci_fw_crashed_dump(ar);
> }
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index e2aff2254a40..d34ad289380f 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
>
> /* TODO: Add firmware crash handling */
> ath10k_warn(ar, "firmware crashed\n");
> + module_firmware_crashed();
>
> /* read counter to clear the interrupt, the debug error interrupt is
> * counter 0.
> @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
> if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
> ath10k_err(ar, "firmware crashed!\n");
> queue_work(ar->workqueue, &ar->restart_work);
> + module_firmware_crashed();
> }
> return ret;
> }
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
> index 354d49b1cd45..7cfc123c345c 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
> scnprintf(guid, sizeof(guid), "n/a");
>
> ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> + module_firmware_crashed();
> ath10k_print_driver_info(ar);
> ath10k_msa_dump_memory(ar, crash_data);
> mutex_unlock(&ar->dump_mutex);
> --
> 2.26.2
>
Acked-by: Rafael Aquini <[email protected]>

2020-05-16 13:30:37

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed

You didn't CC me or the wireless list on the rest of the patches, so I'm
replying to a random one, but ...

What is the point here?

This should in no way affect the integrity of the system/kernel, for
most devices anyway.

So what if ath10k's firmware crashes? If there's a driver bug it will
not handle it right (and probably crash, WARN_ON, or something else),
but if the driver is working right then that will not affect the kernel
at all.

So maybe I can understand that maybe you want an easy way to discover -
per device - that the firmware crashed, but that still doesn't warrant a
complete kernel taint.

Instead of the kernel taint, IMHO you should provide an annotation in
sysfs (or somewhere else) for the *struct device* that had its firmware
crash. Or maybe, if it's too complex to walk the entire hierarchy
checking for that, have a uevent, or add the ability for the kernel to
print out elsewhere in debugfs the list of devices that crashed at some
point... All of that is fine, but a kernel taint?

johannes

2020-05-16 13:54:13

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:

> Instead of the kernel taint, IMHO you should provide an annotation in
> sysfs (or somewhere else) for the *struct device* that had its firmware
> crash. Or maybe, if it's too complex to walk the entire hierarchy
> checking for that, have a uevent, or add the ability for the kernel to
> print out elsewhere in debugfs the list of devices that crashed at some

I mean sysfs, oops.


In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
detect that the device is really wedged enough that the only way we can
still try to recover is by completely unbinding the driver from it, then
we give userspace a uevent for that. I don't remember exactly how and
where that gets used (ChromeOS) though, but it'd be nice to have that
sort of thing as part of the infrastructure, in a sort of two-level
notification?

Level 1: firmware crashed, but we're recovering, at least mostly, and
it's more informational

Level 2: device is wedged, going to try to recover by some more forceful
means (perhaps some devices can be power-cycled? etc.) but (more) state
would be lost in these cases?

Still don't think a kernel taint is appropriate for either of these.

johannes

2020-05-18 16:58:14

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Sat, May 16, 2020 at 03:50:55PM +0200, Johannes Berg wrote:
> On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:
>
> > Instead of the kernel taint, IMHO you should provide an annotation in
> > sysfs (or somewhere else) for the *struct device* that had its firmware
> > crash. Or maybe, if it's too complex to walk the entire hierarchy
> > checking for that, have a uevent, or add the ability for the kernel to
> > print out elsewhere in debugfs the list of devices that crashed at some
>
> I mean sysfs, oops.
>
>
> In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> detect that the device is really wedged enough that the only way we can
> still try to recover is by completely unbinding the driver from it, then
> we give userspace a uevent for that.

Nice! Indeed a uevent is in order for these sorts of things, and I'd
argue that it begs the question if we should even uevent for any taint
as well. Today these are silent. If the kernel crashes, today we only
give userspace a log.

> I don't remember exactly how and
> where that gets used (ChromeOS) though, but it'd be nice to have that
> sort of thing as part of the infrastructure, in a sort of two-level
> notification?
>
> Level 1: firmware crashed, but we're recovering, at least mostly, and
> it's more informational
>
> Level 2: device is wedged, going to try to recover by some more forceful
> means (perhaps some devices can be power-cycled? etc.) but (more) state
> would be lost in these cases?

I agree that *all* this would be ideal. I don't see this as mutually
exclusive with a taint on the kernel and module for the device.

> Still don't think a kernel taint is appropriate for either of these.

From a support perspective, I do think it is vital quick information.

Luis

2020-05-18 17:10:22

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
>
>
> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > >
> > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > replying to a random one, but ...
> > >
> > > What is the point here?
> > >
> > > This should in no way affect the integrity of the system/kernel, for
> > > most devices anyway.
> >
> > Keyword you used here is "most device". And in the worst case, *who*
> > knows what other odd things may happen afterwards.
> >
> > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > not handle it right (and probably crash, WARN_ON, or something else),
> > > but if the driver is working right then that will not affect the kernel
> > > at all.
> >
> > Sometimes the device can go into a state which requires driver removal
> > and addition to get things back up.
>
> It would be lovely to be able to detect this case in the driver/system
> somehow! I haven't seen any such cases recently,

I assure you that I have run into it. Once it does again I'll report
the crash, but the problem with some of this is that unless you scrape
the log you won't know. Eventually, a uevent would indeed tell inform
me.

> but in case there is
> some common case you see, maybe we can think of a way to detect it?

ath10k is just one case, this patch series addresses a simple way to
annotate this tree-wide.

> > > So maybe I can understand that maybe you want an easy way to discover -
> > > per device - that the firmware crashed, but that still doesn't warrant a
> > > complete kernel taint.
> >
> > That is one reason, another is that a taint helps support cases *fast*
> > easily detect if the issue was a firmware crash, instead of scraping
> > logs for driver specific ways to say the firmware has crashed.
>
> You can listen for udev events (I think that is the right term),
> and find crashes that way. You get the actual crash info as well.

My follow up to this was to add uevent to add_taint() as well, this way
these could generically be processed by userspace.

Luis

2020-05-18 17:16:44

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()



On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
>>
>>
>> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
>>> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>>>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
>>>>
>>>> You didn't CC me or the wireless list on the rest of the patches, so I'm
>>>> replying to a random one, but ...
>>>>
>>>> What is the point here?
>>>>
>>>> This should in no way affect the integrity of the system/kernel, for
>>>> most devices anyway.
>>>
>>> Keyword you used here is "most device". And in the worst case, *who*
>>> knows what other odd things may happen afterwards.
>>>
>>>> So what if ath10k's firmware crashes? If there's a driver bug it will
>>>> not handle it right (and probably crash, WARN_ON, or something else),
>>>> but if the driver is working right then that will not affect the kernel
>>>> at all.
>>>
>>> Sometimes the device can go into a state which requires driver removal
>>> and addition to get things back up.
>>
>> It would be lovely to be able to detect this case in the driver/system
>> somehow! I haven't seen any such cases recently,
>
> I assure you that I have run into it. Once it does again I'll report
> the crash, but the problem with some of this is that unless you scrape
> the log you won't know. Eventually, a uevent would indeed tell inform
> me.
>
>> but in case there is
>> some common case you see, maybe we can think of a way to detect it?
>
> ath10k is just one case, this patch series addresses a simple way to
> annotate this tree-wide.
>
>>>> So maybe I can understand that maybe you want an easy way to discover -
>>>> per device - that the firmware crashed, but that still doesn't warrant a
>>>> complete kernel taint.
>>>
>>> That is one reason, another is that a taint helps support cases *fast*
>>> easily detect if the issue was a firmware crash, instead of scraping
>>> logs for driver specific ways to say the firmware has crashed.
>>
>> You can listen for udev events (I think that is the right term),
>> and find crashes that way. You get the actual crash info as well.
>
> My follow up to this was to add uevent to add_taint() as well, this way
> these could generically be processed by userspace.

I'm not opposed to the taint, though I have not thought much on it.

But, if you can already get the crash info from uevent, and it automatically
comes without polling or scraping logs, then what benefit beyond that does
the taint give you?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-05-18 17:19:50

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
>
>
> On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > >
> > >
> > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > > >
> > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > > > replying to a random one, but ...
> > > > >
> > > > > What is the point here?
> > > > >
> > > > > This should in no way affect the integrity of the system/kernel, for
> > > > > most devices anyway.
> > > >
> > > > Keyword you used here is "most device". And in the worst case, *who*
> > > > knows what other odd things may happen afterwards.
> > > >
> > > > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > > > not handle it right (and probably crash, WARN_ON, or something else),
> > > > > but if the driver is working right then that will not affect the kernel
> > > > > at all.
> > > >
> > > > Sometimes the device can go into a state which requires driver removal
> > > > and addition to get things back up.
> > >
> > > It would be lovely to be able to detect this case in the driver/system
> > > somehow! I haven't seen any such cases recently,
> >
> > I assure you that I have run into it. Once it does again I'll report
> > the crash, but the problem with some of this is that unless you scrape
> > the log you won't know. Eventually, a uevent would indeed tell inform
> > me.
> >
> > > but in case there is
> > > some common case you see, maybe we can think of a way to detect it?
> >
> > ath10k is just one case, this patch series addresses a simple way to
> > annotate this tree-wide.
> >
> > > > > So maybe I can understand that maybe you want an easy way to discover -
> > > > > per device - that the firmware crashed, but that still doesn't warrant a
> > > > > complete kernel taint.
> > > >
> > > > That is one reason, another is that a taint helps support cases *fast*
> > > > easily detect if the issue was a firmware crash, instead of scraping
> > > > logs for driver specific ways to say the firmware has crashed.
> > >
> > > You can listen for udev events (I think that is the right term),
> > > and find crashes that way. You get the actual crash info as well.
> >
> > My follow up to this was to add uevent to add_taint() as well, this way
> > these could generically be processed by userspace.
>
> I'm not opposed to the taint, though I have not thought much on it.
>
> But, if you can already get the crash info from uevent, and it automatically
> comes without polling or scraping logs, then what benefit beyond that does
> the taint give you?

From a support perspective it is a *crystal* clear sign that the device
and / or device driver may be in a very bad state, in a generic way.

Luis

2020-05-18 20:01:05

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 11:06:27AM -0700, Steve deRosier wrote:
> On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <[email protected]> wrote:
> > From a support perspective it is a *crystal* clear sign that the device
> > and / or device driver may be in a very bad state, in a generic way.
> >
>
> Unfortunately a "taint" is interpreted by many users as: "your kernel
> is really F#*D up, you better do something about it right now."
> Assuming they're paying attention at all in the first place of course.

Taint historically has been used and still is today to help rule out
whether or not you get support, or how you get support.

For instance, a staging driver is not supported by some upstream
developers, but it will be by those who help staging and Greg. TAINT_CRAP
cannot be even more clear.

So, no, it is not just about "hey your kernel is messed up", there are
clear support boundaries being drawn.

> The fact is, WiFi chip firmware crashes, and in most cases the driver
> is able to recover seamlessly. At least that is the case with most QCA
> chipsets I work with.

That has not been my exerience with the same driver, and so how do we
know? And this patch set is not about ath10k alone, I want you to
think about *all* device drivers with firmware. In my journey to scrape
the kernel for these cases I was very surprised by the amount of code
which clearly annotates these situations.

> And the users or our ability to do anything
> about it is minimal to none as we don't have access to firmware
> source.

This is not true, we have open firmware in WiFi. Some vendors choose
to not open source their firmware, that is their decision.

These days though, I think we all admit, that firmware crashes can use
a better generic infrastructure for ensuring that clearly affecting-user
experience issues. This patch is about that *when and if these happen*,
we annotate it in the kernel for support pursposes.

> It's too bad and I wish it weren't the case, but we have
> embraced reality and most drivers have a recovery mechanism built in
> for this case.

The mentality about firmware crashes being the end of the world is
certainly what will lead developers to often hide these. Where this
is openly clear, and not obfucscated I'd argue that firmware issues
get fixed likely more common.

So what you describe is not bad, its just accepting evolution.

> In short, it's a non-event. I fear that elevating this
> to a kernel taint will significantly increase "support" requests that
> really are nothing but noise;

That will depend on where you put this on the driver, and that is
why it is important to place it in the right place, if any.

> similar to how the firmware load failure
> messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded
> fw-0.bin) cause a lot of noise.

That can be fixed, the developers behind this series gave up on it.
It has to do with a range version of supported firmwares, and all
being optional, but at least one is required.

> Not specifically opposed, but I wonder what it really accomplishes in
> a world where the firmware crashing is pretty much a normal
> occurrence.

Recovery without affecting user experience would be great, the taint is
*not* for those cases. The taint definition has:

+ 18) ``Q`` used by device drivers to annotate that the device driver's firmware
+ has crashed and the device's operation has been severely affected. The
+ device may be left in a crippled state, requiring full driver removal /
+ addition, system reboot, or it is unclear how long recovery will take.

Let me know if this is not clear.

> If it goes in, I think that the drivers shouldn't trigger the taint if
> they're able to recover normally. Only trigger on failure to come back
> up. In other words, the ideal place in the ath10k driver isn't where
> you have proposed as at that point operation is normal and we're doing
> a routine recovery.

Sure, happy to remove it if indeed it is the case that the firwmare
crash is not happening to cripple the device, but I can vouch for the
fact that the exact place where I placed it left my device driver in a
state where I had to remove / add again.

Luis

2020-05-18 20:05:02

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 09:25:09PM +0200, Johannes Berg wrote:
> On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote:
>
> > > Unfortunately a "taint" is interpreted by many users as: "your kernel
> > > is really F#*D up, you better do something about it right now."
> > > Assuming they're paying attention at all in the first place of course.
> >
> > Taint historically has been used and still is today to help rule out
> > whether or not you get support, or how you get support.
> >
> > For instance, a staging driver is not supported by some upstream
> > developers, but it will be by those who help staging and Greg. TAINT_CRAP
> > cannot be even more clear.
> >
> > So, no, it is not just about "hey your kernel is messed up", there are
> > clear support boundaries being drawn.
>
> Err, no. Those two are most definitely related. Have you looked at (most
> or some or whatever) staging drivers recently? Those contain all kinds
> of garbage that might do whatever with your kernel.

No, I stay away :)

> Of course that's not a completely clear boundary, maybe you can find a
> driver in staging that's perfect code just not written to kernel style?
> But I find that hard to believe, in most cases.
>
> So no, it's really not about "[a] staging driver is not supported" vs.
> "your kernel is messed up". The very fact that you loaded one of those
> things might very well have messed up your kernel entirely.
>
> > These days though, I think we all admit, that firmware crashes can use
> > a better generic infrastructure for ensuring that clearly affecting-user
> > experience issues. This patch is about that *when and if these happen*,
> > we annotate it in the kernel for support pursposes.
>
> That's all fine, I just don't think it's appropriate to pretend that
> your kernel is now 'tainted' (think about the meaning of that word) when
> the firmware of some random device crashed.

If the firmware crash *does* require driver remove / addition again,
or a reboot, would you think that this is a situation that merits a taint?

> > Recovery without affecting user experience would be great, the taint is
> > *not* for those cases. The taint definition has:
> >
> > + 18) ``Q`` used by device drivers to annotate that the device driver's firmware
> > + has crashed and the device's operation has been severely affected. The
> > + device may be left in a crippled state, requiring full driver removal /
> > + addition, system reboot, or it is unclear how long recovery will take.
> >
> > Let me know if this is not clear.
>
> It's pretty clear, but even then, first of all I doubt this is the case
> for many of the places that you've sprinkled the annotation on,

We can remove it, for this driver I can vouch for its location as it did
reach a state where I required a reboot. And its not the first time this
has happened. This got me thinking about the bigger picture of the lack
of proper way to address these cases in the kernel, and how the user is
left dumbfounded.

> and secondly it actually hides useful information.

What is it hiding?

> Regardless of the support issue, I think this hiding of information is
> also problematic.
>
> I really think we'd all be better off if you just made a sysfs file (I
> mistyped debugfs in some other email, sorry, apparently you didn't see
> the correction in time) that listed which device(s) crashed and how many
> times.

Ah yes, count. The taint does not address count.

> That would actually be useful. Because honestly, if a random
> device crashed for some random reason, that's pretty much a non-event.
> If it keeps happening, then we might even want to know about it.

True.

> You can obviously save the contents of this file into your bug reports
> automatically and act accordingly, but I think you'll find that this is
> far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this
> report.

Absolutely.

> Yeah, that might be reasonable thing if the bug report is about
> slow wifi *and* you see that ath10k firmware crashed every 10 seconds,
> but if it just crashed once a few days earlier it's of no importance to
> the system anymore ... And certainly a reasonable driver (which I
> believe ath10k to be) would _not_ randomly start corrupting memory
> because its firmware crashed. Which really is what tainting the kernel
> is about.

I still see it as a support thing too. But discussing this further is
pointless as I agree that taint does not cover count and that it is
important.


Luis

2020-05-18 20:31:45

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:
> It's pretty clear, but even then, first of all I doubt this is the case
> for many of the places that you've sprinkled the annotation on, and
> secondly it actually hides useful information.
>
> Regardless of the support issue, I think this hiding of information is
> also problematic.
>
> I really think we'd all be better off if you just made a sysfs file (I
> mistyped debugfs in some other email, sorry, apparently you didn't see
> the correction in time) that listed which device(s) crashed and how many
> times. That would actually be useful. Because honestly, if a random
> device crashed for some random reason, that's pretty much a non-event.
> If it keeps happening, then we might even want to know about it.

Johannes - have you seen devlink health? I think we should just use
that interface, since it supports all the things you're requesting,
rather than duplicate it in sysfs.

2020-05-18 20:31:57

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:
> > It's pretty clear, but even then, first of all I doubt this is the case
> > for many of the places that you've sprinkled the annotation on, and
> > secondly it actually hides useful information.
> >
> > Regardless of the support issue, I think this hiding of information is
> > also problematic.
> >
> > I really think we'd all be better off if you just made a sysfs file (I
> > mistyped debugfs in some other email, sorry, apparently you didn't see
> > the correction in time) that listed which device(s) crashed and how many
> > times. That would actually be useful. Because honestly, if a random
> > device crashed for some random reason, that's pretty much a non-event.
> > If it keeps happening, then we might even want to know about it.
>
> Johannes - have you seen devlink health? I think we should just use
> that interface, since it supports all the things you're requesting,
> rather than duplicate it in sysfs.

I haven't, and I'm glad to hear that's there, sounds good!

I suspect that Luis wants something more generic though, that isn't just
applicable to netdevices, unless devlink grew some kind of non-netdev
stuff while I wasn't looking? :)

johannes

2020-05-18 20:45:06

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
>
> It's intended to be a generic netlink channel for configuring devices.
>
> All the firmware-related interfaces have no dependencies on netdevs,
> in fact that's one of the reasons we moved to devlink - we don't want
> to hold rtnl lock just for talking to device firmware.

Sounds good :)

So I guess Luis just has to add some way in devlink to hook up devlink
health in a simple way to drivers, perhaps? I mean, many drivers won't
really want to use devlink for anything else, so I guess it should be as
simple as the API that Luis proposed ("firmware crashed for this struct
device"), if nothing more interesting is done with devlink?

Dunno. But anyway sounds like it should somehow integrate there rather
than the way this patchset proposed?

johannes

2020-05-18 21:19:30

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 10:07:49PM +0200, Johannes Berg wrote:
> On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote:
>
> > > Err, no. Those two are most definitely related. Have you looked at (most
> > > or some or whatever) staging drivers recently? Those contain all kinds
> > > of garbage that might do whatever with your kernel.
> >
> > No, I stay away :)
>
> :)
>
> > > That's all fine, I just don't think it's appropriate to pretend that
> > > your kernel is now 'tainted' (think about the meaning of that word) when
> > > the firmware of some random device crashed.
> >
> > If the firmware crash *does* require driver remove / addition again,
> > or a reboot, would you think that this is a situation that merits a taint?
>
> Not really. In my experience, that's more likely a hardware issue (card
> not properly seated, for example) that a bus reset happens to "fix".
>
> > > It's pretty clear, but even then, first of all I doubt this is the case
> > > for many of the places that you've sprinkled the annotation on,
> >
> > We can remove it, for this driver I can vouch for its location as it did
> > reach a state where I required a reboot. And its not the first time this
> > has happened. This got me thinking about the bigger picture of the lack
> > of proper way to address these cases in the kernel, and how the user is
> > left dumbfounded.
>
> Fair, so the driver is still broken wrt. recovery here. I still don't
> think that's a situation where e.g. the system should say "hey you have
> a taint here, if your graphics go bad now you should not report that
> bug" (which is effectively what the single taint bit does).

But again, let's think about the generic type of issue, and the
unexpected type of state that can be reached. The circumstance here
*does* lead to a case which is not recoverable. Now, consider how
many cases in the kernel where similar situations can happen and leave
the device or driver in a non-functional state.

> > > and secondly it actually hides useful information.
> >
> > What is it hiding?
>
> Most importantly, which device crashed. Secondarily I'd say how many
> times (*).

The device is implied by the module, the taint is applied to both.
If you had multiple devices, however, yes, it would not be possible
to distinguish from the taint which exact device it happened on.

So the only thing *generic* which would be left out is count.

> The information "firmware crashed" is really only useful in relation to
> the device.

If you have to reboot to get a functional network again then the device
is quite useless for many people, regardless of which device that
happened on.

But from a support perspective a sysfs interface which provides a tiny
bit more generic information indeed provides more value than a taint.

Luis

2020-05-18 21:23:15

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 01:46:43PM -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote:
> > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> > > It's intended to be a generic netlink channel for configuring devices.
> > >
> > > All the firmware-related interfaces have no dependencies on netdevs,
> > > in fact that's one of the reasons we moved to devlink - we don't want
> > > to hold rtnl lock just for talking to device firmware.
> >
> > Sounds good :)
> >
> > So I guess Luis just has to add some way in devlink to hook up devlink
> > health in a simple way to drivers, perhaps? I mean, many drivers won't
> > really want to use devlink for anything else, so I guess it should be as
> > simple as the API that Luis proposed ("firmware crashed for this struct
> > device"), if nothing more interesting is done with devlink?
> >
> > Dunno. But anyway sounds like it should somehow integrate there rather
> > than the way this patchset proposed?
>
> Right, that'd be great. Simple API to register a devlink instance with
> whatever number of reporters the device would need. I'm happy to help.

Indeed my issue with devlink is that it did not seem generic enough for
all devices which use firmware and for which firmware can crash. Support
should not have to be "add devlink support" + "now use this new hook",
but rather a very lighweight devlink_crash(device) call we can sprinkly
with only the device as a functional requirement.

Luis

2020-05-18 22:17:51

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote:
> Indeed my issue with devlink is that it did not seem generic enough for
> all devices which use firmware and for which firmware can crash. Support
> should not have to be "add devlink support" + "now use this new hook",
> but rather a very lighweight devlink_crash(device) call we can sprinkly
> with only the device as a functional requirement.

We can provide a lightweight devlink_crash(device) which only generates
the notification, without the need to register the health reporter or a
devlink instance upfront. But then we loose the ability to control the
recovery, count errors, etc. So I'd think that's not the direction we
want to go in.

2020-05-19 01:06:39

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 03:16:45PM -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote:
> > Indeed my issue with devlink is that it did not seem generic enough for
> > all devices which use firmware and for which firmware can crash. Support
> > should not have to be "add devlink support" + "now use this new hook",
> > but rather a very lighweight devlink_crash(device) call we can sprinkly
> > with only the device as a functional requirement.
>
> We can provide a lightweight devlink_crash(device) which only generates
> the notification, without the need to register the health reporter or a
> devlink instance upfront. But then we loose the ability to control the
> recovery, count errors, etc. So I'd think that's not the direction we
> want to go in.

Care to show me what the diff stat for a non devlink driver would look
like? Just keep in mind this taint is 1 line addition. Granted, if we
can use SmPL grammar to automate addition of an initial framework to a
driver that'd be great, but since the device addition is subsystem
specific (device_add() and friends), I don't suspect this will be easily
automated.

Luis

2020-05-19 14:03:06

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> On Sat, May 16, 2020 at 6:51 AM Johannes Berg <[email protected]> wrote:
> > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > detect that the device is really wedged enough that the only way we can
> > still try to recover is by completely unbinding the driver from it, then
> > we give userspace a uevent for that. I don't remember exactly how and
> > where that gets used (ChromeOS) though, but it'd be nice to have that
> > sort of thing as part of the infrastructure, in a sort of two-level
> > notification?
>
> <slight side track>
> We use this on certain devices where we know the underlying hardware
> has design issues that may lead to device failure

Ah, after reading below I see you meant for iwlwifi.

If userspace can indeed grow to support this, that would be fantastic.

I should note that I don't discourage hiding firmware or hardware
issues. Quite the contrary, I suspect that taking pride in being
trasnparent about it, and dealing with it fast can help lead the pack.
I wrote about this long ago in 2015 [0], and stand by it.

[0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html

> -- then when we see
> this sort of unrecoverable "firmware-death", we remove the
> device[*]+driver, force-reset the PCI device (SBR), and try to
> reload/reattach the driver. This all happens by way of a udev rule.

So you've sprikled your own udev event here as part of your kernel delta?

> We
> also log this sort of stuff (and metrics around it) for bug reports
> and health statistics, since we really hope to not see this happen
> often.

Assuming perfection is ideal but silly. So, what infrastructure do you
use for this sort of issue?

> [*] "We" (user space) don't actually do this...it happens via the
> 'remove_when_gone' module parameter abomination found in iwlwifi.

Holy moly.. but hey, at least it may seem a bit more seemless than forcing
a reboot / manual driver removal / addition to the user.

BTW is this likely a place on iwlwifi where the firmware likely crashed?

> I'd
> personally rather see the EVENT=INACESSIBLE stuff on its own, and let
> user space deal with when and how to remove and reset the device. But
> I digress too much here ;)
> </slight side track>

This is all useful information. We are just touching the surface of the
topic by addressing networking first. Imagine when we address other
subsystems.

> I really came to this thread to say that I also love the idea of a
> generic mechanism (a la $subject) to report firmware crashes, but I
> also have no interest in seeing a taint flag for it. For Chrome OS, I
> would readily (as in, we're already looking at more-hacky /
> non-generic ways to do this for drivers we care about) process these
> kinds of stats as they happen, logging metrics for bug reports and/or
> for automated crash statistics, when we see a firmware crash.

Great!

> A uevent
> would suit us very well I think, although it would be nice if drivers
> could also supply some small amount of informative text along with it

A follow up to this series was to add a uevent to add_taint(), however
since a *count* is not considered I think it is correct to seek
alternatives at this point. The leaner the solution the better though.

Do you have a pointer to what guys use so I can read?

> (e.g., a sort of "reason code", in case we can possibly aggregate
> certain failure types). We already do this sort of thing for WARN()
> and friends (not via uevent, but via log parsing; at least it has nice
> "cut here" markers!).

Indeed, similar things can indeed be argued about WARN*()... this
however can be non-device specific. With panic-on-warn becoming a
"thing", the more important it becomes to really tally exactly *why*
these WARN*()s may trigger.

> Perhaps

Note below.

> devlink (as proposed down-thread) would also fit the bill. I
> don't think sysfs alone would fit our needs, as we'd like to process
> these things as they happen, not only when a user submits a bug
> report.

I think we've reached a point where using "*Perhaps*" does not suffice,
and if there is already a *user* of similar desired infrastructure I
think we should jump on the opportunity to replace what you have with
something which could be used by other devices / subsystems which
require firmware. And indeed, also even consider in the abstract sense,
the possibility to leverage something like this for WARN*()s later too.

> > Level 1: firmware crashed, but we're recovering, at least mostly, and
> > it's more informational
>
> Chrome OS would love to track these things too, since we'd like to see
> these minimized, even if they're usually recoverable ;)
>
> > Level 2: device is wedged, going to try to recover by some more forceful
> > means (perhaps some devices can be power-cycled? etc.) but (more) state
> > would be lost in these cases?
>
> And we'd definitely want to know about these. We already get this for
> the iwlwifi case described above, in a non-generic way.
>
> In general, it's probably not that easy to tell the difference between
> 1 and 2, since even as you and Luis have noted, with the same driver
> (and the same driver location), you find the same crashes may or may
> not be recoverable. iwlwifi has extracted certain level 2 cases into
> iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
> ways in which level 1 crashes actually lead to severe
> (non-recoverable) failure.

And that is fine, accepting these for what they are will help. However,
leaving the user in the *dark*, is what we should *not do*.

Luis

2020-05-19 21:16:47

by Jakub Kicinski

[permalink] [raw]
Subject: [RFC 2/2] i2400m: use devlink health reporter

It builds.

Signed-off-by: Jakub Kicinski <[email protected]>
---
drivers/net/wimax/i2400m/rx.c | 2 ++
drivers/net/wimax/i2400m/usb.c | 5 +++++
2 files changed, 7 insertions(+)

diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
index c9fb619a9e01..cc7fe78f2df0 100644
--- a/drivers/net/wimax/i2400m/rx.c
+++ b/drivers/net/wimax/i2400m/rx.c
@@ -144,6 +144,7 @@
* i2400m_msg_size_check
* wimax_msg
*/
+#include <linux/devlink.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/if_arp.h>
@@ -712,6 +713,7 @@ void __i2400m_roq_queue(struct i2400m *i2400m, struct i2400m_roq *roq,
dev_err(dev, "SW BUG? failed to insert packet\n");
dev_err(dev, "ERX: roq %p [ws %u] skb %p nsn %d sn %u\n",
roq, roq->ws, skb, nsn, roq_data->sn);
+ devlink_simple_fw_reporter_report_crash(dev);
skb_queue_walk(&roq->queue, skb_itr) {
roq_data_itr = (struct i2400m_roq_data *) &skb_itr->cb;
nsn_itr = __i2400m_roq_nsn(roq, roq_data_itr->sn);
diff --git a/drivers/net/wimax/i2400m/usb.c b/drivers/net/wimax/i2400m/usb.c
index 9659f9e1aaa6..5c811dccbf1d 100644
--- a/drivers/net/wimax/i2400m/usb.c
+++ b/drivers/net/wimax/i2400m/usb.c
@@ -49,6 +49,7 @@
* usb_reset_device()
*/
#include "i2400m-usb.h"
+#include <linux/devlink.h>
#include <linux/wimax/i2400m.h>
#include <linux/debugfs.h>
#include <linux/slab.h>
@@ -423,6 +424,8 @@ int i2400mu_probe(struct usb_interface *iface,
if (usb_dev->speed != USB_SPEED_HIGH)
dev_err(dev, "device not connected as high speed\n");

+ devlink_simple_fw_reporter_prepare(dev);
+
/* Allocate instance [calls i2400m_netdev_setup() on it]. */
result = -ENOMEM;
net_dev = alloc_netdev(sizeof(*i2400mu), "wmx%d", NET_NAME_UNKNOWN,
@@ -506,6 +509,7 @@ int i2400mu_probe(struct usb_interface *iface,
usb_put_dev(i2400mu->usb_dev);
free_netdev(net_dev);
error_alloc_netdev:
+ devlink_simple_fw_reporter_cleanup(dev);
return result;
}

@@ -532,6 +536,7 @@ void i2400mu_disconnect(struct usb_interface *iface)
usb_set_intfdata(iface, NULL);
usb_put_dev(i2400mu->usb_dev);
free_netdev(net_dev);
+ devlink_simple_fw_reporter_cleanup(dev);
d_fnend(3, dev, "(iface %p i2400m %p) = void\n", iface, i2400m);
}

--
2.25.4

2020-05-19 21:16:47

by Jakub Kicinski

[permalink] [raw]
Subject: [RFC 1/2] devlink: add simple fw crash helpers

Add infra for creating devlink instances for a device to report
fw crashes. This patch expects the devlink instance to be registered
at probe time. I belive to be the cleanest. We can also add a devm
version of the helpers, so that we don't have to do the clean up.
Or we can go even further and register the devlink instance only
once error has happened (for the first time, then we can just
find out if already registered by traversing the list like we
do here).

With the patch applied and a sample driver converted we get:

$ devlink dev
pci/0000:07:00.0

Then monitor for errors:

$ devlink mon health
[health,status] pci/0000:07:00.0:
reporter fw
state error error 1 recover 0
[health,status] pci/0000:07:00.0:
reporter fw
state error error 2 recover 0

These are the events I triggered on purpose. One can also inspect
the health of all devices capable of reporting fw errors:

$ devlink health
pci/0000:07:00.0:
reporter fw
state error error 7 recover 0

Obviously drivers may upgrade to the full devlink health API
which includes state dump, state dump auto-collect and automatic
error recovery control.

Signed-off-by: Jakub Kicinski <[email protected]>
---
include/linux/devlink.h | 11 +++
net/core/Makefile | 2 +-
net/core/devlink_simple_fw_reporter.c | 101 ++++++++++++++++++++++++++
3 files changed, 113 insertions(+), 1 deletion(-)
create mode 100644 include/linux/devlink.h
create mode 100644 net/core/devlink_simple_fw_reporter.c

diff --git a/include/linux/devlink.h b/include/linux/devlink.h
new file mode 100644
index 000000000000..2b73987eefca
--- /dev/null
+++ b/include/linux/devlink.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_DEVLINK_H_
+#define _LINUX_DEVLINK_H_
+
+struct device;
+
+void devlink_simple_fw_reporter_prepare(struct device *dev);
+void devlink_simple_fw_reporter_cleanup(struct device *dev);
+void devlink_simple_fw_reporter_report_crash(struct device *dev);
+
+#endif
diff --git a/net/core/Makefile b/net/core/Makefile
index 3e2c378e5f31..6f1513781c17 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
-obj-$(CONFIG_NET_DEVLINK) += devlink.o
+obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
obj-$(CONFIG_FAILOVER) += failover.o
obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
diff --git a/net/core/devlink_simple_fw_reporter.c b/net/core/devlink_simple_fw_reporter.c
new file mode 100644
index 000000000000..48dde9123c3c
--- /dev/null
+++ b/net/core/devlink_simple_fw_reporter.c
@@ -0,0 +1,101 @@
+#include <linux/devlink.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <net/devlink.h>
+
+struct devlink_simple_fw_reporter {
+ struct list_head list;
+ struct devlink_health_reporter *reporter;
+};
+
+
+static LIST_HEAD(devlink_simple_fw_reporters);
+static DEFINE_MUTEX(devlink_simple_fw_reporters_mutex);
+
+static const struct devlink_health_reporter_ops simple_devlink_health = {
+ .name = "fw",
+};
+
+static const struct devlink_ops simple_devlink_ops = {
+};
+
+static struct devlink_simple_fw_reporter *
+devlink_simple_fw_reporter_find_for_dev(struct device *dev)
+{
+ struct devlink_simple_fw_reporter *simple_devlink, *ret = NULL;
+ struct devlink *devlink;
+
+ mutex_lock(&devlink_simple_fw_reporters_mutex);
+ list_for_each_entry(simple_devlink, &devlink_simple_fw_reporters,
+ list) {
+ devlink = priv_to_devlink(simple_devlink);
+ if (devlink->dev == dev) {
+ ret = simple_devlink;
+ break;
+ }
+ }
+ mutex_unlock(&devlink_simple_fw_reporters_mutex);
+
+ return ret;
+}
+
+void devlink_simple_fw_reporter_report_crash(struct device *dev)
+{
+ struct devlink_simple_fw_reporter *simple_devlink;
+
+ simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
+ if (!simple_devlink)
+ return;
+
+ devlink_health_report(simple_devlink->reporter, "firmware crash", NULL);
+}
+EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_report_crash);
+
+void devlink_simple_fw_reporter_prepare(struct device *dev)
+{
+ struct devlink_simple_fw_reporter *simple_devlink;
+ struct devlink *devlink;
+
+ devlink = devlink_alloc(&simple_devlink_ops,
+ sizeof(struct devlink_simple_fw_reporter));
+ if (!devlink)
+ return;
+
+ if (devlink_register(devlink, dev))
+ goto err_free;
+
+ simple_devlink = devlink_priv(devlink);
+ simple_devlink->reporter =
+ devlink_health_reporter_create(devlink, &simple_devlink_health,
+ 0, NULL);
+ if (IS_ERR(simple_devlink->reporter))
+ goto err_unregister;
+
+ mutex_lock(&devlink_simple_fw_reporters_mutex);
+ list_add_tail(&simple_devlink->list, &devlink_simple_fw_reporters);
+ mutex_unlock(&devlink_simple_fw_reporters_mutex);
+
+ return;
+
+err_unregister:
+ devlink_unregister(devlink);
+err_free:
+ devlink_free(devlink);
+}
+EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_prepare);
+
+void devlink_simple_fw_reporter_cleanup(struct device *dev)
+{
+ struct devlink_simple_fw_reporter *simple_devlink;
+ struct devlink *devlink;
+
+ simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
+ if (!simple_devlink)
+ return;
+
+ devlink = priv_to_devlink(simple_devlink);
+ devlink_health_reporter_destroy(simple_devlink->reporter);
+ devlink_unregister(devlink);
+ devlink_free(devlink);
+}
+EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_cleanup);
--
2.25.4

2020-05-20 05:38:38

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

Hi all,

<top post intro>

Since I have been involved quite a bit in the firmware debugging
features in iwlwifi, I think I can give a few insights here.

But before this, we need to understand that there are several sources of issues:
1) the firmware may crash but the bus is still alive, you can still
use the bus to get the crash data
2) the bus is dead, when that happens, the firmware might even be in a
good condition, but since the bus is dead, you stop getting any
information about the firmware, and then, at some point, you get to
the conclusion that the firmware is dead. You can't get the crash data
that resides on the other side of the bus (you may have gathered data
in the DRAM directly, but that's a different thing), and you don't
have much recovery to do besides re-starting the PCI enumeration.

At Intel, we have seen both unfortunately. The bus issues are the ones
that are trickier obviously. Trickier to detect (because you just get
garbage from any request you issue on the bus), and trickier to
handle. One can argue that the kernel should *not* handle those and
let this in userspace hands. I guess it all depends on what component
you ship to your customer and what you customer asks from you :).

</top post intro>

>
> Hi Luis,
>
> On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <[email protected]> wrote:
> > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <[email protected]> wrote:
> > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > > detect that the device is really wedged enough that the only way we can
> > > > still try to recover is by completely unbinding the driver from it, then
> > > > we give userspace a uevent for that. I don't remember exactly how and
> > > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > > sort of thing as part of the infrastructure, in a sort of two-level
> > > > notification?
> > >
> > > <slight side track>
> > > We use this on certain devices where we know the underlying hardware
> > > has design issues that may lead to device failure
> >
> > Ah, after reading below I see you meant for iwlwifi.
>
> Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
> hat (as iwlwifi maintainer) on, and was pointing at
> iwl_trans_pcie_removal_wk().
>

This pcie_removal thing is for the bus dead thing. My 2) above.

> > If userspace can indeed grow to support this, that would be fantastic.
>
> Well, Chrome OS tailors its user space a bit more to the hardware (and
> kernel/drivers in use) than the average distro might. We already do
> this (for some values of "this") today. Is that "fantastic" to you? :D

I guess it can be fantastic if other vendors also suffer from this. Or
maybe that could be done as part of the PCI bus driver inside the
kernel?

>
> > > -- then when we see
> > > this sort of unrecoverable "firmware-death", we remove the
> > > device[*]+driver, force-reset the PCI device (SBR), and try to
> > > reload/reattach the driver. This all happens by way of a udev rule.
> >
> > So you've sprikled your own udev event here as part of your kernel delta?
>
> No kernel delta -- the event is there already:
> iwl_trans_pcie_removal_wk()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027
>
> And you can see our udev rules and scripts, in all their ugly details
> here, if you really care:
> https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/
>
> > > We
> > > also log this sort of stuff (and metrics around it) for bug reports
> > > and health statistics, since we really hope to not see this happen
> > > often.
> >
> > Assuming perfection is ideal but silly. So, what infrastructure do you
> > use for this sort of issue?
>
> We don't yet log firmware crashes generally, but for all our current
> crash reports (including WARN()), they go through this:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md
>
> For example, look for "cut here" in:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc
>
> For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
> use the Chrome UMA system:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md
>
> I don't imagine the "infrastructure" side of any of that would be
> useful to you, but maybe the client-side gathering can at least show
> you what we do.
>
> > > [*] "We" (user space) don't actually do this...it happens via the
> > > 'remove_when_gone' module parameter abomination found in iwlwifi.
> >
> > BTW is this likely a place on iwlwifi where the firmware likely crashed?
>
> iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
> in a way that is likely due to a dead PCIe endpoint. It's not directly
> a firmware crash, although there may be firmware crashes reported
> around the same time.

iwl_trans_pcie_removal_wk() is only because of a dead bus, not
because of a firmware crash.
>
> Intel folks can probably give a more nuanced answer, but their
> firmware crashes usually land in something like iwl_mvm_nic_error() ->
> iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
> your proposal less punishing (e.g., removing the "taint" as Johannes
> requested), maybe iwlwifi would be another good candidate for
> $subject. iwlwifi generally expects to recover seamlessly, but it's
> also good to know if you've seen a hundred of these in a row.

Yes, you are right, mvm_nic_error is the place.

So here is the proposal.
I think that a standard interface that can allow a driver to report a
firmware crash along with a proprietary unique id that makes sense to
the vendor can be useful. Then, yes, distros can listen to this,
optionally open bugs (automatically?) when that happens. I even
planned to do this long ago and integrated with a specific distro, but
it never rolled out. The vendor supplied unique id is very important
for the bug de-duplication part. For iwlwifi, we have the SYSASSERT
number which is basically how the firmware tells us briefly what
happened. Of course, there is always more data that can be useful, but
for a first level of bug de-duplication this can be enough. Then, you
have a standard way to track the firmware crashes.
We need to remember that the firmware recovery path is expected to
work, it is, yet, less tested. I specifically remember a bug where a
crash because by a bad handling of a CSA frame caused a firmware crash
in a flow that caused the driver not to re-add a station or something
like that, and because of that, you get another firmware crash. So
sometimes it is interesting to see not only the data on which firmware
crash happened and how many times, but if there is a timing
relationship between them. I guess that's for the next level though...
Not really critical for now.

The next problem here is that when you tell the firmware folks that
they have a bug, the first they'll do is to ask for more data. Back
then, I enabled our firmware debug infra on Linux, and from there
devcoredump infra was born (thanks Johannes). What we do at Intel, is
that we have a script that runs when a udev event reports the creation
of a devcoredump. It parses the kernel log (dmesg) to determine the
unique id I mentioned earlier (the SYSSASSERT number basically) and
then it creates a file in /var/log/ whose name contain the SYSSASSERT
number. It is then quite easy to match the firmware data with the
firmware crash.
So I believe the right way to do this is to create the devcoredump
along with the id. Meaning that we basically don't need another
interface, we just need to use the same one, but to provide the unique
id of the crash. This unique id can then be part of the uevent that is
thrown to the userspace and userspace can use it to name the dump file
with the right id. This way, it is fairly easy (and standard across
vendors) to track the firmware crashes, but the most important is that
you already have the firmware data that goes along with it. It would
look like this.


driver_code.c:
void my_vendor_error_interrupt()
{
u64 uid = get_the_unique_id_from_your_device();
void *firmware_data = get_the_data_you_need();

dev_coredumpsg(dev_struct, firmware_data, firmware_data_len,
GFP_whatver, uid);
}

And then, this would cause a:
/var/log/wifi-crash-$(KBUILD_MODNAME)-,uid.bin to appear our your filesystem.

>
> > > A uevent
> > > would suit us very well I think, although it would be nice if drivers
> > > could also supply some small amount of informative text along with it
> >
> > A follow up to this series was to add a uevent to add_taint(), however
> > since a *count* is not considered I think it is correct to seek
> > alternatives at this point. The leaner the solution the better though.
> >
> > Do you have a pointer to what guys use so I can read?
>
> Hopefully the above pointers are useful to you. We don't yet have
> firmware-crash parsing implemented, so I don't have pointers for that
> piece yet.

See above. I don't think it is the device driver's role to count those.
We can add this count this in userspace I think. Debatable though.

>
> > > (e.g., a sort of "reason code", in case we can possibly aggregate
> > > certain failure types). We already do this sort of thing for WARN()
> > > and friends (not via uevent, but via log parsing; at least it has nice
> > > "cut here" markers!).
> >
> > Indeed, similar things can indeed be argued about WARN*()... this
> > however can be non-device specific. With panic-on-warn becoming a
> > "thing", the more important it becomes to really tally exactly *why*
> > these WARN*()s may trigger.
>
> panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm
> probably not supposed to publicize information related to how many
> Chromebooks are out there, but we regularly see a scary number of
> non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
> every day, and a scary number of those are in WiFi drivers...
>
> > > Perhaps
> >
> > Note below.
>
> (My use of "perhaps" is only because I'm not familiar with devlink at all.)
>
> > > devlink (as proposed down-thread) would also fit the bill. I
> > > don't think sysfs alone would fit our needs, as we'd like to process
> > > these things as they happen, not only when a user submits a bug
> > > report.
> >
> > I think we've reached a point where using "*Perhaps*" does not suffice,
> > and if there is already a *user* of similar desired infrastructure I
> > think we should jump on the opportunity to replace what you have with
> > something which could be used by other devices / subsystems which
> > require firmware. And indeed, also even consider in the abstract sense,
> > the possibility to leverage something like this for WARN*()s later too.
>
> To precisely lay out what Chrome OS does today:
>
> * WARN() and similar: implemented, see anomaly_detector.cc above. It's
> just log parsing, and that handy "cut here" stuff -- I'm nearly
> certain there are other distros using this already? A uevent would
> probably be nice, but log parsing is good enough for us today.
> * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
> above. It's a uevent, and we handle it via udev. Code is linked above.
> * Other firmware crashes: not yet implemented here, but we're looking
> at it. Currently, we plan to do something similar to WARN(), but it
> will be ugly and non-generic. A uevent would be a lovely replacement,
> if it has some basic metadata with it. Stats in sysfs might be icing
> on the cake, but mostly redundant for us, if we can grab it on the fly
> via uevent.

So I believe we already have this uevent, it is the devcoredump. All
we need is to add the unique id.
Note that all this is good for firmware crashes, and not for bus dead
scenarios, but those two are fundamentally different even if they look
alike at the beginning of your error detection flow.
>
> HTH,
> Brian

2020-05-20 08:33:45

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Wed, May 20, 2020 at 8:40 AM Emmanuel Grumbach <[email protected]> wrote:

> Since I have been involved quite a bit in the firmware debugging
> features in iwlwifi, I think I can give a few insights here.
>
> But before this, we need to understand that there are several sources of issues:
> 1) the firmware may crash but the bus is still alive, you can still
> use the bus to get the crash data
> 2) the bus is dead, when that happens, the firmware might even be in a
> good condition, but since the bus is dead, you stop getting any
> information about the firmware, and then, at some point, you get to
> the conclusion that the firmware is dead. You can't get the crash data
> that resides on the other side of the bus (you may have gathered data
> in the DRAM directly, but that's a different thing), and you don't
> have much recovery to do besides re-starting the PCI enumeration.
>
> At Intel, we have seen both unfortunately. The bus issues are the ones
> that are trickier obviously. Trickier to detect (because you just get
> garbage from any request you issue on the bus), and trickier to
> handle. One can argue that the kernel should *not* handle those and
> let this in userspace hands. I guess it all depends on what component
> you ship to your customer and what you customer asks from you :).

Or the two best approaches:
1) get rid of firmware completely;
2) make it OSS (like SOF).

I think any of these is a right thing to do in long-term perspective.

How many firmwares average computer has? 50? 100? Any of them is a
burden and PITA.

--
With Best Regards,
Andy Shevchenko

2020-05-21 19:04:23

by Brian Norris

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <[email protected]> wrote:
> So I believe we already have this uevent, it is the devcoredump. All
> we need is to add the unique id.

I think there are a few reasons that devcoredump doesn't satisfy what
either Luis or I want.

1) it can be disabled entirely [1], for good reasons (e.g., think of
non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
with the opaque dumps provided by closed-source firmware)
2) not all drivers necessarily have a useful dump to provide when
there's a crash; look at the rest of Luis's series to see the kinds of
drivers-with-firmware that are crashing, some of which aren't dumping
anything
3) for those that do support devcoredump, it may be used for purposes
that are not "crashes" -- e.g., some provide debugfs or other knobs to
initiate dumps, for diagnostic or debugging purposes

Brian

[1] devcd_disabled
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22

2020-05-22 05:15:46

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

>
> On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <[email protected]> wrote:
> > So I believe we already have this uevent, it is the devcoredump. All
> > we need is to add the unique id.
>
> I think there are a few reasons that devcoredump doesn't satisfy what
> either Luis or I want.
>
> 1) it can be disabled entirely [1], for good reasons (e.g., think of
> non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> with the opaque dumps provided by closed-source firmware)

Ok, if all you're interested into is the information that this event
happen (as opposed to report a bug and providing the data), then I
agree. True, not everybody want or can enable devcoredump. I am just a
bit concerned that we may end up with two interface that notify the
same event basically. The ideal maybe would be to be able to
optionally reduce the content of the devoredump to nothing more that
is already in the dmesg output. But then, it is not what it is meant
to be: namely, a core dump..

> 2) not all drivers necessarily have a useful dump to provide when
> there's a crash; look at the rest of Luis's series to see the kinds of
> drivers-with-firmware that are crashing, some of which aren't dumping
> anything

Fair enouh.

> 3) for those that do support devcoredump, it may be used for purposes
> that are not "crashes" -- e.g., some provide debugfs or other knobs to
> initiate dumps, for diagnostic or debugging purposes

Not sure I really think we need to care about those cases, but you
already have 2 good arguments :)

>
> Brian
>
> [1] devcd_disabled
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22

2020-05-22 05:24:06

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC 1/2] devlink: add simple fw crash helpers

On Tue, May 19, 2020 at 02:15:30PM -0700, Jakub Kicinski wrote:
> Add infra for creating devlink instances for a device to report

Thanks for doing this series as a PoC, counter to the module_firmware_crash()
which I proposed to taint the kernel with a firmware crash flag to the kernel
and module.

For those not famliar about devlink:

https://lwn.net/Articles/677967/
https://www.kernel.org/doc/html/latest/networking/devlink/index.html

The github page also is now 404 as Jiri merged that stuff into iproute2:

git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

> fw crashes. This patch expects the devlink instance to be registered
> at probe time. I belive to be the cleanest. We can also add a devm
> version of the helpers, so that we don't have to do the clean up.
> Or we can go even further and register the devlink instance only
> once error has happened (for the first time, then we can just
> find out if already registered by traversing the list like we
> do here).
>
> With the patch applied and a sample driver converted we get:
>
> $ devlink dev
> pci/0000:07:00.0
>
> Then monitor for errors:
>
> $ devlink mon health
> [health,status] pci/0000:07:00.0:
> reporter fw
> state error error 1 recover 0
> [health,status] pci/0000:07:00.0:
> reporter fw
> state error error 2 recover 0
>
> These are the events I triggered on purpose. One can also inspect
> the health of all devices capable of reporting fw errors:
>
> $ devlink health
> pci/0000:07:00.0:
> reporter fw
> state error error 7 recover 0
>
> Obviously drivers may upgrade to the full devlink health API
> which includes state dump, state dump auto-collect and automatic
> error recovery control.
>
> Signed-off-by: Jakub Kicinski <[email protected]>
> ---
> include/linux/devlink.h | 11 +++
> net/core/Makefile | 2 +-
> net/core/devlink_simple_fw_reporter.c | 101 ++++++++++++++++++++++++++
> 3 files changed, 113 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/devlink.h
> create mode 100644 net/core/devlink_simple_fw_reporter.c
>
> diff --git a/include/linux/devlink.h b/include/linux/devlink.h
> new file mode 100644
> index 000000000000..2b73987eefca
> --- /dev/null
> +++ b/include/linux/devlink.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef _LINUX_DEVLINK_H_
> +#define _LINUX_DEVLINK_H_
> +
> +struct device;
> +
> +void devlink_simple_fw_reporter_prepare(struct device *dev);
> +void devlink_simple_fw_reporter_cleanup(struct device *dev);
> +void devlink_simple_fw_reporter_report_crash(struct device *dev);
> +
> +#endif
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 3e2c378e5f31..6f1513781c17 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
> obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
> obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> -obj-$(CONFIG_NET_DEVLINK) += devlink.o
> +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o

This was looking super sexy up to here. This is networking specific.
We want something generic for *anything* that requests firmware.

I'm afraid this won't work for something generic. I don't think its
throw-away work though, the idea to provide a generic interface to
dump firmware through netlink might be nice for networking, or other
things.

But I have a feeling we'll want something still more generic than this.

So networking may want to be aware that a firmware crash happened as
part of this network device health thing, but firmware crashing is a
generic thing.

I have now extended my patch set to include uvents and I am more set on
that we need the taint now more than ever.

Luis

> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
> obj-$(CONFIG_FAILOVER) += failover.o
> obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
> diff --git a/net/core/devlink_simple_fw_reporter.c b/net/core/devlink_simple_fw_reporter.c
> new file mode 100644
> index 000000000000..48dde9123c3c
> --- /dev/null
> +++ b/net/core/devlink_simple_fw_reporter.c
> @@ -0,0 +1,101 @@
> +#include <linux/devlink.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <net/devlink.h>
> +
> +struct devlink_simple_fw_reporter {
> + struct list_head list;
> + struct devlink_health_reporter *reporter;
> +};
> +
> +
> +static LIST_HEAD(devlink_simple_fw_reporters);
> +static DEFINE_MUTEX(devlink_simple_fw_reporters_mutex);
> +
> +static const struct devlink_health_reporter_ops simple_devlink_health = {
> + .name = "fw",
> +};
> +
> +static const struct devlink_ops simple_devlink_ops = {
> +};
> +
> +static struct devlink_simple_fw_reporter *
> +devlink_simple_fw_reporter_find_for_dev(struct device *dev)
> +{
> + struct devlink_simple_fw_reporter *simple_devlink, *ret = NULL;
> + struct devlink *devlink;
> +
> + mutex_lock(&devlink_simple_fw_reporters_mutex);
> + list_for_each_entry(simple_devlink, &devlink_simple_fw_reporters,
> + list) {
> + devlink = priv_to_devlink(simple_devlink);
> + if (devlink->dev == dev) {
> + ret = simple_devlink;
> + break;
> + }
> + }
> + mutex_unlock(&devlink_simple_fw_reporters_mutex);
> +
> + return ret;
> +}
> +
> +void devlink_simple_fw_reporter_report_crash(struct device *dev)
> +{
> + struct devlink_simple_fw_reporter *simple_devlink;
> +
> + simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
> + if (!simple_devlink)
> + return;
> +
> + devlink_health_report(simple_devlink->reporter, "firmware crash", NULL);
> +}
> +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_report_crash);
> +
> +void devlink_simple_fw_reporter_prepare(struct device *dev)
> +{
> + struct devlink_simple_fw_reporter *simple_devlink;
> + struct devlink *devlink;
> +
> + devlink = devlink_alloc(&simple_devlink_ops,
> + sizeof(struct devlink_simple_fw_reporter));
> + if (!devlink)
> + return;
> +
> + if (devlink_register(devlink, dev))
> + goto err_free;
> +
> + simple_devlink = devlink_priv(devlink);
> + simple_devlink->reporter =
> + devlink_health_reporter_create(devlink, &simple_devlink_health,
> + 0, NULL);
> + if (IS_ERR(simple_devlink->reporter))
> + goto err_unregister;
> +
> + mutex_lock(&devlink_simple_fw_reporters_mutex);
> + list_add_tail(&simple_devlink->list, &devlink_simple_fw_reporters);
> + mutex_unlock(&devlink_simple_fw_reporters_mutex);
> +
> + return;
> +
> +err_unregister:
> + devlink_unregister(devlink);
> +err_free:
> + devlink_free(devlink);
> +}
> +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_prepare);
> +
> +void devlink_simple_fw_reporter_cleanup(struct device *dev)
> +{
> + struct devlink_simple_fw_reporter *simple_devlink;
> + struct devlink *devlink;
> +
> + simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
> + if (!simple_devlink)
> + return;
> +
> + devlink = priv_to_devlink(simple_devlink);
> + devlink_health_reporter_destroy(simple_devlink->reporter);
> + devlink_unregister(devlink);
> + devlink_free(devlink);
> +}
> +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_cleanup);
> --
> 2.25.4
>

2020-05-22 21:50:29

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC 1/2] devlink: add simple fw crash helpers

On Fri, May 22, 2020 at 10:17:38AM -0700, Jakub Kicinski wrote:
> On Fri, 22 May 2020 05:20:46 +0000 Luis Chamberlain wrote:
> > > diff --git a/net/core/Makefile b/net/core/Makefile
> > > index 3e2c378e5f31..6f1513781c17 100644
> > > --- a/net/core/Makefile
> > > +++ b/net/core/Makefile
> > > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
> > > obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
> > > obj-$(CONFIG_DST_CACHE) += dst_cache.o
> > > obj-$(CONFIG_HWBM) += hwbm.o
> > > -obj-$(CONFIG_NET_DEVLINK) += devlink.o
> > > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o
> >
> > This was looking super sexy up to here. This is networking specific.
> > We want something generic for *anything* that requests firmware.
>
> You can't be serious. It's network specific because of how the Kconfig
> is named?

Kconfig? What has that to do with anything? The issue I have is that the
solution I am looking for is for it to be agnostic to the subsystem. I
have found similar firmware crashes on gpu, media, scsci.

> Working for a company operating large data centers I would strongly
> prefer if we didn't have ten different ways of reporting firmware
> problems in the fleet.

Indeed.

> > I'm afraid this won't work for something generic. I don't think its
> > throw-away work though, the idea to provide a generic interface to
> > dump firmware through netlink might be nice for networking, or other
> > things.
> >
> > But I have a feeling we'll want something still more generic than this.
>
> Please be specific. Saying generic a lot is not helpful. The code (as
> you can see in this patch) is in no way network specific. Or are you
> saying there are machines out there running without netlink sockets?

No, I am saying I want something to work with any struct device.

> > So networking may want to be aware that a firmware crash happened as
> > part of this network device health thing, but firmware crashing is a
> > generic thing.
> >
> > I have now extended my patch set to include uvents and I am more set on
> > that we need the taint now more than ever.
>
> Please expect my nack if you're trying to add this to networking
> drivers.

The uevent mechanism is not for networking.

The taint however is, and I'd like to undertand how it is you do not see
that an undesirable requirement for a reboot is a clear case for a taint.

> The irony is you have a problem with a networking device and all the
> devices your initial set touched are networking. Two of the drivers
> you touched either have or will soon have devlink health reporters
> implemented.

That is all great, and I don't think its a bad idea to add
infrastructure / extend it to get more information about a firmware
crash dump. However, suggesting that devlink is the only solution we
need in the kernel without considering other subsystems is what I am
suggesting doesn't suit my needs. Networking was just the first
subsystem I am taclking now but I have patches where similar situations
happen across the kernel.

Luis

2020-05-22 21:52:27

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC 1/2] devlink: add simple fw crash helpers

On Fri, May 22, 2020 at 10:46:07PM +0200, Johannes Berg wrote:
> FWIW, I still completely disagree on that taint. You (Luis) obviously
> have been running into a bug in that driver, I doubt the firmware
> actually managed to wedge the hardware.

This hasn't happened just once, its happed many times sporadically now,
once a week or two weeks I'd say. And the system isn't being moved
around.

> But even if it did, that's still not really a kernel taint. The kernel
> itself isn't in any way affected by this.

Of course it is, a full reboot is required.

> Yes, the system is in a weird state now. But that's *not* equivalent to
> "kernel tainted".

Requiring a full reboot is a dire situation to be in, and loosing
connectivity to the point this is not recoverable likewise.

You guys are making out a taint to be the end of the world. We have a
taint even for a kernel warning, and as others have mentioned mac80211
already produces these.

What exactly is the opposition to a taint to clarify that a device
firmware has crashed and your system requires a full reboot?

Luis

2020-05-25 09:09:38

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [RFC 1/2] devlink: add simple fw crash helpers

On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:
> On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <[email protected]> wrote:

> I had to go RTFM re: kernel taints because it has been a very long
> time since I looked at them. It had always seemed to me that most were
> caused by "kernel-unfriendly" user actions. The most famous of course
> is loading proprietary modules, out-of-tree modules, forced module
> loads, etc... Honestly, I had forgotten the large variety of uses of
> the taint flags. For anyone who hasn't looked at taints recently, I
> recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
>
> In light of this I don't object to setting a taint on this anymore.
> I'm a little uneasy, but I've softened on it now, and now I feel it
> depends on implementation.
>
> Specifically, I don't think we should set a taint flag when a driver
> easily handles a routine firmware crash and is confident that things
> have come up just fine again. In other words, triggering the taint in
> every driver module where it spits out a log comment that it had a
> firmware crash and had to recover seems too much. Sure, firmware
> shouldn't crash, sure it should be open source so we can fix it,
> whatever...

While it may sound idealistic the firmware for the end-user, and even for mere
kernel developer like me, is a complete blackbox which has more access than
root user in the kernel. We have tons of firmwares and each of them potentially
dangerous beast. As a user I really care about my data and privacy (hacker can
oops a firmware in order to set a specific vector attack). So, tainting kernel
is _a least_ we can do there, the strict rules would be to reboot immediately.

> those sort of wishful comments simply ignore reality and
> our ability to affect effective change.

We can encourage users not to buy cheap crap for the starter.

> A lot of WiFi firmware crashes
> and for well-known cases the drivers handle them well. And in some
> cases, not so well and that should be a place the driver should detect
> and thus raise a red flag. If a WiFi firmware crash can bring down
> the kernel, there's either a major driver bug or some very funky
> hardware crap going on. That sort of thing we should be able to
> detect, mark with a taint (or something), and fix if within our sphere
> of influence. I guess what it comes down to me is how aggressive we
> are about setting the flag.

--
With Best Regards,
Andy Shevchenko


2020-05-25 21:27:52

by Ben Greear

[permalink] [raw]
Subject: Re: [RFC 1/2] devlink: add simple fw crash helpers



On 05/25/2020 02:07 AM, Andy Shevchenko wrote:
> On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:
>> On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <[email protected]> wrote:
>
>> I had to go RTFM re: kernel taints because it has been a very long
>> time since I looked at them. It had always seemed to me that most were
>> caused by "kernel-unfriendly" user actions. The most famous of course
>> is loading proprietary modules, out-of-tree modules, forced module
>> loads, etc... Honestly, I had forgotten the large variety of uses of
>> the taint flags. For anyone who hasn't looked at taints recently, I
>> recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
>>
>> In light of this I don't object to setting a taint on this anymore.
>> I'm a little uneasy, but I've softened on it now, and now I feel it
>> depends on implementation.
>>
>> Specifically, I don't think we should set a taint flag when a driver
>> easily handles a routine firmware crash and is confident that things
>> have come up just fine again. In other words, triggering the taint in
>> every driver module where it spits out a log comment that it had a
>> firmware crash and had to recover seems too much. Sure, firmware
>> shouldn't crash, sure it should be open source so we can fix it,
>> whatever...
>
> While it may sound idealistic the firmware for the end-user, and even for mere
> kernel developer like me, is a complete blackbox which has more access than
> root user in the kernel. We have tons of firmwares and each of them potentially
> dangerous beast. As a user I really care about my data and privacy (hacker can
> oops a firmware in order to set a specific vector attack). So, tainting kernel
> is _a least_ we can do there, the strict rules would be to reboot immediately.
>
>> those sort of wishful comments simply ignore reality and
>> our ability to affect effective change.
>
> We can encourage users not to buy cheap crap for the starter.

There is no stable wifi firmware for any price.

There is also no obvious feedback from even name-brand NICs like ath10k or AX200
when you report a crash.

That said, at least in my experience with ath10k-ct, the OS normally recovers fine
from firmware crashes. ath10k already reports full crash reports on udev, so
easy for user-space to notice and report bug reports upstream if it cares to. Probably
other NICs do the same, and if not, they certainly could.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-05-25 21:56:29

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC 1/2] devlink: add simple fw crash helpers

On Fri, 22 May 2020 22:46:07 +0200 Johannes Berg wrote:
> > The irony is you have a problem with a networking device and all the
> > devices your initial set touched are networking. Two of the drivers
> > you touched either have or will soon have devlink health reporters
> > implemented.
>
> Like I said above, do you think it'd be feasible to make a devcoredump
> out of devlink health reports? And can the report be in a way that we
> control the file format, or are there limits? I guess I should read the
> code to find out, but I figure you probably just know. But feel free to
> tell me to read it :)
>
> The reason I'm asking is that it's starting to sound like we really
> ought to be implementing devlink, but we've got a bunch of
> infrastructure that uses the devcoredump, and it'll take time
> (significantly so) to change all that...

In devlink world pure FW core dumps are exposed by devlink regions.
An API allowing reading device memory, registers, etc., but also
creating dumps of memory regions when things go wrong. It should be
a fairly straightforward migration.

Devlink health is more targeted, the dump is supposed to contain only
relevant information, selected and formatted by the driver. When device
misbehaves driver reads the relevant registers and FW state and creates
a formatted state dump. I believe each element of the dump must fit
into a netlink message (but there may be multiple elements, see
devlink_fmsg_prepare_skb()).

We should be able to convert dl-regions dumps and dl-health dumps into
devcoredumps, but since health reporters are supposed to be more
targeted there's usually multiple of them per device.

Conversely devcoredumps can be trivially exposed as dl-region dumps,
but I believe dl-health would require driver changes to format the
information appropriately.