Device driver firmware can crash, and sometimes, this can leave your
system in a state which makes the device or subsystem completely
useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
of scraping some magical words from the kernel log, which is driver
specific, is much easier. So instead this series provides a helper which
lets drivers annotate this and shows how to use this on networking
drivers.
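As a rough illustration of the userspace side (not part of this series), checking for the new taint could look like the sketch below. It assumes the bit number 18 proposed in patch 1 and that /proc/sys/kernel/tainted holds a single decimal mask; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Bit number proposed for TAINT_FIRMWARE_CRASH in this series (assumption). */
#define TAINT_FIRMWARE_CRASH_BIT 18

/* Return true if bit `bit` is set in the taint mask. */
bool taint_mask_has_bit(unsigned long mask, unsigned int bit)
{
	assert(bit < 8 * sizeof(mask));
	return (mask >> bit) & 1UL;
}

/*
 * Read the decimal taint mask from `path`, normally
 * /proc/sys/kernel/tainted. Returns 0 on success, -1 on failure.
 */
int read_taint_mask(const char *path, unsigned long *mask)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lu", mask) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A monitoring tool would call read_taint_mask("/proc/sys/kernel/tainted", &mask) and then taint_mask_has_bit(mask, TAINT_FIRMWARE_CRASH_BIT), with no driver-specific log parsing involved.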
My methodology for finding where firmware crashes is to git grep for
"crash" and then study the code to see if this is indeed
a place where the firmware crashes. In some places this is quite
obvious.
I'm starting off with networking first; if this gets merged, later on I
can focus on the other drivers, but I already have some work done on
other subsystems.
Review, flames, etc are greatly appreciated.
This work, only on networking drivers, can be found on my git tree as well:
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200509-taint-firmware-net
Luis Chamberlain (15):
taint: add module firmware crash taint support
ethernet/8390: use new module_firmware_crashed()
bnx2x: use new module_firmware_crashed()
bnxt: use new module_firmware_crashed()
bna: use new module_firmware_crashed()
liquidio: use new module_firmware_crashed()
cxgb4: use new module_firmware_crashed()
ehea: use new module_firmware_crashed()
qed: use new module_firmware_crashed()
soc: qcom: ipa: use new module_firmware_crashed()
wimax/i2400m: use new module_firmware_crashed()
ath10k: use new module_firmware_crashed()
ath6kl: use new module_firmware_crashed()
brcm80211: use new module_firmware_crashed()
mwl8k: use new module_firmware_crashed()
drivers/net/ethernet/8390/axnet_cs.c | 4 +++-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 1 +
drivers/net/ethernet/brocade/bna/bfa_ioc.c | 1 +
drivers/net/ethernet/cavium/liquidio/lio_main.c | 1 +
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 1 +
drivers/net/ethernet/ibm/ehea/ehea_main.c | 2 ++
drivers/net/ethernet/qlogic/qed/qed_debug.c | 3 +++
drivers/net/ipa/ipa_modem.c | 1 +
drivers/net/wimax/i2400m/rx.c | 1 +
drivers/net/wireless/ath/ath10k/pci.c | 2 ++
drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
drivers/net/wireless/ath/ath10k/snoc.c | 1 +
drivers/net/wireless/ath/ath6kl/hif.c | 1 +
.../net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 +
drivers/net/wireless/marvell/mwl8k.c | 1 +
include/linux/kernel.h | 3 ++-
include/linux/module.h | 13 +++++++++++++
include/trace/events/module.h | 3 ++-
kernel/module.c | 5 +++--
kernel/panic.c | 1 +
21 files changed, 44 insertions(+), 5 deletions(-)
--
2.25.1
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.
Using a taint flag allows us to annotate when this happens clearly.
Cc: Vishal Kulkarni <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index a70018f067aa..c67fc86c0e42 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3646,6 +3646,7 @@ void t4_fatal_err(struct adapter *adap)
* could be exposed to the adapter. RDMA MWs for example...
*/
t4_shutdown_adapter(adap);
+ module_firmware_crashed();
for_each_port(adap, port) {
struct net_device *dev = adap->port[port];
--
2.25.1
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.
Using a taint flag allows us to annotate when this happens clearly.
Cc: Ariel Elior <[email protected]>
Cc: Sudarsana Kalluru <[email protected]>
Cc: [email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index db5107e7937c..c38b8c9c8af0 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -909,6 +909,7 @@ void bnx2x_panic_dump(struct bnx2x *bp, bool disable_int)
bp->eth_stats.unrecoverable_error++;
DP(BNX2X_MSG_STATS, "stats_state - DISABLED\n");
+ module_firmware_crashed();
BNX2X_ERR("begin crash dump -----------------\n");
/* Indices */
--
2.25.1
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.
Using a taint flag allows us to annotate when this happens clearly.
Cc: Rasesh Mody <[email protected]>
Cc: Sudarsana Kalluru <[email protected]>
Cc: [email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/ethernet/brocade/bna/bfa_ioc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/brocade/bna/bfa_ioc.c b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
index e17bfc87da90..b3f44a912574 100644
--- a/drivers/net/ethernet/brocade/bna/bfa_ioc.c
+++ b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
@@ -927,6 +927,7 @@ bfa_iocpf_sm_disabled(struct bfa_iocpf *iocpf, enum iocpf_event event)
static void
bfa_iocpf_sm_initfail_sync_entry(struct bfa_iocpf *iocpf)
{
+ module_firmware_crashed();
bfa_nw_ioc_debug_save_ftrc(iocpf->ioc);
bfa_ioc_hw_sem_get(iocpf->ioc);
}
--
2.25.1
Device driver firmware can crash, and sometimes, this can leave your
system in a state which makes the device or subsystem completely
useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
of scraping some magical words from the kernel log, which is driver
specific, is much easier. So instead provide a helper which lets drivers
annotate this.
Once this happens, scrapers can easily look for module taint flags
for a firmware crash. This will taint both the kernel and the respective
calling module.
The new helper module_firmware_crashed() uses LOCKDEP_STILL_OK as this
fact should in no way, shape, or form affect lockdep. This taint is device
driver specific.
Signed-off-by: Luis Chamberlain <[email protected]>
---
include/linux/kernel.h | 3 ++-
include/linux/module.h | 13 +++++++++++++
include/trace/events/module.h | 3 ++-
kernel/module.c | 5 +++--
kernel/panic.c | 1 +
5 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 04a5885cec1b..19e1541c82c7 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -601,7 +601,8 @@ extern enum system_states {
#define TAINT_LIVEPATCH 15
#define TAINT_AUX 16
#define TAINT_RANDSTRUCT 17
-#define TAINT_FLAGS_COUNT 18
+#define TAINT_FIRMWARE_CRASH 18
+#define TAINT_FLAGS_COUNT 19
struct taint_flag {
char c_true; /* character printed when tainted */
diff --git a/include/linux/module.h b/include/linux/module.h
index 2c2e988bcf10..221200078180 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -697,6 +697,14 @@ static inline bool is_livepatch_module(struct module *mod)
bool is_module_sig_enforced(void);
void set_module_sig_enforced(void);
+void add_taint_module(struct module *mod, unsigned flag,
+ enum lockdep_ok lockdep_ok);
+
+static inline void module_firmware_crashed(void)
+{
+ add_taint_module(THIS_MODULE, TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
+}
+
#else /* !CONFIG_MODULES... */
static inline struct module *__module_address(unsigned long addr)
@@ -844,6 +852,11 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
return ptr;
}
+static inline void module_firmware_crashed(void)
+{
+ add_taint(TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
+}
+
#endif /* CONFIG_MODULES */
#ifdef CONFIG_SYSFS
diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index 097485c73c01..b749ea25affd 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -26,7 +26,8 @@ struct module;
{ (1UL << TAINT_OOT_MODULE), "O" }, \
{ (1UL << TAINT_FORCED_MODULE), "F" }, \
{ (1UL << TAINT_CRAP), "C" }, \
- { (1UL << TAINT_UNSIGNED_MODULE), "E" })
+ { (1UL << TAINT_UNSIGNED_MODULE), "E" }, \
+ { (1UL << TAINT_FIRMWARE_CRASH), "Q" })
TRACE_EVENT(module_load,
diff --git a/kernel/module.c b/kernel/module.c
index 80faaf2116dd..f98e8c25c6b4 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -325,12 +325,13 @@ static inline int strong_try_module_get(struct module *mod)
return -ENOENT;
}
-static inline void add_taint_module(struct module *mod, unsigned flag,
- enum lockdep_ok lockdep_ok)
+void add_taint_module(struct module *mod, unsigned flag,
+ enum lockdep_ok lockdep_ok)
{
add_taint(flag, lockdep_ok);
set_bit(flag, &mod->taints);
}
+EXPORT_SYMBOL_GPL(add_taint_module);
/*
* A thread that wants to hold a reference to a module only while it
diff --git a/kernel/panic.c b/kernel/panic.c
index ec6d7d788ce7..504fb926947e 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -384,6 +384,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
[ TAINT_LIVEPATCH ] = { 'K', ' ', true },
[ TAINT_AUX ] = { 'X', ' ', true },
[ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
+ [ TAINT_FIRMWARE_CRASH ] = { 'Q', ' ', true },
};
/**
--
2.25.1
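For readers who want to experiment with the semantics outside the kernel, the helper added in the patch above can be modeled in plain userspace C. This is only a sketch under stated assumptions: every model_* name is hypothetical, lockdep handling and the atomic set_bit() are left out, and the module is passed explicitly instead of via THIS_MODULE.

```c
#include <assert.h>
#include <stdbool.h>

#define TAINT_FIRMWARE_CRASH 18
#define TAINT_FLAGS_COUNT    19

/* Stand-in for the kernel's global taint mask. */
unsigned long model_tainted_mask;

/* Stand-in for struct module; only the per-module taint mask is modeled. */
struct model_module {
	const char *name;
	unsigned long taints;
};

/* Models add_taint(): sets a bit in the global mask. */
void model_add_taint(unsigned int flag)
{
	assert(flag < TAINT_FLAGS_COUNT);
	model_tainted_mask |= 1UL << flag;
}

/* Models add_taint_module(): taints the kernel and the given module. */
void model_add_taint_module(struct model_module *mod, unsigned int flag)
{
	model_add_taint(flag);
	mod->taints |= 1UL << flag;
}

/* Models module_firmware_crashed() with an explicit module argument. */
void model_firmware_crashed(struct model_module *mod)
{
	model_add_taint_module(mod, TAINT_FIRMWARE_CRASH);
}

/* Return true if `flag` is set in `mask`. */
bool model_test_taint(unsigned long mask, unsigned int flag)
{
	return (mask >> flag) & 1UL;
}
```

The model deliberately has no function that clears a bit, and calling model_firmware_crashed() twice leaves the masks unchanged after the first call.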
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.
Using a taint flag allows us to annotate when this happens clearly.
Cc: Alex Elder <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/ipa/ipa_modem.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ipa/ipa_modem.c b/drivers/net/ipa/ipa_modem.c
index ed10818dd99f..1790b87446ed 100644
--- a/drivers/net/ipa/ipa_modem.c
+++ b/drivers/net/ipa/ipa_modem.c
@@ -285,6 +285,7 @@ static void ipa_modem_crashed(struct ipa *ipa)
struct device *dev = &ipa->pdev->dev;
int ret;
+ module_firmware_crashed();
ipa_endpoint_modem_pause_all(ipa, true);
ipa_endpoint_modem_hol_block_clear_all(ipa);
--
2.25.1
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.
Using a taint flag allows us to annotate when this happens clearly.
Cc: [email protected]
Cc: [email protected]
Cc: Kalle Valo <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/net/wireless/ath/ath10k/pci.c | 2 ++
drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
drivers/net/wireless/ath/ath10k/snoc.c | 1 +
3 files changed, 5 insertions(+)
diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..6bd0f3b518b9 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
scnprintf(guid, sizeof(guid), "n/a");
ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+ module_firmware_crashed();
ath10k_print_driver_info(ar);
ath10k_pci_dump_registers(ar, crash_data);
ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
if (ret) {
if (ath10k_pci_has_fw_crashed(ar)) {
ath10k_warn(ar, "firmware crashed during chip reset\n");
+ module_firmware_crashed();
ath10k_pci_fw_crashed_clear(ar);
ath10k_pci_fw_crashed_dump(ar);
}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..d34ad289380f 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
/* TODO: Add firmware crash handling */
ath10k_warn(ar, "firmware crashed\n");
+ module_firmware_crashed();
/* read counter to clear the interrupt, the debug error interrupt is
* counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
ath10k_err(ar, "firmware crashed!\n");
queue_work(ar->workqueue, &ar->restart_work);
+ module_firmware_crashed();
}
return ret;
}
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..7cfc123c345c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
scnprintf(guid, sizeof(guid), "n/a");
ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+ module_firmware_crashed();
ath10k_print_driver_info(ar);
ath10k_msa_dump_memory(ar, crash_data);
mutex_unlock(&ar->dump_mutex);
--
2.25.1
On Sat, May 09, 2020 at 04:35:38AM +0000, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead provide a helper which lets drivers
> annotate this.
>
> Once this happens, scrapers can easily look for modules taint flags
> for a firmware crash. This will taint both the kernel and respective
> calling module.
>
> The new helper module_firmware_crashed() uses LOCKDEP_STILL_OK as this
> fact should in no way shape or form affect lockdep. This taint is device
> driver specific.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> include/linux/kernel.h | 3 ++-
> include/linux/module.h | 13 +++++++++++++
> include/trace/events/module.h | 3 ++-
> kernel/module.c | 5 +++--
> kernel/panic.c | 1 +
> 5 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 04a5885cec1b..19e1541c82c7 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -601,7 +601,8 @@ extern enum system_states {
> #define TAINT_LIVEPATCH 15
> #define TAINT_AUX 16
> #define TAINT_RANDSTRUCT 17
> -#define TAINT_FLAGS_COUNT 18
> +#define TAINT_FIRMWARE_CRASH 18
> +#define TAINT_FLAGS_COUNT 19
>
We are still missing the documentation bits for this
new flag, though.
How about having a blurb similar to:
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 71e9184a9079..5c6a9e2478b0 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
15 _/K 32768 kernel has been live patched
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
+ 18 _/Q 262144 driver firmware crash annotation
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -162,3 +163,7 @@ More detailed explanation for tainting
produce extremely unusual kernel structure layouts (even performance
pathological ones), which is important to know when debugging. Set at
build time.
+
+ 18) ``Q`` Device drivers might annotate the kernel with this taint, in cases
+ their firmware might have crashed leaving the driver in a crippled and
+ potentially useless state.
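The Number column is simply 1 << bit, so the proposed bit 18 gives 262144. A minimal sketch of decoding a mask into letters follows, using only the rows visible in the excerpt above plus the proposed Q. The table and function names are hypothetical, and the real table has more entries than are modeled here.

```c
#include <assert.h>
#include <stddef.h>

struct taint_letter {
	unsigned int bit;
	char letter;
};

/* Only the rows shown in the excerpt, plus the proposed bit 18. */
static const struct taint_letter taint_letters[] = {
	{ 15, 'K' }, { 16, 'X' }, { 17, 'T' }, { 18, 'Q' },
};

/*
 * Write the letters of all set, known bits into `out` as a
 * NUL-terminated string. Letters that do not fit are dropped.
 * Returns the number of letters written.
 */
size_t taint_mask_to_letters(unsigned long mask, char *out, size_t outlen)
{
	size_t n = 0;
	size_t i;

	assert(outlen > 0);
	for (i = 0; i < sizeof(taint_letters) / sizeof(taint_letters[0]); i++) {
		if (!((mask >> taint_letters[i].bit) & 1UL))
			continue;
		if (n + 1 >= outlen)
			break;
		out[n++] = taint_letters[i].letter;
	}
	out[n] = '\0';
	return n;
}
```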
> struct taint_flag {
> char c_true; /* character printed when tainted */
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 2c2e988bcf10..221200078180 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -697,6 +697,14 @@ static inline bool is_livepatch_module(struct module *mod)
> bool is_module_sig_enforced(void);
> void set_module_sig_enforced(void);
>
> +void add_taint_module(struct module *mod, unsigned flag,
> + enum lockdep_ok lockdep_ok);
> +
> +static inline void module_firmware_crashed(void)
> +{
> + add_taint_module(THIS_MODULE, TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
> +}
> +
> #else /* !CONFIG_MODULES... */
>
> static inline struct module *__module_address(unsigned long addr)
> @@ -844,6 +852,11 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
> return ptr;
> }
>
> +static inline void module_firmware_crashed(void)
> +{
> + add_taint(TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
> +}
> +
> #endif /* CONFIG_MODULES */
>
> #ifdef CONFIG_SYSFS
> diff --git a/include/trace/events/module.h b/include/trace/events/module.h
> index 097485c73c01..b749ea25affd 100644
> --- a/include/trace/events/module.h
> +++ b/include/trace/events/module.h
> @@ -26,7 +26,8 @@ struct module;
> { (1UL << TAINT_OOT_MODULE), "O" }, \
> { (1UL << TAINT_FORCED_MODULE), "F" }, \
> { (1UL << TAINT_CRAP), "C" }, \
> - { (1UL << TAINT_UNSIGNED_MODULE), "E" })
> + { (1UL << TAINT_UNSIGNED_MODULE), "E" }, \
> + { (1UL << TAINT_FIRMWARE_CRASH), "Q" })
>
> TRACE_EVENT(module_load,
>
> diff --git a/kernel/module.c b/kernel/module.c
> index 80faaf2116dd..f98e8c25c6b4 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -325,12 +325,13 @@ static inline int strong_try_module_get(struct module *mod)
> return -ENOENT;
> }
>
> -static inline void add_taint_module(struct module *mod, unsigned flag,
> - enum lockdep_ok lockdep_ok)
> +void add_taint_module(struct module *mod, unsigned flag,
> + enum lockdep_ok lockdep_ok)
> {
> add_taint(flag, lockdep_ok);
> set_bit(flag, &mod->taints);
> }
> +EXPORT_SYMBOL_GPL(add_taint_module);
>
> /*
> * A thread that wants to hold a reference to a module only while it
> diff --git a/kernel/panic.c b/kernel/panic.c
> index ec6d7d788ce7..504fb926947e 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -384,6 +384,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> [ TAINT_LIVEPATCH ] = { 'K', ' ', true },
> [ TAINT_AUX ] = { 'X', ' ', true },
> [ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
> + [ TAINT_FIRMWARE_CRASH ] = { 'Q', ' ', true },
> };
>
> /**
> --
> 2.25.1
>
On Sat, May 09, 2020 at 11:18:29AM -0400, Rafael Aquini wrote:
> We are still missing the documentation bits for this
> new flag, though.
Ah yeah sorry about that.
> How about having a blurb similar to:
>
> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> index 71e9184a9079..5c6a9e2478b0 100644
> --- a/Documentation/admin-guide/tainted-kernels.rst
> +++ b/Documentation/admin-guide/tainted-kernels.rst
> @@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
> 15 _/K 32768 kernel has been live patched
> 16 _/X 65536 auxiliary taint, defined for and used by distros
> 17 _/T 131072 kernel was built with the struct randomization plugin
> + 18 _/Q 262144 driver firmware crash annotation
> === === ====== ========================================================
>
> Note: The character ``_`` is representing a blank in this table to make reading
> @@ -162,3 +163,7 @@ More detailed explanation for tainting
> produce extremely unusual kernel structure layouts (even performance
> pathological ones), which is important to know when debugging. Set at
> build time.
> +
> + 18) ``Q`` Device drivers might annotate the kernel with this taint, in cases
> + their firmware might have crashed leaving the driver in a crippled and
> + potentially useless state.
Sure, I'll modify it a bit to add the use case to help with support
issues, ie, to help rule out firmware issues.
I'm starting to think that to make this even more useful later we may
want to add a uevent to add_taint() so that userspace can decide to look
into this, ignore it, or report something to the user, say on their
desktop.
Luis
On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
>
> My methodology for finding when firmware crashes is to git grep for
> "crash" and then doing some study of the code to see if this indeed
> a place where the firmware crashes. In some places this is quite
> obvious.
>
> I'm starting off with networking first, if this gets merged later on I
> can focus on the other drivers, but I already have some work done on
> other subsytems.
>
> Review, flames, etc are greatly appreciated.
Tainting itself may be useful, but that's just the first step. I'd much
rather see folks start using the devlink health infrastructure. Devlink
is netlink based, but it's _not_ networking specific (many of its
optional features obviously are, but don't let that mislead you).
With devlink health we get (a) a standard notification on the failure;
(b) information/state dump in a (somewhat) structured form, which can be
collected & shared with vendors; (c) automatic remediation (usually
device reset of some scope).
Now regarding the tainting - as I said it may be useful, but don't we
have to define what constitutes a "firmware crash"? There are many
failure modes, some perfectly recoverable (e.g. processing queue hang),
some mere bugs (e.g. device fails to initialize some functions). All of
them may impact the functioning of the system. How do we choose those
that taint?
On 5/8/20 9:35 PM, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
>
If the driver is able to detect that the device firmware has come back
alive, through user intervention or whatever, should there be a way to
"untaint" the kernel? Or would you expect it to remain tainted?
sln
On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
> On 5/8/20 9:35 PM, Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> If the driver is able to detect that the device firmware has come back
> alive, through user intervention or whatever, should there be a way to
> "untaint" the kernel?? Or would you expect it to remain tainted?
Hi Shannon
In general, you don't want to be able to untaint. Say a non-GPL
licenced module is loaded, which taints the kernel. It might then try
to untaint the kernel to hide its.
As for firmware, how much damage can the firmware do as it crashed? If
it is a DMA master, it could have splattered stuff through
memory. Restarting the firmware is not going to reverse the damage it
has done.
Andrew
On 5/9/20 6:58 PM, Andrew Lunn wrote:
> On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
>> On 5/8/20 9:35 PM, Luis Chamberlain wrote:
>>> Device driver firmware can crash, and sometimes, this can leave your
>>> system in a state which makes the device or subsystem completely
>>> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
>>> of scraping some magical words from the kernel log, which is driver
>>> specific, is much easier. So instead this series provides a helper which
>>> lets drivers annotate this and shows how to use this on networking
>>> drivers.
>>>
>> If the driver is able to detect that the device firmware has come back
>> alive, through user intervention or whatever, should there be a way to
>> "untaint" the kernel? Or would you expect it to remain tainted?
> Hi Shannon
>
> In general, you don't want to be able to untained. Say a non-GPL
> licenced module is loaded, which taints the kernel. It might then try
> to untaint the kernel to hide its.
Yeah, obviously we don't want this to be abusable. I was just
wondering about reversing this particular status if the broken device
could get itself fixed.
>
> As for firmware, how much damage can the firmware do as it crashed? If
> it is a DMA master, it could of splattered stuff through
> memory. Restarting the firmware is not going to reverse the damage it
> has done.
>
True, and tho' the driver might get the thing restarted, it wouldn't
necessarily know what kind of damage had ensued.
Carry on,
sln
On 5/9/20 9:46 AM, Luis Chamberlain wrote:
> On Sat, May 09, 2020 at 11:18:29AM -0400, Rafael Aquini wrote:
>> We are still missing the documentation bits for this
>> new flag, though.
>
> Ah yeah sorry about that.
>
>> How about having a blurb similar to:
>>
>> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
>> index 71e9184a9079..5c6a9e2478b0 100644
>> --- a/Documentation/admin-guide/tainted-kernels.rst
>> +++ b/Documentation/admin-guide/tainted-kernels.rst
>> @@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
>> 15 _/K 32768 kernel has been live patched
>> 16 _/X 65536 auxiliary taint, defined for and used by distros
>> 17 _/T 131072 kernel was built with the struct randomization plugin
>> + 18 _/Q 262144 driver firmware crash annotation
>> === === ====== ========================================================
>>
>> Note: The character ``_`` is representing a blank in this table to make reading
>> @@ -162,3 +163,7 @@ More detailed explanation for tainting
>> produce extremely unusual kernel structure layouts (even performance
>> pathological ones), which is important to know when debugging. Set at
>> build time.
>> +
>> + 18) ``Q`` Device drivers might annotate the kernel with this taint, in cases
>> + their firmware might have crashed leaving the driver in a crippled and
>> + potentially useless state.
>
> Sure, I'll modify it a bit to add the use case to help with support
> issues, ie, to help rule out firmware issues.
Please also update tools/debugging/kernel-chktaint.
> I'm starting to think that to make this even more usesul later we may
> want to add a uevent to add_taint() so that userspace can decide to look
> into this, ignore it, or report something to the user, say on their
> desktop.
thanks.
--
~Randy
On Sat, May 09, 2020 at 11:35:46AM -0700, Jakub Kicinski wrote:
> On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> > My methodology for finding when firmware crashes is to git grep for
> > "crash" and then doing some study of the code to see if this indeed
> > a place where the firmware crashes. In some places this is quite
> > obvious.
> >
> > I'm starting off with networking first, if this gets merged later on I
> > can focus on the other drivers, but I already have some work done on
> > other subsytems.
> >
> > Review, flames, etc are greatly appreciated.
>
> Tainting itself may be useful, but that's just the first step. I'd much
> rather see folks start using the devlink health infrastructure. Devlink
> is netlink based, but it's _not_ networking specific (many of its
> optional features obviously are, but don't let that mislead you).
>
> With devlink health we get (a) a standard notification on the failure;
> (b) information/state dump in a (somewhat) structured form, which can be
> collected & shared with vendors; (c) automatic remediation (usually
> device reset of some scope).
It indeed sounds very useful!
> Now regarding the tainting - as I said it may be useful, but don't we
> have to define what constitutes a "firmware crash"?
Yes indeed, I missed clarifying this in the documentation. I'll do so
in my next respin.
> There are many
> failure modes, some perfectly recoverable (e.g. processing queue hang),
> some mere bugs (e.g. device fails to initialize some functions). All of
> them may impact the functioning of the system. How do we choose those
> that taint?
It's up to the maintainers of the device driver; what I was aiming for
were those firmware crashes which indeed *can* have an impact on user
experience, and can *even* potentially require a driver removal / addition
to get things back in order again.
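One possible reading of that criterion, written out as code purely for discussion: the failure classes and the policy below are hypothetical, loosely based on the examples given in this thread, and each driver maintainer would draw the line for their own hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical failure classes, following the examples in this thread. */
enum fw_failure {
	FW_QUEUE_HANG,          /* recoverable, e.g. a processing queue hang */
	FW_INIT_INCOMPLETE,     /* a bug, e.g. some functions fail to initialize */
	FW_CRASH_NEEDS_RELOAD,  /* recovery may need a driver unload / reload */
	FW_CRASH_NEEDS_REBOOT,  /* worst case: recovery needs a reboot */
};

/*
 * Hypothetical policy: taint only for failures that may need a driver
 * reload or a reboot to get the device working again.
 */
bool fw_failure_should_taint(enum fw_failure f)
{
	assert(f <= FW_CRASH_NEEDS_REBOOT);

	switch (f) {
	case FW_CRASH_NEEDS_RELOAD:
	case FW_CRASH_NEEDS_REBOOT:
		return true;
	case FW_QUEUE_HANG:
	case FW_INIT_INCOMPLETE:
		return false;
	}
	return false;
}
```

In a driver, a true result is where a call to module_firmware_crashed() would go; the other classes would be handled without tainting.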
Luis
On Sat, May 09, 2020 at 07:15:23PM -0700, Shannon Nelson wrote:
> On 5/9/20 6:58 PM, Andrew Lunn wrote:
> > On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
> > As for firmware, how much damage can the firmware do as it crashed? If
> > it is a DMA master, it could of splattered stuff through
> > memory. Restarting the firmware is not going to reverse the damage it
> > has done.
> >
> True, and tho' the driver might get the thing restarted, it wouldn't
> necessarily know what kind of damage had ensued.
Indeed, it is those unknowns which we currently assume are just fine, but
in reality can be damaging. Today we just move on with life, but such
information is useful for analysis.
Luis
On Sat, 9 May 2020 18:01:51 -0700
Shannon Nelson <[email protected]> wrote:
> If the driver is able to detect that the device firmware has come back
> alive, through user intervention or whatever, should there be a way to
> "untaint" the kernel? Or would you expect it to remain tainted?
The only way to untaint a kernel is a reboot. A taint just means "something
happened to this kernel since it was booted". It's used as a hint, and
that's all.
I agree with the other comments in this thread. Use devlink health or
whatever tool to look further into causes. But from what I see here, this
code is "good enough" for a taint.
-- Steve