2022-04-28 03:00:39

by Guilherme G. Piccoli

[permalink] [raw]
Subject: [PATCH 24/30] panic: Refactor the panic path

The panic() function is somewhat convoluted - a lot of changes were
made over the years, adding comments that might be misleading/outdated
now, it has a code structure that is a bit complex to follow, with
lots of conditionals, for example. The panic notifier list is something
else - a single list, with multiple callbacks of different purposes,
that run in a non-deterministic order and may affect hardly kdump
reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.

This patch proposes a major refactor on the panic path based on Petr's
idea [0] - basically we split the notifiers list in three, having a set
of different call points in the panic path. Below a list of changes
proposed in this patch, culminating in the panic notifiers level
concept:

(a) First of all, we improved comments all over the function
and removed useless variables / includes. Also, as part of this
clean-up we concentrate the console flushing functions in a helper.

(b) As mentioned before, there is a split of the panic notifier list
in three, based on the purpose of the callback. The code contains
good documentation in form of comments, but a summary of the three
lists follows:

- the hypervisor list aims low-risk procedures to inform hypervisors
or firmware about the panic event, also includes LED-related functions;

- the informational list contains callbacks that provide more details,
like kernel offset or trace dump (if enabled) and also includes the
callbacks aimed at reducing log pollution or warns, like the RCU and
hung task disable callbacks;

- finally, the pre_reboot list is the old notifier list renamed,
containing the more risky callbacks that didn't fit the previous
lists. There is also a 4th list (the post_reboot one), but it's not
related with the original list - it contains late time architecture
callbacks aimed at stopping the machine, for example.

The 3 notifiers lists execute in different moments, hypervisor being
the first, followed by informational and finally the pre_reboot list.

(c) But then, there is the ordering problem of the notifiers against
the crash_kernel() call - kdump must be as reliable as possible.
For that, a simple binary "switch" as "crash_kexec_post_notifiers"
is not enough, hence we introduce here concept of panic notifier
levels: there are 5 levels, from 0 (no notifier executes before
kdump) until 4 (all notifiers run before kdump); the default level
is 2, in which the hypervisor and (iff we have any kmsg dumper)
the informational notifiers execute before kdump.

The detailed documentation of the levels is present in code comments
and in the kernel-parameters.txt file; as an analogy with the previous
panic() implementation, the level 0 is exactly the same as the old
behavior of notifiers, running all after kdump, and the level 4 is
the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
a deprecated one).

(d) Finally, an important change made here: we now use only the
function "crash_smp_send_stop()" to shut all the secondary CPUs
in the panic path. Before, there was a case of using the regular
"smp_send_stop()", but the better approach is to simplify the
code and try to use the function which was created exclusively
for the panic path. Experiments showed that it works fine, and
code was very simplified with that.

Functional change is expected from this refactor, since now we
call some notifiers by default before kdump, but the goal here
besides code clean-up is to have a better panic path, more
reliable and deterministic, but also very customizable.

[0] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Petr Mladek <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
---

Special thanks to Petr and Baoquan for the suggestion and feedback in a previous
email thread. There's some important design decisions that worth mentioning and
discussing:

* The default panic notifiers level is 2, based on Petr Mladek's suggestion,
which makes a lot of sense. Of course, this is customizable through the
parameter, but would be something worthwhile to have a KConfig option to set
the default level? It would help distros that want the old behavior
(no notifiers before kdump) as default.

* The implementation choice was to _avoid_ intricate if conditionals in the
panic path, which would _definitely_ be present with the panic notifiers levels
idea; so, instead of lots of if conditionals, the set/clear bits approach with
functions called in 2 points (but executing only in one of them) is much easier
to follow an was used here; the ordering helper function and the comments also
help a lot to avoid confusion (hopefully).

* Choice was to *always* use crash_smp_send_stop() instead of sometimes making
use of the regular smp_send_stop(); for most architectures they are the same,
including Xen (on x86). For the ones that override it, all should work fine,
in the powerpc case it's even more correct (see the subsequent patch
"powerpc: Do not force all panic notifiers to execute before kdump")

There seems to be 2 cases that requires some plumbing to work 100% right:
- ARM doesn't disable local interrupts / FIQs in the crash version of
send_stop(); we patched that early in this series;
- x86 could face an issue if we have VMX and do use crash_smp_send_stop()
_without_ kdump, but this is fixed in the first patch of the series (and
it's a bug present even before this refactor).

* Notice we didn't add a sysrq for panic notifiers level - should have it?
Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers",
let me know if you feel the need here Alejandro, since the core parameters are
present in /sys, I didn't consider much gain in having a sysrq, but of course
I'm open to suggestions!

Thanks advance for the review!

.../admin-guide/kernel-parameters.txt | 42 ++-
include/linux/panic_notifier.h | 1 +
kernel/kexec_core.c | 8 +-
kernel/panic.c | 292 +++++++++++++-----
.../selftests/pstore/pstore_crash_test | 5 +-
5 files changed, 252 insertions(+), 96 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..8d3524060ce3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -829,6 +829,13 @@
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.

+ crash_kexec_post_notifiers
+ This was DEPRECATED - users should always prefer the
+ parameter "panic_notifiers_level" - check its entry
+ in this documentation for details on how it works.
+ Setting this parameter is exactly the same as setting
+ "panic_notifiers_level=4".
+
cryptomgr.notests
[KNL] Disable crypto self-tests

@@ -3784,6 +3791,33 @@
timeout < 0: reboot immediately
Format: <timeout>

+ panic_notifiers_level=
+ [KNL] Set the panic notifiers execution order.
+ Format: <unsigned int>
+ We currently have 4 lists of panic notifiers; based
+ on the functionality and risk (for panic success) the
+ callbacks are added in a given list. The lists are:
+ - hypervisor/FW notification list (low risk);
+ - informational list (low/medium risk);
+ - pre_reboot list (higher risk);
+ - post_reboot list (only run late in panic and after
+ kdump, not configurable for now).
+ This parameter defines the ordering of the first 3
+ lists with regards to kdump; the levels determine
+ which set of notifiers execute before kdump. The
+ accepted levels are:
+ 0: kdump is the first thing to run, NO list is
+ executed before kdump.
+ 1: only the hypervisor list is executed before kdump.
+ 2 (default level): the hypervisor list and (*if*
+ there's any kmsg_dumper defined) the informational
+ list are executed before kdump.
+ 3: both the hypervisor and the informational lists
+ (always) execute before kdump.
+ 4: the 3 lists (hypervisor, info and pre_reboot)
+ execute before kdump - this behavior is analog to the
+ deprecated parameter "crash_kexec_post_notifiers".
+
panic_print= Bitmask for printing system info when panic happens.
User can chose combination of the following bits:
bit 0: print all tasks info
@@ -3814,14 +3848,6 @@
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().

- crash_kexec_post_notifiers
- Run kdump after running panic-notifiers and dumping
- kmsg. This only for the users who doubt kdump always
- succeeds in any situation.
- Note that this also increases risks of kdump failure,
- because some panic notifiers can make the crashed
- kernel more unstable.
-
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
Format: <parport#>
diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
index bcf6a5ea9d7f..b5041132321d 100644
--- a/include/linux/panic_notifier.h
+++ b/include/linux/panic_notifier.h
@@ -10,6 +10,7 @@ extern struct atomic_notifier_head panic_info_list;
extern struct atomic_notifier_head panic_pre_reboot_list;
extern struct atomic_notifier_head panic_post_reboot_list;

+bool panic_notifiers_before_kdump(void);
extern bool crash_kexec_post_notifiers;

enum panic_notifier_val {
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 68480f731192..f8906db8ca72 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -74,11 +74,11 @@ struct resource crashk_low_res = {
int kexec_should_crash(struct task_struct *p)
{
/*
- * If crash_kexec_post_notifiers is enabled, don't run
- * crash_kexec() here yet, which must be run after panic
- * notifiers in panic().
+ * In case any panic notifiers are set to execute before kexec,
+ * don't run crash_kexec() here yet, which must run after these
+ * panic notifiers are executed, in the panic() path.
*/
- if (crash_kexec_post_notifiers)
+ if (panic_notifiers_before_kdump())
return 0;
/*
* There are 4 panic() calls in make_task_dead() path, each of which
diff --git a/kernel/panic.c b/kernel/panic.c
index bf792102b43e..b7c055d4f87f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -15,7 +15,6 @@
#include <linux/kgdb.h>
#include <linux/kmsg_dump.h>
#include <linux/kallsyms.h>
-#include <linux/notifier.h>
#include <linux/vt_kern.h>
#include <linux/module.h>
#include <linux/random.h>
@@ -52,14 +51,23 @@ static unsigned long tainted_mask =
static int pause_on_oops;
static int pause_on_oops_flag;
static DEFINE_SPINLOCK(pause_on_oops_lock);
-bool crash_kexec_post_notifiers;
+
int panic_on_warn __read_mostly;
+bool panic_on_taint_nousertaint;
unsigned long panic_on_taint;
-bool panic_on_taint_nousertaint = false;

int panic_timeout = CONFIG_PANIC_TIMEOUT;
EXPORT_SYMBOL_GPL(panic_timeout);

+/* Initialized with all notifiers set to run before kdump */
+static unsigned long panic_notifiers_bits = 15;
+
+/* Default level is 2, see kernel-parameters.txt */
+unsigned int panic_notifiers_level = 2;
+
+/* DEPRECATED in favor of panic_notifiers_level */
+bool crash_kexec_post_notifiers;
+
#define PANIC_PRINT_TASK_INFO 0x00000001
#define PANIC_PRINT_MEM_INFO 0x00000002
#define PANIC_PRINT_TIMER_INFO 0x00000004
@@ -109,10 +117,14 @@ void __weak nmi_panic_self_stop(struct pt_regs *regs)
}

/*
- * Stop other CPUs in panic. Architecture dependent code may override this
- * with more suitable version. For example, if the architecture supports
- * crash dump, it should save registers of each stopped CPU and disable
- * per-CPU features such as virtualization extensions.
+ * Stop other CPUs in panic context.
+ *
+ * Architecture dependent code may override this with more suitable version.
+ * For example, if the architecture supports crash dump, it should save the
+ * registers of each stopped CPU and disable per-CPU features such as
+ * virtualization extensions. When not overridden in arch code (and for
+ * x86/xen), this is exactly the same as execute smp_send_stop(), but
+ * guarded against duplicate execution.
*/
void __weak crash_smp_send_stop(void)
{
@@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush)
ftrace_dump(DUMP_ALL);
}

+/*
+ * Helper that accumulates all console flushing routines executed on panic.
+ */
+static void console_flushing(void)
+{
+#ifdef CONFIG_VT
+ unblank_screen();
+#endif
+ console_unblank();
+
+ /*
+ * In this point, we may have disabled other CPUs, hence stopping the
+ * CPU holding the lock while still having some valuable data in the
+ * console buffer.
+ *
+ * Try to acquire the lock then release it regardless of the result.
+ * The release will also print the buffers out. Locks debug should
+ * be disabled to avoid reporting bad unlock balance when panic()
+ * is not being called from OOPS.
+ */
+ debug_locks_off();
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+
+ panic_print_sys_info(true);
+}
+
+#define PN_HYPERVISOR_BIT 0
+#define PN_INFO_BIT 1
+#define PN_PRE_REBOOT_BIT 2
+#define PN_POST_REBOOT_BIT 3
+
+/*
+ * Determine the order of panic notifiers with regards to kdump.
+ *
+ * This function relies in the "panic_notifiers_level" kernel parameter
+ * to determine how to order the notifiers with regards to kdump. We
+ * have currently 5 levels. For details, please check the kernel docs for
+ * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt.
+ *
+ * Default level is 2, which means the panic hypervisor and informational
+ * (unless we don't have any kmsg_dumper) lists will execute before kdump.
+ */
+static void order_panic_notifiers_and_kdump(void)
+{
+ /*
+ * The parameter "crash_kexec_post_notifiers" is deprecated, but
+ * valid. Users that set it want really all panic notifiers to
+ * execute before kdump, so it's effectively the same as setting
+ * the panic notifiers level to 4.
+ */
+ if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers)
+ return;
+
+ /*
+ * Based on the level configured (smaller than 4), we clear the
+ * proper bits in "panic_notifiers_bits". Notice that this bitfield
+ * is initialized with all notifiers set.
+ */
+ switch (panic_notifiers_level) {
+ case 3:
+ clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+ break;
+ case 2:
+ clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+
+ if (!kmsg_has_dumpers())
+ clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
+ break;
+ case 1:
+ clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+ clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
+ break;
+ case 0:
+ clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
+ clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
+ clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
+ break;
+ }
+}
+
+/*
+ * Set of helpers to execute the panic notifiers only once.
+ * Just the informational notifier cares about the return.
+ */
+static inline bool notifier_run_once(struct atomic_notifier_head head,
+ char *buf, long bit)
+{
+ if (test_and_change_bit(bit, &panic_notifiers_bits)) {
+ atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf);
+ return true;
+ }
+ return false;
+}
+
+#define panic_notifier_hypervisor_once(buf)\
+ notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT)
+
+#define panic_notifier_info_once(buf)\
+ notifier_run_once(panic_info_list, buf, PN_INFO_BIT)
+
+#define panic_notifier_pre_reboot_once(buf)\
+ notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT)
+
+#define panic_notifier_post_reboot_once(buf)\
+ notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT)
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -198,32 +316,29 @@ void panic(const char *fmt, ...)
long i, i_next = 0, len;
int state = 0;
int old_cpu, this_cpu;
- bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers;

- if (panic_on_warn) {
- /*
- * This thread may hit another WARN() in the panic path.
- * Resetting this prevents additional WARN() from panicking the
- * system on this thread. Other threads are blocked by the
- * panic_mutex in panic().
- */
- panic_on_warn = 0;
- }
+ /*
+ * This thread may hit another WARN() in the panic path, so
+ * resetting this option prevents additional WARN() from
+ * re-panicking the system here.
+ */
+ panic_on_warn = 0;

/*
* Disable local interrupts. This will prevent panic_smp_self_stop
- * from deadlocking the first cpu that invokes the panic, since
- * there is nothing to prevent an interrupt handler (that runs
- * after setting panic_cpu) from invoking panic() again.
+ * from deadlocking the first cpu that invokes the panic, since there
+ * is nothing to prevent an interrupt handler (that runs after setting
+ * panic_cpu) from invoking panic() again. Also disables preemption
+ * here - notice it's not safe to rely on interrupt disabling to avoid
+ * preemption, since any cond_resched() or cond_resched_lock() might
+ * trigger a reschedule if the preempt count is 0 (for reference, see
+ * Documentation/locking/preempt-locking.rst). Some functions called
+ * from here want preempt disabled, so no point enabling it later.
*/
local_irq_disable();
preempt_disable_notrace();

/*
- * It's possible to come here directly from a panic-assertion and
- * not have preempt disabled. Some functions called from here want
- * preempt to be disabled. No point enabling it later though...
- *
* Only one CPU is allowed to execute the panic code from here. For
* multiple parallel invocations of panic, all other CPUs either
* stop themself or will wait until they are stopped by the 1st CPU
@@ -266,73 +381,75 @@ void panic(const char *fmt, ...)
kgdb_panic(buf);

/*
- * If we have crashed and we have a crash kernel loaded let it handle
- * everything else.
- * If we want to run this after calling panic_notifiers, pass
- * the "crash_kexec_post_notifiers" option to the kernel.
+ * Here lies one of the most subtle parts of the panic path,
+ * the panic notifiers and their order with regards to kdump.
+ * We currently have 4 sets of notifiers:
*
- * Bypass the panic_cpu check and call __crash_kexec directly.
+ * - the hypervisor list is composed by callbacks that are related
+ * to warn the FW / hypervisor about panic, or non-invasive LED
+ * controlling functions - (hopefully) low-risk for kdump, should
+ * run early if possible.
+ *
+ * - the informational list is composed by functions dumping data
+ * like kernel offsets, device error registers or tracing buffer;
+ * also log flooding prevention callbacks fit in this list. It is
+ * relatively safe to run before kdump.
+ *
+ * - the pre_reboot list basically is everything else, all the
+ * callbacks that don't fit in the 2 previous lists. It should
+ * run *after* kdump if possible, as it contains high-risk
+ * functions that may break kdump.
+ *
+ * - we also have a 4th list of notifiers, the post_reboot
+ * callbacks. This is not strongly related to kdump since it's
+ * always executed late in the panic path, after the restart
+ * mechanism (if set); its goal is to provide a way for
+ * architecture code effectively power-off/disable the system.
+ *
+ * The kernel provides the "panic_notifiers_level" parameter
+ * to adjust the ordering in which these notifiers should run
+ * with regards to kdump - the default level is 2, so both the
+ * hypervisor and informational notifiers should execute before
+ * the __crash_kexec(); the info notifier won't run by default
+ * unless there's some kmsg_dumper() registered. For details
+ * about it, check Documentation/admin-guide/kernel-parameters.txt.
+ *
+ * Notice that the code relies in bits set/clear operations to
+ * determine the ordering, functions *_once() execute only one
+ * time, as their name implies. The goal is to prevent too much
+ * if conditionals and more confusion. Finally, regarding CPUs
+ * disabling: unless NO panic notifier executes before kdump,
+ * we always disable secondary CPUs before __crash_kexec() and
+ * the notifiers execute.
*/
- if (!_crash_kexec_post_notifiers) {
+ order_panic_notifiers_and_kdump();
+
+ /* If no level, we should kdump ASAP. */
+ if (!panic_notifiers_level)
__crash_kexec(NULL);

- /*
- * Note smp_send_stop is the usual smp shutdown function, which
- * unfortunately means it may not be hardened to work in a
- * panic situation.
- */
- smp_send_stop();
- } else {
- /*
- * If we want to do crash dump after notifier calls and
- * kmsg_dump, we will need architecture dependent extra
- * works in addition to stopping other CPUs.
- */
- crash_smp_send_stop();
- }
+ crash_smp_send_stop();
+ panic_notifier_hypervisor_once(buf);

- /*
- * Run any panic handlers, including those that might need to
- * add information to the kmsg dump output.
- */
- atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf);
- atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf);
- atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf);
+ if (panic_notifier_info_once(buf)) {
+ panic_print_sys_info(false);
+ kmsg_dump(KMSG_DUMP_PANIC);
+ }

- panic_print_sys_info(false);
+ panic_notifier_pre_reboot_once(buf);

- kmsg_dump(KMSG_DUMP_PANIC);
+ __crash_kexec(NULL);

- /*
- * If you doubt kdump always works fine in any situation,
- * "crash_kexec_post_notifiers" offers you a chance to run
- * panic_notifiers and dumping kmsg before kdump.
- * Note: since some panic_notifiers can make crashed kernel
- * more unstable, it can increase risks of the kdump failure too.
- *
- * Bypass the panic_cpu check and call __crash_kexec directly.
- */
- if (_crash_kexec_post_notifiers)
- __crash_kexec(NULL);
+ panic_notifier_hypervisor_once(buf);

-#ifdef CONFIG_VT
- unblank_screen();
-#endif
- console_unblank();
-
- /*
- * We may have ended up stopping the CPU holding the lock (in
- * smp_send_stop()) while still having some valuable data in the console
- * buffer. Try to acquire the lock then release it regardless of the
- * result. The release will also print the buffers out. Locks debug
- * should be disabled to avoid reporting bad unlock balance when
- * panic() is not being callled from OOPS.
- */
- debug_locks_off();
- console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+ if (panic_notifier_info_once(buf)) {
+ panic_print_sys_info(false);
+ kmsg_dump(KMSG_DUMP_PANIC);
+ }

- panic_print_sys_info(true);
+ panic_notifier_pre_reboot_once(buf);

+ console_flushing();
if (!panic_blink)
panic_blink = no_blink;

@@ -363,8 +480,7 @@ void panic(const char *fmt, ...)
emergency_restart();
}

- atomic_notifier_call_chain(&panic_post_reboot_list,
- PANIC_NOTIFIER, buf);
+ panic_notifier_post_reboot_once(buf);

pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);

@@ -383,6 +499,15 @@ void panic(const char *fmt, ...)

EXPORT_SYMBOL(panic);

+/*
+ * Helper used in the kexec code, to validate if any
+ * panic notifier is set to execute early, before kdump.
+ */
+inline bool panic_notifiers_before_kdump(void)
+{
+ return panic_notifiers_level || crash_kexec_post_notifiers;
+}
+
/*
* TAINT_FORCED_RMMOD could be a per-module flag but the module
* is being removed anyway.
@@ -692,6 +817,9 @@ core_param(panic, panic_timeout, int, 0644);
core_param(panic_print, panic_print, ulong, 0644);
core_param(pause_on_oops, pause_on_oops, int, 0644);
core_param(panic_on_warn, panic_on_warn, int, 0644);
+core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644);
+
+/* DEPRECATED in favor of panic_notifiers_level */
core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644);

static int __init oops_setup(char *s)
diff --git a/tools/testing/selftests/pstore/pstore_crash_test b/tools/testing/selftests/pstore/pstore_crash_test
index 2a329bbb4aca..1e60ce4501aa 100755
--- a/tools/testing/selftests/pstore/pstore_crash_test
+++ b/tools/testing/selftests/pstore/pstore_crash_test
@@ -25,6 +25,7 @@ touch $REBOOT_FLAG
sync

# cause crash
-# Note: If you use kdump and want to see kmesg-* files after reboot, you should
-# specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline.
+# Note: If you use kdump and want to see kmsg-* files after reboot, you should
+# be sure that the parameter "panic_notifiers_level" is more than '2' (the
+# default value for this parameter is '2') in the first kernel's cmdline.
echo c > /proc/sysrq-trigger
--
2.36.0


2022-04-28 04:38:22

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 24/30] panic: Refactor the panic path



On 4/27/22 15:49, Guilherme G. Piccoli wrote:
> + crash_kexec_post_notifiers
> + This was DEPRECATED - users should always prefer the

This is DEPRECATED - users should always prefer the

> + parameter "panic_notifiers_level" - check its entry
> + in this documentation for details on how it works.
> + Setting this parameter is exactly the same as setting
> + "panic_notifiers_level=4".

--
~Randy

2022-04-30 09:27:40

by Guilherme G. Piccoli

[permalink] [raw]
Subject: Re: [PATCH 24/30] panic: Refactor the panic path

On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> From: Guilherme G. Piccoli <[email protected]> Sent: Wednesday, April 27, 2022 3:49 PM
>> [...]
>> + panic_notifiers_level=
>> + [KNL] Set the panic notifiers execution order.
>> + Format: <unsigned int>
>> + We currently have 4 lists of panic notifiers; based
>> + on the functionality and risk (for panic success) the
>> + callbacks are added in a given list. The lists are:
>> + - hypervisor/FW notification list (low risk);
>> + - informational list (low/medium risk);
>> + - pre_reboot list (higher risk);
>> + - post_reboot list (only run late in panic and after
>> + kdump, not configurable for now).
>> + This parameter defines the ordering of the first 3
>> + lists with regards to kdump; the levels determine
>> + which set of notifiers execute before kdump. The
>> + accepted levels are:
>> + 0: kdump is the first thing to run, NO list is
>> + executed before kdump.
>> + 1: only the hypervisor list is executed before kdump.
>> + 2 (default level): the hypervisor list and (*if*
>> + there's any kmsg_dumper defined) the informational
>> + list are executed before kdump.
>> + 3: both the hypervisor and the informational lists
>> + (always) execute before kdump.
>
> I'm not clear on why level 2 exists. What is the scenario where
> execution of the info list before kdump should be conditional on the
> existence of a kmsg_dumper? Maybe the scenario is described
> somewhere in the patch set and I just missed it.
>

Hi Michael, thanks for your review/consideration. So, this idea started
kind of some time ago. It all started with a need of exposing more
information on kernel log *before* kdump and *before* pstore -
specifically, we're talking about panic_print. But this cause some
reactions, Baoquan was very concerned with that [0]. Soon after, I've
proposed a panic notifiers filter (orthogonal) approach, to which Petr
suggested instead doing a major refactor [1] - it finally is alive in
the form of this series.

The theory behind the level 2 is to allow a scenario of kdump with the
minimum amount of notifiers - what is the point in printing more
information if the user doesn't care, since it's going to kdump? Now, if
there is a kmsg dumper, it means that there is likely some interest in
collecting information, and that might as well be required before the
potential kdump (which is my case, hence the proposal on [0]).

Instead of forcing one of the two behaviors (level 1 or level 3), we
have a middle-term/compromise: if there's interest in collecting such
data (in the form of a kmsg dumper), we then execute the informational
notifiers before kdump. If not, why to increase (even slightly) the risk
for kdump?

I'm OK in removing the level 2 if people prefer, but I don't feel it's a
burden, quite opposite - seems a good way to accommodate the somewhat
antagonistic ideas (jump to kdump ASAP vs collecting more info in the
panicked kernel log).

[0] https://lore.kernel.org/lkml/[email protected]/

[1] https://lore.kernel.org/lkml/[email protected]/


>[...]
>> + * Based on the level configured (smaller than 4), we clear the
>> + * proper bits in "panic_notifiers_bits". Notice that this bitfield
>> + * is initialized with all notifiers set.
>> + */
>> + switch (panic_notifiers_level) {
>> + case 3:
>> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> + break;
>> + case 2:
>> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +
>> + if (!kmsg_has_dumpers())
>> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> + break;
>> + case 1:
>> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> + break;
>> + case 0:
>> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
>> + break;
>> + }
>
> I think the above switch statement could be done as follows:
>
> if (panic_notifiers_level <= 3)
> clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level <= 2)
> if (!kmsg_has_dumpers())
> clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level <=1)
> clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level == 0)
> clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
>
> That's about half the lines of code. It's somewhat a matter of style,
> so treat this as just a suggestion to consider. I just end up looking
> for a better solution when I see the same line of code repeated
> 3 or 4 times!
>

It's a good idea - I liked your code. The switch seems more
natural/explicit for me, even duplicating some lines, but in case more
people prefer your way, I can definitely change the code - thanks for
the suggestion.
Cheers,


Guilherme

2022-05-03 00:11:19

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 24/30] panic: Refactor the panic path

From: Guilherme G. Piccoli <[email protected]> Sent: Wednesday, April 27, 2022 3:49 PM
>
> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
>
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
>
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
>
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
>
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
>
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
>
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
>
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
>
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
>
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
>
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
>
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
>
> [0] https://lore.kernel.org/lkml/[email protected]/
>
> Suggested-by: Petr Mladek <[email protected]>
> Signed-off-by: Guilherme G. Piccoli <[email protected]>
> ---
>
> Special thanks to Petr and Baoquan for the suggestion and feedback in a previous
> email thread. There's some important design decisions that worth mentioning and
> discussing:
>
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but would be something worthwhile to have a KConfig option to set
> the default level? It would help distros that want the old behavior
> (no notifiers before kdump) as default.
>
> * The implementation choice was to _avoid_ intricate if conditionals in the
> panic path, which would _definitely_ be present with the panic notifiers levels
> idea; so, instead of lots of if conditionals, the set/clear bits approach with
> functions called in 2 points (but executing only in one of them) is much easier
> to follow an was used here; the ordering helper function and the comments also
> help a lot to avoid confusion (hopefully).
>
> * Choice was to *always* use crash_smp_send_stop() instead of sometimes making
> use of the regular smp_send_stop(); for most architectures they are the same,
> including Xen (on x86). For the ones that override it, all should work fine,
> in the powerpc case it's even more correct (see the subsequent patch
> "powerpc: Do not force all panic notifiers to execute before kdump")
>
> There seems to be 2 cases that requires some plumbing to work 100% right:
> - ARM doesn't disable local interrupts / FIQs in the crash version of
> send_stop(); we patched that early in this series;
> - x86 could face an issue if we have VMX and do use crash_smp_send_stop()
> _without_ kdump, but this is fixed in the first patch of the series (and
> it's a bug present even before this refactor).
>
> * Notice we didn't add a sysrq for panic notifiers level - should have it?
> Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers",
> let me know if you feel the need here Alejandro, since the core parameters are
> present in /sys, I didn't consider much gain in having a sysrq, but of course
> I'm open to suggestions!
>
> Thanks advance for the review!
>
> .../admin-guide/kernel-parameters.txt | 42 ++-
> include/linux/panic_notifier.h | 1 +
> kernel/kexec_core.c | 8 +-
> kernel/panic.c | 292 +++++++++++++-----
> .../selftests/pstore/pstore_crash_test | 5 +-
> 5 files changed, 252 insertions(+), 96 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..8d3524060ce3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -829,6 +829,13 @@
> It will be ignored when crashkernel=X,high is not used
> or memory reserved is below 4G.
>
> + crash_kexec_post_notifiers
> + This was DEPRECATED - users should always prefer the
> + parameter "panic_notifiers_level" - check its entry
> + in this documentation for details on how it works.
> + Setting this parameter is exactly the same as setting
> + "panic_notifiers_level=4".
> +
> cryptomgr.notests
> [KNL] Disable crypto self-tests
>
> @@ -3784,6 +3791,33 @@
> timeout < 0: reboot immediately
> Format: <timeout>
>
> + panic_notifiers_level=
> + [KNL] Set the panic notifiers execution order.
> + Format: <unsigned int>
> + We currently have 4 lists of panic notifiers; based
> + on the functionality and risk (for panic success) the
> + callbacks are added in a given list. The lists are:
> + - hypervisor/FW notification list (low risk);
> + - informational list (low/medium risk);
> + - pre_reboot list (higher risk);
> + - post_reboot list (only run late in panic and after
> + kdump, not configurable for now).
> + This parameter defines the ordering of the first 3
> + lists with regards to kdump; the levels determine
> + which set of notifiers execute before kdump. The
> + accepted levels are:
> + 0: kdump is the first thing to run, NO list is
> + executed before kdump.
> + 1: only the hypervisor list is executed before kdump.
> + 2 (default level): the hypervisor list and (*if*
> + there's any kmsg_dumper defined) the informational
> + list are executed before kdump.
> + 3: both the hypervisor and the informational lists
> + (always) execute before kdump.

I'm not clear on why level 2 exists. What is the scenario where
execution of the info list before kdump should be conditional on the
existence of a kmsg_dumper? Maybe the scenario is described
somewhere in the patch set and I just missed it.

> + 4: the 3 lists (hypervisor, info and pre_reboot)
> + execute before kdump - this behavior is analog to the
> + deprecated parameter "crash_kexec_post_notifiers".
> +
> panic_print= Bitmask for printing system info when panic happens.
> User can chose combination of the following bits:
> bit 0: print all tasks info
> @@ -3814,14 +3848,6 @@
> panic_on_warn panic() instead of WARN(). Useful to cause kdump
> on a WARN().
>
> - crash_kexec_post_notifiers
> - Run kdump after running panic-notifiers and dumping
> - kmsg. This only for the users who doubt kdump always
> - succeeds in any situation.
> - Note that this also increases risks of kdump failure,
> - because some panic notifiers can make the crashed
> - kernel more unstable.
> -
> parkbd.port= [HW] Parallel port number the keyboard adapter is
> connected to, default is 0.
> Format: <parport#>
> diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
> index bcf6a5ea9d7f..b5041132321d 100644
> --- a/include/linux/panic_notifier.h
> +++ b/include/linux/panic_notifier.h
> @@ -10,6 +10,7 @@ extern struct atomic_notifier_head panic_info_list;
> extern struct atomic_notifier_head panic_pre_reboot_list;
> extern struct atomic_notifier_head panic_post_reboot_list;
>
> +bool panic_notifiers_before_kdump(void);
> extern bool crash_kexec_post_notifiers;
>
> enum panic_notifier_val {
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 68480f731192..f8906db8ca72 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -74,11 +74,11 @@ struct resource crashk_low_res = {
> int kexec_should_crash(struct task_struct *p)
> {
> /*
> - * If crash_kexec_post_notifiers is enabled, don't run
> - * crash_kexec() here yet, which must be run after panic
> - * notifiers in panic().
> + * In case any panic notifiers are set to execute before kexec,
> + * don't run crash_kexec() here yet, which must run after these
> + * panic notifiers are executed, in the panic() path.
> */
> - if (crash_kexec_post_notifiers)
> + if (panic_notifiers_before_kdump())
> return 0;
> /*
> * There are 4 panic() calls in make_task_dead() path, each of which
> diff --git a/kernel/panic.c b/kernel/panic.c
> index bf792102b43e..b7c055d4f87f 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -15,7 +15,6 @@
> #include <linux/kgdb.h>
> #include <linux/kmsg_dump.h>
> #include <linux/kallsyms.h>
> -#include <linux/notifier.h>
> #include <linux/vt_kern.h>
> #include <linux/module.h>
> #include <linux/random.h>
> @@ -52,14 +51,23 @@ static unsigned long tainted_mask =
> static int pause_on_oops;
> static int pause_on_oops_flag;
> static DEFINE_SPINLOCK(pause_on_oops_lock);
> -bool crash_kexec_post_notifiers;
> +
> int panic_on_warn __read_mostly;
> +bool panic_on_taint_nousertaint;
> unsigned long panic_on_taint;
> -bool panic_on_taint_nousertaint = false;
>
> int panic_timeout = CONFIG_PANIC_TIMEOUT;
> EXPORT_SYMBOL_GPL(panic_timeout);
>
> +/* Initialized with all notifiers set to run before kdump */
> +static unsigned long panic_notifiers_bits = 15;
> +
> +/* Default level is 2, see kernel-parameters.txt */
> +unsigned int panic_notifiers_level = 2;
> +
> +/* DEPRECATED in favor of panic_notifiers_level */
> +bool crash_kexec_post_notifiers;
> +
> #define PANIC_PRINT_TASK_INFO 0x00000001
> #define PANIC_PRINT_MEM_INFO 0x00000002
> #define PANIC_PRINT_TIMER_INFO 0x00000004
> @@ -109,10 +117,14 @@ void __weak nmi_panic_self_stop(struct pt_regs *regs)
> }
>
> /*
> - * Stop other CPUs in panic. Architecture dependent code may override this
> - * with more suitable version. For example, if the architecture supports
> - * crash dump, it should save registers of each stopped CPU and disable
> - * per-CPU features such as virtualization extensions.
> + * Stop other CPUs in panic context.
> + *
> + * Architecture dependent code may override this with more suitable version.
> + * For example, if the architecture supports crash dump, it should save the
> + * registers of each stopped CPU and disable per-CPU features such as
> + * virtualization extensions. When not overridden in arch code (and for
> + * x86/xen), this is exactly the same as execute smp_send_stop(), but
> + * guarded against duplicate execution.
> */
> void __weak crash_smp_send_stop(void)
> {
> @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush)
> ftrace_dump(DUMP_ALL);
> }
>
> +/*
> + * Helper that accumulates all console flushing routines executed on panic.
> + */
> +static void console_flushing(void)
> +{
> +#ifdef CONFIG_VT
> + unblank_screen();
> +#endif
> + console_unblank();
> +
> + /*
> + * In this point, we may have disabled other CPUs, hence stopping the
> + * CPU holding the lock while still having some valuable data in the
> + * console buffer.
> + *
> + * Try to acquire the lock then release it regardless of the result.
> + * The release will also print the buffers out. Locks debug should
> + * be disabled to avoid reporting bad unlock balance when panic()
> + * is not being called from OOPS.
> + */
> + debug_locks_off();
> + console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +
> + panic_print_sys_info(true);
> +}
> +
> +#define PN_HYPERVISOR_BIT 0
> +#define PN_INFO_BIT 1
> +#define PN_PRE_REBOOT_BIT 2
> +#define PN_POST_REBOOT_BIT 3
> +
> +/*
> + * Determine the order of panic notifiers with regards to kdump.
> + *
> + * This function relies in the "panic_notifiers_level" kernel parameter
> + * to determine how to order the notifiers with regards to kdump. We
> + * have currently 5 levels. For details, please check the kernel docs for
> + * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt.
> + *
> + * Default level is 2, which means the panic hypervisor and informational
> + * (unless we don't have any kmsg_dumper) lists will execute before kdump.
> + */
> +static void order_panic_notifiers_and_kdump(void)
> +{
> + /*
> + * The parameter "crash_kexec_post_notifiers" is deprecated, but
> + * valid. Users that set it want really all panic notifiers to
> + * execute before kdump, so it's effectively the same as setting
> + * the panic notifiers level to 4.
> + */
> + if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers)
> + return;
> +
> + /*
> + * Based on the level configured (smaller than 4), we clear the
> + * proper bits in "panic_notifiers_bits". Notice that this bitfield
> + * is initialized with all notifiers set.
> + */
> + switch (panic_notifiers_level) {
> + case 3:
> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> + break;
> + case 2:
> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +
> + if (!kmsg_has_dumpers())
> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> + break;
> + case 1:
> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> + break;
> + case 0:
> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> + break;
> + }

I think the above switch statement could be done as follows:

if (panic_notifiers_level <= 3)
clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
if (panic_notifiers_level <= 2)
if (!kmsg_has_dumpers())
clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
if (panic_notifiers_level <=1)
clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
if (panic_notifiers_level == 0)
clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);

That's about half the lines of code. It's somewhat a matter of style,
so treat this as just a suggestion to consider. I just end up looking
for a better solution when I see the same line of code repeated
3 or 4 times!

> +}
> +
> +/*
> + * Set of helpers to execute the panic notifiers only once.
> + * Just the informational notifier cares about the return.
> + */
> +static inline bool notifier_run_once(struct atomic_notifier_head head,
> + char *buf, long bit)
> +{
> + if (test_and_change_bit(bit, &panic_notifiers_bits)) {
> + atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf);
> + return true;
> + }
> + return false;
> +}
> +
> +#define panic_notifier_hypervisor_once(buf)\
> + notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT)
> +
> +#define panic_notifier_info_once(buf)\
> + notifier_run_once(panic_info_list, buf, PN_INFO_BIT)
> +
> +#define panic_notifier_pre_reboot_once(buf)\
> + notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT)
> +
> +#define panic_notifier_post_reboot_once(buf)\
> + notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT)
> +
> /**
> * panic - halt the system
> * @fmt: The text string to print
> @@ -198,32 +316,29 @@ void panic(const char *fmt, ...)
> long i, i_next = 0, len;
> int state = 0;
> int old_cpu, this_cpu;
> - bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers;
>
> - if (panic_on_warn) {
> - /*
> - * This thread may hit another WARN() in the panic path.
> - * Resetting this prevents additional WARN() from panicking the
> - * system on this thread. Other threads are blocked by the
> - * panic_mutex in panic().
> - */
> - panic_on_warn = 0;
> - }
> + /*
> + * This thread may hit another WARN() in the panic path, so
> + * resetting this option prevents additional WARN() from
> + * re-panicking the system here.
> + */
> + panic_on_warn = 0;
>
> /*
> * Disable local interrupts. This will prevent panic_smp_self_stop
> - * from deadlocking the first cpu that invokes the panic, since
> - * there is nothing to prevent an interrupt handler (that runs
> - * after setting panic_cpu) from invoking panic() again.
> + * from deadlocking the first cpu that invokes the panic, since there
> + * is nothing to prevent an interrupt handler (that runs after setting
> + * panic_cpu) from invoking panic() again. Also disables preemption
> + * here - notice it's not safe to rely on interrupt disabling to avoid
> + * preemption, since any cond_resched() or cond_resched_lock() might
> + * trigger a reschedule if the preempt count is 0 (for reference, see
> + * Documentation/locking/preempt-locking.rst). Some functions called
> + * from here want preempt disabled, so no point enabling it later.
> */
> local_irq_disable();
> preempt_disable_notrace();
>
> /*
> - * It's possible to come here directly from a panic-assertion and
> - * not have preempt disabled. Some functions called from here want
> - * preempt to be disabled. No point enabling it later though...
> - *
> * Only one CPU is allowed to execute the panic code from here. For
> * multiple parallel invocations of panic, all other CPUs either
> * stop themself or will wait until they are stopped by the 1st CPU
> @@ -266,73 +381,75 @@ void panic(const char *fmt, ...)
> kgdb_panic(buf);
>
> /*
> - * If we have crashed and we have a crash kernel loaded let it handle
> - * everything else.
> - * If we want to run this after calling panic_notifiers, pass
> - * the "crash_kexec_post_notifiers" option to the kernel.
> + * Here lies one of the most subtle parts of the panic path,
> + * the panic notifiers and their order with regards to kdump.
> + * We currently have 4 sets of notifiers:
> *
> - * Bypass the panic_cpu check and call __crash_kexec directly.
> + * - the hypervisor list is composed by callbacks that are related
> + * to warn the FW / hypervisor about panic, or non-invasive LED
> + * controlling functions - (hopefully) low-risk for kdump, should
> + * run early if possible.
> + *
> + * - the informational list is composed by functions dumping data
> + * like kernel offsets, device error registers or tracing buffer;
> + * also log flooding prevention callbacks fit in this list. It is
> + * relatively safe to run before kdump.
> + *
> + * - the pre_reboot list basically is everything else, all the
> + * callbacks that don't fit in the 2 previous lists. It should
> + * run *after* kdump if possible, as it contains high-risk
> + * functions that may break kdump.
> + *
> + * - we also have a 4th list of notifiers, the post_reboot
> + * callbacks. This is not strongly related to kdump since it's
> + * always executed late in the panic path, after the restart
> + * mechanism (if set); its goal is to provide a way for
> + * architecture code effectively power-off/disable the system.
> + *
> + * The kernel provides the "panic_notifiers_level" parameter
> + * to adjust the ordering in which these notifiers should run
> + * with regards to kdump - the default level is 2, so both the
> + * hypervisor and informational notifiers should execute before
> + * the __crash_kexec(); the info notifier won't run by default
> + * unless there's some kmsg_dumper() registered. For details
> + * about it, check Documentation/admin-guide/kernel-parameters.txt.
> + *
> + * Notice that the code relies in bits set/clear operations to
> + * determine the ordering, functions *_once() execute only one
> + * time, as their name implies. The goal is to prevent too much
> + * if conditionals and more confusion. Finally, regarding CPUs
> + * disabling: unless NO panic notifier executes before kdump,
> + * we always disable secondary CPUs before __crash_kexec() and
> + * the notifiers execute.
> */
> - if (!_crash_kexec_post_notifiers) {
> + order_panic_notifiers_and_kdump();
> +
> + /* If no level, we should kdump ASAP. */
> + if (!panic_notifiers_level)
> __crash_kexec(NULL);
>
> - /*
> - * Note smp_send_stop is the usual smp shutdown function, which
> - * unfortunately means it may not be hardened to work in a
> - * panic situation.
> - */
> - smp_send_stop();
> - } else {
> - /*
> - * If we want to do crash dump after notifier calls and
> - * kmsg_dump, we will need architecture dependent extra
> - * works in addition to stopping other CPUs.
> - */
> - crash_smp_send_stop();
> - }
> + crash_smp_send_stop();
> + panic_notifier_hypervisor_once(buf);
>
> - /*
> - * Run any panic handlers, including those that might need to
> - * add information to the kmsg dump output.
> - */
> - atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf);
> - atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf);
> - atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf);
> + if (panic_notifier_info_once(buf)) {
> + panic_print_sys_info(false);
> + kmsg_dump(KMSG_DUMP_PANIC);
> + }
>
> - panic_print_sys_info(false);
> + panic_notifier_pre_reboot_once(buf);
>
> - kmsg_dump(KMSG_DUMP_PANIC);
> + __crash_kexec(NULL);
>
> - /*
> - * If you doubt kdump always works fine in any situation,
> - * "crash_kexec_post_notifiers" offers you a chance to run
> - * panic_notifiers and dumping kmsg before kdump.
> - * Note: since some panic_notifiers can make crashed kernel
> - * more unstable, it can increase risks of the kdump failure too.
> - *
> - * Bypass the panic_cpu check and call __crash_kexec directly.
> - */
> - if (_crash_kexec_post_notifiers)
> - __crash_kexec(NULL);
> + panic_notifier_hypervisor_once(buf);
>
> -#ifdef CONFIG_VT
> - unblank_screen();
> -#endif
> - console_unblank();
> -
> - /*
> - * We may have ended up stopping the CPU holding the lock (in
> - * smp_send_stop()) while still having some valuable data in the console
> - * buffer. Try to acquire the lock then release it regardless of the
> - * result. The release will also print the buffers out. Locks debug
> - * should be disabled to avoid reporting bad unlock balance when
> - * panic() is not being callled from OOPS.
> - */
> - debug_locks_off();
> - console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> + if (panic_notifier_info_once(buf)) {
> + panic_print_sys_info(false);
> + kmsg_dump(KMSG_DUMP_PANIC);
> + }
>
> - panic_print_sys_info(true);
> + panic_notifier_pre_reboot_once(buf);
>
> + console_flushing();
> if (!panic_blink)
> panic_blink = no_blink;
>
> @@ -363,8 +480,7 @@ void panic(const char *fmt, ...)
> emergency_restart();
> }
>
> - atomic_notifier_call_chain(&panic_post_reboot_list,
> - PANIC_NOTIFIER, buf);
> + panic_notifier_post_reboot_once(buf);
>
> pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
>
> @@ -383,6 +499,15 @@ void panic(const char *fmt, ...)
>
> EXPORT_SYMBOL(panic);
>
> +/*
> + * Helper used in the kexec code, to validate if any
> + * panic notifier is set to execute early, before kdump.
> + */
> +inline bool panic_notifiers_before_kdump(void)
> +{
> + return panic_notifiers_level || crash_kexec_post_notifiers;
> +}
> +
> /*
> * TAINT_FORCED_RMMOD could be a per-module flag but the module
> * is being removed anyway.
> @@ -692,6 +817,9 @@ core_param(panic, panic_timeout, int, 0644);
> core_param(panic_print, panic_print, ulong, 0644);
> core_param(pause_on_oops, pause_on_oops, int, 0644);
> core_param(panic_on_warn, panic_on_warn, int, 0644);
> +core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644);
> +
> +/* DEPRECATED in favor of panic_notifiers_level */
> core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644);
>
> static int __init oops_setup(char *s)
> diff --git a/tools/testing/selftests/pstore/pstore_crash_test
> b/tools/testing/selftests/pstore/pstore_crash_test
> index 2a329bbb4aca..1e60ce4501aa 100755
> --- a/tools/testing/selftests/pstore/pstore_crash_test
> +++ b/tools/testing/selftests/pstore/pstore_crash_test
> @@ -25,6 +25,7 @@ touch $REBOOT_FLAG
> sync
>
> # cause crash
> -# Note: If you use kdump and want to see kmesg-* files after reboot, you should
> -# specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline.
> +# Note: If you use kdump and want to see kmsg-* files after reboot, you should
> +# be sure that the parameter "panic_notifiers_level" is more than '2' (the
> +# default value for this parameter is '2') in the first kernel's cmdline.
> echo c > /proc/sysrq-trigger
> --
> 2.36.0

2022-05-03 01:09:50

by Guilherme G. Piccoli

[permalink] [raw]
Subject: Re: [PATCH 24/30] panic: Refactor the panic path

On 27/04/2022 21:28, Randy Dunlap wrote:
>
>
> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>> + crash_kexec_post_notifiers
>> + This was DEPRECATED - users should always prefer the
>
> This is DEPRECATED - users should always prefer the
>
>> + parameter "panic_notifiers_level" - check its entry
>> + in this documentation for details on how it works.
>> + Setting this parameter is exactly the same as setting
>> + "panic_notifiers_level=4".
>

Thanks Randy, for your suggestion - but I confess I couldn't understand
it properly. It's related to spaces/tabs, right? What you suggest me to
change in this formatting? Just by looking the email I can't parse.

Cheers,


Guilherme

2022-05-03 22:52:37

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 24/30] panic: Refactor the panic path

From: Guilherme G. Piccoli <[email protected]> Sent: Friday, April 29, 2022 1:38 PM
>
> On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> > From: Guilherme G. Piccoli <[email protected]> Sent: Wednesday, April 27, 2022
> 3:49 PM
> >> [...]
> >> + panic_notifiers_level=
> >> + [KNL] Set the panic notifiers execution order.
> >> + Format: <unsigned int>
> >> + We currently have 4 lists of panic notifiers; based
> >> + on the functionality and risk (for panic success) the
> >> + callbacks are added in a given list. The lists are:
> >> + - hypervisor/FW notification list (low risk);
> >> + - informational list (low/medium risk);
> >> + - pre_reboot list (higher risk);
> >> + - post_reboot list (only run late in panic and after
> >> + kdump, not configurable for now).
> >> + This parameter defines the ordering of the first 3
> >> + lists with regards to kdump; the levels determine
> >> + which set of notifiers execute before kdump. The
> >> + accepted levels are:
> >> + 0: kdump is the first thing to run, NO list is
> >> + executed before kdump.
> >> + 1: only the hypervisor list is executed before kdump.
> >> + 2 (default level): the hypervisor list and (*if*
> >> + there's any kmsg_dumper defined) the informational
> >> + list are executed before kdump.
> >> + 3: both the hypervisor and the informational lists
> >> + (always) execute before kdump.
> >
> > I'm not clear on why level 2 exists. What is the scenario where
> > execution of the info list before kdump should be conditional on the
> > existence of a kmsg_dumper? Maybe the scenario is described
> > somewhere in the patch set and I just missed it.
> >
>
> Hi Michael, thanks for your review/consideration. So, this idea started
> kind of some time ago. It all started with a need of exposing more
> information on kernel log *before* kdump and *before* pstore -
> specifically, we're talking about panic_print. But this cause some
> reactions, Baoquan was very concerned with that [0]. Soon after, I've
> proposed a panic notifiers filter (orthogonal) approach, to which Petr
> suggested instead doing a major refactor [1] - it finally is alive in
> the form of this series.
>
> The theory behind the level 2 is to allow a scenario of kdump with the
> minimum amount of notifiers - what is the point in printing more
> information if the user doesn't care, since it's going to kdump? Now, if
> there is a kmsg dumper, it means that there is likely some interest in
> collecting information, and that might as well be required before the
> potential kdump (which is my case, hence the proposal on [0]).
>
> Instead of forcing one of the two behaviors (level 1 or level 3), we
> have a middle-term/compromise: if there's interest in collecting such
> data (in the form of a kmsg dumper), we then execute the informational
> notifiers before kdump. If not, why to increase (even slightly) the risk
> for kdump?
>
> I'm OK in removing the level 2 if people prefer, but I don't feel it's a
> burden, quite opposite - seems a good way to accommodate the somewhat
> antagonistic ideas (jump to kdump ASAP vs collecting more info in the
> panicked kernel log).
>
> [0] https://lore.kernel.org/lkml/[email protected]/
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>

To me, it's a weak correlation between having a kmsg dumper, and
wanting or not wanting the info level output to come before kdump.
Hyper-V is one of only a few places that register a kmsg dumper, so most
Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
the info level output after kdump. It seems like anyone who cared strongly
about the info level output would set the panic_notifier_level to 1 or to 3
so that the result is more deterministic. But that's just my opinion, and
it's probably an opinion that is not as well informed on the topic as some
others in the discussion. So keeping things as in your patch set is not a
show-stopper for me.

However, I would request a clarification in the documentation. The
panic_notifier_level affects not only the hypervisor, informational,
and pre_reboot lists, but it also affects panic_print_sys_info() and
kmsg_dump(). Specifically, at level 1, panic_print_sys_info() and
kmsg_dump() will not be run before kdump. At level 3, they will
always be run before kdump. Your documentation above mentions
"informational lists" (plural), which I take to vaguely include
kmsg_dump() and panic_print_sys_info(), but being explicit about
the effect would be better.

Michael

>
> >[...]
> >> + * Based on the level configured (smaller than 4), we clear the
> >> + * proper bits in "panic_notifiers_bits". Notice that this bitfield
> >> + * is initialized with all notifiers set.
> >> + */
> >> + switch (panic_notifiers_level) {
> >> + case 3:
> >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> + break;
> >> + case 2:
> >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> +
> >> + if (!kmsg_has_dumpers())
> >> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> >> + break;
> >> + case 1:
> >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> >> + break;
> >> + case 0:
> >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> >> + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> >> + break;
> >> + }
> >
> > I think the above switch statement could be done as follows:
> >
> > if (panic_notifiers_level <= 3)
> > clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> > if (panic_notifiers_level <= 2)
> > if (!kmsg_has_dumpers())
> > clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> > if (panic_notifiers_level <=1)
> > clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> > if (panic_notifiers_level == 0)
> > clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> >
> > That's about half the lines of code. It's somewhat a matter of style,
> > so treat this as just a suggestion to consider. I just end up looking
> > for a better solution when I see the same line of code repeated
> > 3 or 4 times!
> >
>
> It's a good idea - I liked your code. The switch seems more
> natural/explicit for me, even duplicating some lines, but in case more
> people prefer your way, I can definitely change the code - thanks for
> the suggestion.
> Cheers,
>
>
> Guilherme

2022-05-04 03:26:36

by Guilherme G. Piccoli

[permalink] [raw]
Subject: Re: [PATCH 24/30] panic: Refactor the panic path

On 03/05/2022 14:31, Michael Kelley (LINUX) wrote:
> [...]
>
> To me, it's a weak correlation between having a kmsg dumper, and
> wanting or not wanting the info level output to come before kdump.
> Hyper-V is one of only a few places that register a kmsg dumper, so most
> Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
> the info level output after kdump. It seems like anyone who cared strongly
> about the info level output would set the panic_notifier_level to 1 or to 3
> so that the result is more deterministic. But that's just my opinion, and
> it's probably an opinion that is not as well informed on the topic as some
> others in the discussion. So keeping things as in your patch set is not a
> show-stopper for me.
>
> However, I would request a clarification in the documentation. The
> panic_notifier_level affects not only the hypervisor, informational,
> and pre_reboot lists, but it also affects panic_print_sys_info() and
> kmsg_dump(). Specifically, at level 1, panic_print_sys_info() and
> kmsg_dump() will not be run before kdump. At level 3, they will
> always be run before kdump. Your documentation above mentions
> "informational lists" (plural), which I take to vaguely include
> kmsg_dump() and panic_print_sys_info(), but being explicit about
> the effect would be better.
>
> Michael

Thanks again Michael, to express your points and concerns - great idea
of documentation improvement here, I'll do that for V2, for sure.

The idea of "defaulting" to skip the info list on kdump (if no
kmsg_dump() is set) is again a mechanism that aims at accommodating all
users and concerns of antagonistic goals, kdump vs notifier lists.

Before this patch set, by default no notifier executed before kdump. So,
the "pendulum" was strongly on kdump side, and clearly this was a
sub-optimal decision - proof of that is that both Hyper-V / PowerPC code
forcibly set the "crash_kexec_post_notifiers". The goal here is to have
a more lightweight list that by default runs before kdump, a secondary
list that only runs before kdump if there's usage for that (either user
sets that or kmsg_dumper set is considered a valid user), and the
remaining notifiers run by default only after kdump, all of that very
customizable through the levels idea.

Now, one thing we could do to improve consistency for the hyper-v case:
having a kmsg_dump_once() helper, and *for Hyper-V only*, call it on the
hypervisor list, within the info notifier (that would be moved to
hypervisor list, ofc).
Let's wait for more feedback on that, just throwing some ideas in order
we can have everyone happy with the end-result!

Cheers,


Guilherme