From: William Roche <[email protected]>
Force push panic messages to the console as panic() can be called from NMI
interrupt handler functions where printed messages can't always reach the
console without an explicit push provided by printk_safe_flush_on_panic()
and console_flush_on_panic().
This is the case with the MCE handler, which can lead to a system panic
reporting the fatal MCE root cause; that information must reach the console.
Signed-off-by: William Roche <[email protected]>
---
Notes:
While testing MCE injection and kernel reaction, we discovered a bug
in the way the kernel provides the panic reason information: When dealing
with fatal MCE, the machine (physical or virtual) can reboot without
leaving any message on the console.
This behavior can be reproduced on Intel with the mce-inject tool
with a simple:
# modprobe mce-inject
# mce-inject test/uncorrected
The investigations showed that the MCE panic can be totally message-less
or can give a small set of messages. This behavior depends on the use of the
crash_kexec mechanism (using the "crashkernel" parameter). Not using this
parameter, we get partial [Hardware Error] information on panic, but some
important notifications can be missing. And when using it, a fatal MCE can
panic the system without leaving any information.
. Without "crashkernel", a Fatal MCE injection shows:
[ 212.153928] mce: Machine check injector initialized
[ 236.730682] mce: Triggering MCE exception on CPU 0
[ 236.731304] Disabling lock debugging due to kernel taint
[ 236.731947] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 1: b000000000000000
[ 236.731948] mce: [Hardware Error]: TSC 78418fb4a83f
[ 236.731949] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1605312952 SOCKET 0 APIC 0 microcode 1
[ 236.731949] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 236.731950] mce: [Hardware Error]: Machine check: MCIP not set in MCA handler
[ 236.731950] Kernel panic - not syncing: Fatal machine check
[ 236.732047] Kernel Offset: disabled
The system hangs for 30 seconds without any additional message, and finally
reboots.
. With the use of "crashkernel", a Fatal MCE injection shows only the
injection message
[ 80.811708] mce: Machine check injector initialized
[ 92.298755] mce: Triggering MCE exception on CPU 0
[ 92.299362] Disabling lock debugging due to kernel taint
No other message is displayed and the system reboots immediately.
The reason behind this defective behavior is that the do_machine_check()
MCE handler can call mce_panic(), which ends up in panic() trying to print
messages on the console. So we end up printing messages directly from this
interrupt handler, which can require the help of additional function calls
like printk_safe_flush_on_panic() and console_flush_on_panic().
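For reference, the call path is roughly the following (a paraphrased sketch,
not a verbatim copy of arch/x86/kernel/cpu/mce/core.c):

	/* #MC runs in an NMI-like context */
	noinstr void do_machine_check(struct pt_regs *regs)
	{
		...
		/* on a fatal error: */
		mce_panic("Fatal machine check", final, msg);
		...
	}

	static void mce_panic(const char *msg, struct mce *final, char *exp)
	{
		/*
		 * The pr_emerg() of the [Hardware Error] lines lands in the
		 * per-CPU printk_safe buffers here, not on the console.
		 */
		...
		panic(msg);
	}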
With the suggested fix here, the above output turns into the expected:
. Without crashkernel:
[ 92.025010] mce: Machine check injector initialized
[ 113.006798] mce: Triggering MCE exception on CPU 0
[ 113.007391] Disabling lock debugging due to kernel taint
[ 113.008061] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 1: b000000000000000
[ 113.008062] mce: [Hardware Error]: TSC 2ec380a0f187
[ 113.008063] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1606223541 SOCKET 0 APIC 0 microcode 1
[ 113.008063] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 113.008064] mce: [Hardware Error]: Machine check: MCIP not set in MCA handler
[ 113.008064] Kernel panic - not syncing: Fatal machine check
[ 113.013010] Kernel Offset: disabled
[ 113.013426] Rebooting in 30 seconds..
Note that we now get the important "Rebooting in 30 seconds" information.
. With crashkernel:
[ 75.770143] mce: Machine check injector initialized
[ 100.923650] mce: Triggering MCE exception on CPU 0
[ 100.924415] Disabling lock debugging due to kernel taint
[ 100.925094] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 1: b000000000000000
[ 100.925095] mce: [Hardware Error]: TSC 2f3aa07b03de
[ 100.925096] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1606223692 SOCKET 0 APIC 0 microcode 1
[ 100.925097] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 100.925098] mce: [Hardware Error]: Machine check: MCIP not set in MCA handler
[ 100.925100] Kernel panic - not syncing: Fatal machine check
Here the reboot is immediate but the MCE error message reaches the
console.
Could you please review this fix?
Thanks in advance for any feedback you may have.
William.
kernel/panic.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/panic.c b/kernel/panic.c
index 332736a..eb90cc0 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -166,6 +166,15 @@ static void panic_print_sys_info(void)
ftrace_dump(DUMP_ALL);
}
+/*
+ * Force flush messages to the console.
+ */
+static void panic_flush_to_console(void)
+{
+ printk_safe_flush_on_panic();
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+}
+
/**
* panic - halt the system
* @fmt: The text string to print
@@ -247,7 +256,7 @@ void panic(const char *fmt, ...)
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
if (!_crash_kexec_post_notifiers) {
- printk_safe_flush_on_panic();
+ panic_flush_to_console();
__crash_kexec(NULL);
/*
@@ -271,9 +280,8 @@ void panic(const char *fmt, ...)
*/
atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
- /* Call flush even twice. It tries harder with a single online CPU */
- printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
+ panic_flush_to_console();
/*
* If you doubt kdump always works fine in any situation,
@@ -298,7 +306,7 @@ void panic(const char *fmt, ...)
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
- * panic() is not being callled from OOPS.
+ * panic() is not being called from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
@@ -314,6 +322,7 @@ void panic(const char *fmt, ...)
* We can't use the "normal" timers since we just panicked.
*/
pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
+ panic_flush_to_console();
for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
touch_nmi_watchdog();
@@ -347,6 +356,7 @@ void panic(const char *fmt, ...)
disabled_wait();
#endif
pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
+ panic_flush_to_console();
/* Do not scroll important messages printed above */
suppress_printk = 1;
--
1.8.3.1
On (21/01/04 16:15), William Roche wrote:
[..]
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 332736a..eb90cc0 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -166,6 +166,15 @@ static void panic_print_sys_info(void)
> ftrace_dump(DUMP_ALL);
> }
>
> +/*
> + * Force flush messages to the console.
> + */
> +static void panic_flush_to_console(void)
> +{
> + printk_safe_flush_on_panic();
> + console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +}
You must debug_locks_off() as the very first step here. But see below.
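I.e. something along these lines (a sketch only, keeping the helper as
proposed):

	static void panic_flush_to_console(void)
	{
		/*
		 * Disable lock debugging first to avoid bogus "bad unlock
		 * balance" reports when trylocking console_sem below.
		 */
		debug_locks_off();
		printk_safe_flush_on_panic();
		console_flush_on_panic(CONSOLE_FLUSH_PENDING);
	}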
> /**
> * panic - halt the system
> * @fmt: The text string to print
> @@ -247,7 +256,7 @@ void panic(const char *fmt, ...)
> * Bypass the panic_cpu check and call __crash_kexec directly.
> */
> if (!_crash_kexec_post_notifiers) {
> - printk_safe_flush_on_panic();
> + panic_flush_to_console();
> __crash_kexec(NULL);
It's dangerous to call console_flush_on_panic() before we stop secondary
CPUs. console_flush_on_panic() ignores the state of console_sem, so if any
of the secondary CPUs is currently printing something on the consoles you'll
get corrupted messages - we use a `static char` buffer for the messages we
push to consoles.
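(For reference, the shared buffer in question is the static one that
console_unlock() uses to format each record before writing it out, roughly
this in kernel/printk/printk.c of that era:)

	void console_unlock(void)
	{
		static char ext_text[CONSOLE_EXT_LOG_MAX];
		static char text[LOG_LINE_MAX + PREFIX_MAX];
		...
		/*
		 * Each record is formatted into the static buffers above and
		 * then pushed to the registered consoles; two CPUs doing this
		 * concurrently would overwrite each other's text.
		 */
	}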
Another issue is that with this panic_flush_to_console() call, console_sem
can end up being locked once (by a secondary CPU) and unlocked twice (by the
secondary and the panic CPUs) [*]
> /*
> @@ -271,9 +280,8 @@ void panic(const char *fmt, ...)
> */
> atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
>
> - /* Call flush even twice. It tries harder with a single online CPU */
> - printk_safe_flush_on_panic();
> kmsg_dump(KMSG_DUMP_PANIC);
> + panic_flush_to_console();
Why?
> /*
> * If you doubt kdump always works fine in any situation,
> @@ -298,7 +306,7 @@ void panic(const char *fmt, ...)
> * buffer. Try to acquire the lock then release it regardless of the
> * result. The release will also print the buffers out. Locks debug
> * should be disabled to avoid reporting bad unlock balance when
> - * panic() is not being callled from OOPS.
> + * panic() is not being called from OOPS.
> */
> debug_locks_off();
> console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> @@ -314,6 +322,7 @@ void panic(const char *fmt, ...)
> * We can't use the "normal" timers since we just panicked.
> */
> pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
> + panic_flush_to_console();
[*] So this
> for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
> touch_nmi_watchdog();
> @@ -347,6 +356,7 @@ void panic(const char *fmt, ...)
> disabled_wait();
> #endif
> pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
> + panic_flush_to_console();
[*] and this are both very interesting points.
Those extra console_flush_on_panic() calls indicate that normal printk()
cannot succeed in locking the console_sem so it doesn't try to
console_unlock(): either because we killed the secondary CPU while it
was holding the lock, or because we locked it once and unlocked it twice.
I think it would make sense to just re-init console_sem, so that normal
printk()-s will have a chance to grab the console_sem lock and then we don't
need any extra panic_flush_to_console() calls. Maybe we can do something
like this
---
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ffdd0dc7ec6d..4bd2e29c8cc0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2638,6 +2638,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
* context and we don't want to get preempted while flushing,
* ensure may_schedule is cleared.
*/
+ sema_init(&console_sem, 1);
console_trylock();
console_may_schedule = 0;
On 06/01/2021 05:35, Sergey Senozhatsky wrote:
> On (21/01/04 16:15), William Roche wrote:
> [..]
>> diff --git a/kernel/panic.c b/kernel/panic.c
>> index 332736a..eb90cc0 100644
>> --- a/kernel/panic.c
>> +++ b/kernel/panic.c
>> @@ -166,6 +166,15 @@ static void panic_print_sys_info(void)
>> ftrace_dump(DUMP_ALL);
>> }
>>
>> +/*
>> + * Force flush messages to the console.
>> + */
>> +static void panic_flush_to_console(void)
>> +{
>> + printk_safe_flush_on_panic();
>> + console_flush_on_panic(CONSOLE_FLUSH_PENDING);
>> +}
>
> You must debug_locks_off() as the very first step here. But see below.
Thanks, I had missed this point. I will call debug_locks_off() before
the first call to console_flush_on_panic(CONSOLE_FLUSH_PENDING).
>
>> /**
>> * panic - halt the system
>> * @fmt: The text string to print
>> @@ -247,7 +256,7 @@ void panic(const char *fmt, ...)
>> * Bypass the panic_cpu check and call __crash_kexec directly.
>> */
>> if (!_crash_kexec_post_notifiers) {
>> - printk_safe_flush_on_panic();
>> + panic_flush_to_console();
>> __crash_kexec(NULL);
>
> It's dangerous to call console_flush_on_panic() before we stop secondary
> CPUs. console_flush_on_panic() ignores the state console_sem, so if any
> of the secondary is currently printing something on the consoles you'll
> get corrupted messages - we use `static char buffer` for messages we
> push to consoles.
I understand that there is a risk here, and the __crash_kexec() call is
not supposed to return when a crash kernel is loaded. This is our last
chance here to push the pending messages to the console.
> Another issue is that with this panic_flush_to_console() call console_sem
> can end up being locked once (by secondary CPU) and unlocked twice (by
> second and panic CPUs) [*]
I agree, this is a risk. But what we see is that even the message
previously submitted by the pr_emerg() call of the panic thread is
currently discarded when __crash_kexec(NULL) is called.
In the case described here, no other message is printed out by a
secondary CPU, but I understand that it could happen, and we could avoid
calling panic_flush_to_console() when no crash kernel is loaded, so that
we wait for the SMP stop and leave the current steps to run untouched
until the last messages are printed.
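Something like the following, perhaps (a sketch only; kexec_crash_loaded()
is the existing helper telling whether a crash kernel is loaded, and this is
what the diff further below implements):

	if (!_crash_kexec_post_notifiers) {
		if (kexec_crash_loaded()) {
			/* __crash_kexec() will not return: last chance to flush */
			debug_locks_off();
			panic_flush_to_console();
		}
		__crash_kexec(NULL);
		...
	}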
>
>> /*
>> @@ -271,9 +280,8 @@ void panic(const char *fmt, ...)
>> */
>> atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
>>
>> - /* Call flush even twice. It tries harder with a single online CPU */
>> - printk_safe_flush_on_panic();
>> kmsg_dump(KMSG_DUMP_PANIC);
>> + panic_flush_to_console();
>
> Why?
Here, we are supposed to push the pending messages, as I could verify
that they don't reach the console until the console_unblank(). So I
wanted to push them with panic_flush_to_console() before the possible
upcoming __crash_kexec() call.
Are you suggesting that I should leave the printk_safe_flush_on_panic()
call before the kmsg_dump(KMSG_DUMP_PANIC) call?
But let's leave that untouched too.
>
>> /*
>> * If you doubt kdump always works fine in any situation,
>> @@ -298,7 +306,7 @@ void panic(const char *fmt, ...)
>> * buffer. Try to acquire the lock then release it regardless of the
>> * result. The release will also print the buffers out. Locks debug
>> * should be disabled to avoid reporting bad unlock balance when
>> - * panic() is not being callled from OOPS.
>> + * panic() is not being called from OOPS.
>> */
>> debug_locks_off();
>> console_flush_on_panic(CONSOLE_FLUSH_PENDING);
>> @@ -314,6 +322,7 @@ void panic(const char *fmt, ...)
>> * We can't use the "normal" timers since we just panicked.
>> */
>> pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
>> + panic_flush_to_console();
>
> [*] So this
>
>> for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
>> touch_nmi_watchdog();
>> @@ -347,6 +356,7 @@ void panic(const char *fmt, ...)
>> disabled_wait();
>> #endif
>> pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
>> + panic_flush_to_console();
>
> [*] and this are both very interesting points.
>
> Those extra console_flush_on_panic() calls indicate that normal printk()
> cannot succeed in locking the console_sem so it doesn't try to
> console_unlock(): either because we killed the secondary CPU while it
> was holding the lock, or because we locked it once and unlocked it twice.
>
> I think it would make sense to just re-init console_sem, so that normal
> printk()-s will have a chance to grab the console_sem lock and then we don't
> need any extra panic_flush_to_console() calls. Maybe we can do something
> like this
>
> ---
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ffdd0dc7ec6d..4bd2e29c8cc0 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2638,6 +2638,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
> * context and we don't want to get preempted while flushing,
> * ensure may_schedule is cleared.
> */
> + sema_init(&console_sem, 1);
> console_trylock();
> console_may_schedule = 0;
>
>
I've tried your suggestion, and even with this sema_init() call, the
pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
message is not printed to the console without a subsequent call to
panic_flush_to_console().
See below a code modification that would illustrate a possible change
including the above comments. Would something like that be more
appropriate?
This is the end of the running kernel, and panic() is called from NMI
context where printk() messages have to be pushed explicitly. The mechanism
I use seems to work in this case, but maybe there is another way to achieve
this goal?
Thanks again for your help,
William.
--- kernel/panic.c 2021-01-07 18:20:01.992582942 -0500
***************
*** 166,171 ****
--- 166,180 ----
ftrace_dump(DUMP_ALL);
}
+ /*
+ * Force flush messages to the console.
+ */
+ static void panic_flush_to_console(void)
+ {
+ printk_safe_flush_on_panic();
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+ }
+
/**
* panic - halt the system
* @fmt: The text string to print
***************
*** 247,253 ****
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
if (!_crash_kexec_post_notifiers) {
! printk_safe_flush_on_panic();
__crash_kexec(NULL);
/*
--- 256,265 ----
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
if (!_crash_kexec_post_notifiers) {
! if (kexec_crash_loaded()) {
! debug_locks_off();
! panic_flush_to_console();
! }
__crash_kexec(NULL);
/*
***************
*** 284,291 ****
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
! if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);
#ifdef CONFIG_VT
unblank_screen();
--- 296,308 ----
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
! if (_crash_kexec_post_notifiers) {
! if (kexec_crash_loaded()) {
! debug_locks_off();
! panic_flush_to_console();
! }
__crash_kexec(NULL);
+ }
#ifdef CONFIG_VT
unblank_screen();
***************
*** 298,304 ****
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
! * panic() is not being callled from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
--- 315,321 ----
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
! * panic() is not being called from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
***************
*** 314,319 ****
--- 331,337 ----
* We can't use the "normal" timers since we just panicked.
*/
pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
+ panic_flush_to_console();
for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
touch_nmi_watchdog();
***************
*** 347,352 ****
--- 365,371 ----
disabled_wait();
#endif
pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
+ panic_flush_to_console();
/* Do not scroll important messages printed above */
suppress_printk = 1;
On Mon 2021-01-04 16:15:55, William Roche wrote:
> From: William Roche <[email protected]>
>
> Force push panic messages to the console as panic() can be called from NMI
> interrupt handler functions where printed messages can't always reach the
> console without an explicit push provided by printk_safe_flush_on_panic()
> and console_flush_on_panic().
> This is the case with the MCE handler, which can lead to a system panic
> reporting the fatal MCE root cause; that information must reach the console.
>
> Signed-off-by: William Roche <[email protected]>
> ---
>
> Notes:
> While testing MCE injection and kernel reaction, we discovered a bug
> in the way the kernel provides the panic reason information: When dealing
> with fatal MCE, the machine (physical or virtual) can reboot without
> leaving any message on the console.
>
> This behavior can be reproduced on Intel with the mce-inject tool
> with a simple:
> # modprobe mce-inject
> # mce-inject test/uncorrected
>
> The investigations showed that the MCE panic can be totally message-less
> or can give a small set of messages. This behavior depends on the use of the
> crash_kexec mechanism (using the "crashkernel" parameter). Not using this
> parameter, we get partial [Hardware Error] information on panic, but some
> important notifications can be missing. And when using it, a fatal MCE can
> panic the system without leaving any information.
>
> . Without "crashkernel", a Fatal MCE injection shows:
>
> [ 212.153928] mce: Machine check injector initialized
> [ 236.730682] mce: Triggering MCE exception on CPU 0
> [ 236.731304] Disabling lock debugging due to kernel taint
> [ 236.731947] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 1: b000000000000000
> [ 236.731948] mce: [Hardware Error]: TSC 78418fb4a83f
> [ 236.731949] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1605312952 SOCKET 0 APIC 0 microcode 1
> [ 236.731949] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 236.731950] mce: [Hardware Error]: Machine check: MCIP not set in MCA handler
> [ 236.731950] Kernel panic - not syncing: Fatal machine check
> [ 236.732047] Kernel Offset: disabled
>
> The system hangs for 30 seconds without any additional message, and finally
> reboots.
>
> . With the use of "crashkernel", a Fatal MCE injection shows only the
> injection message
>
> [ 80.811708] mce: Machine check injector initialized
> [ 92.298755] mce: Triggering MCE exception on CPU 0
> [ 92.299362] Disabling lock debugging due to kernel taint
>
> No other message is displayed and the system reboots immediately.
But you could find the messages in the crashdump, couldn't you?
It works this way by "design". The idea is the following:
Taking any locks from NMI context might lead to a deadlock.
Re-initializing the locks might lead to deadlock as well
because of possible double unlock. Ignoring the locks might
lead to problems as well.
A compromise is needed:
1. crashdump disabled
console_flush_on_panic() is called. It tries hard to get the
messages on the console because it is the only chance.
It does console_trylock(). It is called after
bust_spinlocks(1) so that even the console-specific locks
are taken only with trylock, see oops_in_progress.
BTW: There are people that do not like this because there
is still a risk of a deadlock. Some code paths
take locks without checking oops_in_progress.
For these people, more reliable reboot is more
important because they want to have the system
back ASAP (cloud people).
2. crashdump enabled:
Only printk_safe_flush_on_panic() is called. It does the best effort
to flush messages from the per-CPU buffers into the main log buffer
so that they can be found easily in the core.
It is pretty reliable. It should not be needed at all once the new
lockless ringbuffer gets fully integrated.
It does not try to flush the messages to the console. Getting
the crash dump is more important than risking a deadlock with
consoles.
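To put the two cases side by side, the relevant part of panic() currently
looks roughly like this (paraphrased from kernel/panic.c, not verbatim):

	if (!_crash_kexec_post_notifiers) {
		/* crashdump path: only move per-CPU messages to the main buffer */
		printk_safe_flush_on_panic();
		__crash_kexec(NULL);	/* does not return if a crash kernel is loaded */
		smp_send_stop();
	} else {
		crash_smp_send_stop();
	}
	...
	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
	/* Call flush even twice. It tries harder with a single online CPU */
	printk_safe_flush_on_panic();
	kmsg_dump(KMSG_DUMP_PANIC);
	if (_crash_kexec_post_notifiers)
		__crash_kexec(NULL);
	...
	debug_locks_off();
	/* no crashdump (or it did not fire): try hard to reach the consoles */
	console_flush_on_panic(CONSOLE_FLUSH_PENDING);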
Best Regards,
Petr
On Fri 2021-01-08 01:26:06, William Roche wrote:
> On 06/01/2021 05:35, Sergey Senozhatsky wrote:
> > On (21/01/04 16:15), William Roche wrote:
> >> @@ -271,9 +280,8 @@ void panic(const char *fmt, ...)
> >> */
> >> atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
> >>
> >> - /* Call flush even twice. It tries harder with a single online CPU */
> >> - printk_safe_flush_on_panic();
> >> kmsg_dump(KMSG_DUMP_PANIC);
> >> + panic_flush_to_console();
This is wrong. kmsg_dump() flushes the messages into the registered
dumpers, e.g. pstore. It handles only messages in the main
log buffer.
printk_safe_flush_on_panic() must be called before. It moves any
pending messages from the temporary per-CPU buffers into the main
log buffer so that they are visible for the dumpers.
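In code terms, the ordering that matters is simply (sketch):

	/*
	 * First move any messages stuck in the per-CPU printk_safe buffers
	 * into the main log buffer so that the registered dumpers (pstore,
	 * ...) can see them, then run the dumpers.
	 */
	printk_safe_flush_on_panic();
	kmsg_dump(KMSG_DUMP_PANIC);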
> > Why?
>
> Here, we are supposed to push the pending messages, as I could verify
> that they don't reach the console until the console_unblank(). So I
> wanted to push them with panic_flush_to_console() before the possible
> upcoming __crash_kexec() call.
I do not see how the ordering of printk_safe_flush_on_panic()
and kmsg_dump() would change anything for the upcoming
__crash_kexec() call.
Best Regards,
Petr
Thank you for your clarification, and let me explain what I was
talking about before providing my last code version.
>>>> kmsg_dump(KMSG_DUMP_PANIC);
>>>> + panic_flush_to_console();
>
> This is wrong. kmsg_dump() flushes the messages into the registered
> dumpers, e.g. pstore. It handles only messages in the main
> log buffer.
>
> printk_safe_flush_on_panic() must be called before. It moves any
> pending messages from the temporary per-CPU buffers into the main
> log buffer so that they are visible for the dumpers.
Ok, this is very clear.
>>> Why?
>>
>> Here, we are supposed to push the pending messages, as I could verify
>> that they don't reach the console until the console_unblank(). So I
>> wanted to push them with panic_flush_to_console() before the possible
>> upcoming __crash_kexec() call.
>
> I do not see how the ordering of printk_safe_flush_on_panic()
> and kmsg_dump() would change anything for the upcoming
> __crash_kexec() call.
I was wrong to remove the call to printk_safe_flush_on_panic() before
kmsg_dump(), even though it remained inside panic_flush_to_console().
The idea of the panic_flush_to_console() call here was to push to the
console any messages that could have been submitted (even by a dumper)
before the possible upcoming __crash_kexec() call. Otherwise they could
be discarded (if the crash messages can't be retrieved).
And that's the reason why, in my previous version, I changed this code to
force a call to panic_flush_to_console() before __crash_kexec(), which
seems to be too unsafe as far as I understand it now.
In your previous email you also indicated:
>> . With the use of "crashkernel", a Fatal MCE injection shows only the
>> injection message
>>
>> [ 80.811708] mce: Machine check injector initialized
>> [ 92.298755] mce: Triggering MCE exception on CPU 0
>> [ 92.299362] Disabling lock debugging due to kernel taint
>>
>> No other message is displayed and the system reboots immediately.
>
> But you could find the messages in the crashdump. Aren't you?
Yes, but _only_ with the latest kexec-tools-2.0.21.git version!
[My installed kexec-tools.x86_64 2.0.20-34.0.4.el8_3 doesn't seem to
work, and I guess there is a separate problem that I will investigate
later.]
>
> It works this way by "design". The idea is the following:
>
That's a rather disarming statement :)
I'm sure we can agree that getting the information directly (and of
course safely) from the running kernel console would be a better option,
instead of having a two-step process to get the crash reason (even when
it works), as it adds a risk of not getting this crash reason at all. And I
know for a fact that Cloud people do want to know why a platform goes
away ;)
> Taking any locks from NMI context might lead to a deadlock.
> Re-initializing the locks might lead to deadlock as well
> because of possible double unlock. Ignoring the locks might
> lead to problems either.
>
> A compromise is needed:
>
> 1. crashdump disabled
>
> console_flush_on_panic() is called. It tries hard to get the
> messages on the console because it is the only chance.
>
> It does console_trylock(). It is called after
> bust_spinlocks(1) so that even the console-specific locks
> are taken only with trylock, see oops_in_progress.
>
> BTW: There are people that do not like this because there
> is still a risk of a deadlock. Some code paths
> take locks without checking oops_in_progress.
> For these people, more reliable reboot is more
> important because they want to have the system
> back ASAP (cloud people).
>
>
> 2. crashdump enabled:
>
> Only printk_safe_flush_on_panic() is called. It does the best effort
> to flush messages from the per-CPU buffers into the main log buffer
> so that they can be found easily in the core.
>
> It is pretty reliable. It should not be needed at all once the new
> lockless ringbuffer gets fully integrated,
>
> It does not try to flush the messages to the console. Getting
> the crash dump is more important than risking a deadlock with
> consoles.
Thanks again for the good clarification. I did not realize that calling
console_flush_on_panic(CONSOLE_FLUSH_PENDING) from panic() before our
__crash_kexec() was not safe enough.
So here is a new version of a (minimal) fix trying to follow your
suggestions:
(And you'll probably remove the added calls to
printk_safe_flush_on_panic() when integrating the new lockless
ringbuffer)
What do you think of this simplified possibility?
[If this fix is correct, I'll submit it as a v2 fix, with an updated
comment and Notes]
--- kernel/panic.c 2021-01-11 20:33:25.268047057 -0500
***************
*** 166,171 ****
--- 166,180 ----
ftrace_dump(DUMP_ALL);
}
+ /*
+ * Force flush messages to the console.
+ */
+ static void panic_flush_to_console(void)
+ {
+ printk_safe_flush_on_panic();
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+ }
+
/**
* panic - halt the system
* @fmt: The text string to print
***************
*** 284,291 ****
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
! if (_crash_kexec_post_notifiers)
__crash_kexec(NULL);
#ifdef CONFIG_VT
unblank_screen();
--- 293,302 ----
*
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
! if (_crash_kexec_post_notifiers) {
! printk_safe_flush_on_panic();
__crash_kexec(NULL);
+ }
#ifdef CONFIG_VT
unblank_screen();
***************
*** 298,304 ****
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
! * panic() is not being callled from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
--- 309,315 ----
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
! * panic() is not being called from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
***************
*** 314,319 ****
--- 325,331 ----
* We can't use the "normal" timers since we just panicked.
*/
pr_emerg("Rebooting in %d seconds..\n", panic_timeout);
+ panic_flush_to_console();
for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
touch_nmi_watchdog();
***************
*** 347,352 ****
--- 359,365 ----
disabled_wait();
#endif
pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
+ panic_flush_to_console();
/* Do not scroll important messages printed above */
suppress_printk = 1;