The pr_crap() functions generate idiotic output; use printk().
Broken output:
[ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [perf_fuzzer:2646]
[ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
Fixed output:
[14180.328194] BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:13650]
[ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
Cc: Don Zickus <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/watchdog.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -350,7 +350,7 @@ static void watchdog_overflow_callback(s
if (__this_cpu_read(hard_watchdog_warn) == true)
return;
- pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
+ printk(KERN_EMERG "BUG: NMI detected hard LOCKUP on cpu %d", this_cpu);
print_modules();
print_irqtrace_events(current);
if (regs)
@@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_tim
}
}
- pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
+ printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
smp_processor_id(), duration,
current->comm, task_pid_nr(current));
__this_cpu_write(softlockup_task_ptr_saved, current);
On Fri, Mar 18, 2016 at 04:28:03PM +0100, Peter Zijlstra wrote:
> The pr_crap() functions generate idiotic output; use printk().
>
> Broken output:
>
> [ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [perf_fuzzer:2646]
> [ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
>
> Fixed output:
>
> [14180.328194] BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:13650]
> [ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
Would something like this be a better patch?
Cheers,
Don
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b3ace6e..e3302a0 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -9,7 +9,7 @@
* to those contributors as well.
*/
-#define pr_fmt(fmt) "NMI watchdog: " fmt
+#define pr_fmt(fmt) "Lockup detector: " fmt
#include <linux/mm.h>
#include <linux/cpu.h>
@@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
if (__this_cpu_read(hard_watchdog_warn) == true)
return;
- pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
+ pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
print_modules();
print_irqtrace_events(current);
if (regs)
@@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
}
}
- pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
+ pr_emerg("Detected soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
smp_processor_id(), duration,
current->comm, task_pid_nr(current));
__this_cpu_write(softlockup_task_ptr_saved, current);
>
> Cc: Don Zickus <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/watchdog.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(s
> if (__this_cpu_read(hard_watchdog_warn) == true)
> return;
>
> - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> + printk(KERN_EMERG "BUG: NMI detected hard LOCKUP on cpu %d", this_cpu);
> print_modules();
> print_irqtrace_events(current);
> if (regs)
> @@ -467,7 +467,7 @@ static enum hrtimer_restart watchdog_tim
> }
> }
>
> - pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> + printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
> smp_processor_id(), duration,
> current->comm, task_pid_nr(current));
> __this_cpu_write(softlockup_task_ptr_saved, current);
>
>
On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> On Fri, Mar 18, 2016 at 04:28:03PM +0100, Peter Zijlstra wrote:
> > The pr_crap() functions generate idiotic output; use printk().
> >
> > Broken output:
> >
> > [ 3538.718135] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [perf_fuzzer:2646]
> > [ 3383.233583] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
> >
> > Fixed output:
> >
> > [14180.328194] BUG: soft lockup - CPU#2 stuck for 23s! [perf_fuzzer:13650]
> > [ 1064.914925] BUG: NMI Watchdog detected hard LOCKUP on cpu 11
>
> Would something like this be a better patch?
> -#define pr_fmt(fmt) "NMI watchdog: " fmt
> +#define pr_fmt(fmt) "Lockup detector: " fmt
> - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> + pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
That still results in something like:
Lockup detector: Detected hard LOCKUP on cpu 22
Which is still weird, doesn't mention NMI nor BUG (and when the NMI
triggers it really rather frequently is a dead system).
On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > Would something like this be a better patch?
>
> > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > +#define pr_fmt(fmt) "Lockup detector: " fmt
You might want to audit all the other pr_ nonsense in that file; a lot
of the msgs don't particularly make sense when combined with the prefix.
Some are for the NMI thing, others for the soft lockup one.
I would just chuck pr_crap() altogether and use sensible text.
On Fri, 2016-03-18 at 17:48 +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> >
> > On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > >
> > > Would something like this be a better patch?
> > >
> > > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > > +#define pr_fmt(fmt) "Lockup detector: " fmt
> You might want to audit all the other pr_ nonsense in that file; a
> lot
> of the msgs don't particularly make sense when combined with the
> prefix.
>
> Some are for the NMI thing, others for the soft lockup one.
>
> I would just chuck pr_crap() altogether and use sensible text.
or change to #define pr_fmt(fmt) fmt
and output whatever unprefixed content you want.
btw: the first printk/pr_emerg should add a newline
pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
sb:
pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", this_cpu);
On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> +++ b/kernel/watchdog.c
> @@ -9,7 +9,7 @@
> * to those contributors as well.
> */
>
> -#define pr_fmt(fmt) "NMI watchdog: " fmt
> +#define pr_fmt(fmt) "Lockup detector: " fmt
>
> #include <linux/mm.h>
> #include <linux/cpu.h>
> @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> if (__this_cpu_read(hard_watchdog_warn) == true)
> return;
>
> - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> + pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
> print_modules();
> print_irqtrace_events(current);
> if (regs)
It was Jiri who made this mess by replacing WARN(), which has a very
distinct format, with this custom stuff.
I think we should go back to the WARN() thing.
On Fri, Mar 18, 2016 at 05:48:58PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 05:44:41PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 18, 2016 at 12:37:48PM -0400, Don Zickus wrote:
> > > Would something like this be a better patch?
> >
> > > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > > +#define pr_fmt(fmt) "Lockup detector: " fmt
>
> You might want to audit all the other pr_ nonsense in that file; a lot
> of the msgs don't particularly make sense when combined with the prefix.
>
> Some are for the NMI thing, others for the soft lockup one.
>
> I would just chuck pr_crap() altogether and use sensible text.
Hehe. I'll look into that next week. :-)
Cheers,
Don
On Fri, 18 Mar 2016, Peter Zijlstra wrote:
> > -#define pr_fmt(fmt) "NMI watchdog: " fmt
> > +#define pr_fmt(fmt) "Lockup detector: " fmt
> >
> > #include <linux/mm.h>
> > #include <linux/cpu.h>
> > @@ -350,7 +350,7 @@ static void watchdog_overflow_callback(struct perf_event *event,
> > if (__this_cpu_read(hard_watchdog_warn) == true)
> > return;
> >
> > - pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > + pr_emerg("Detected hard LOCKUP on cpu %d", this_cpu);
> > print_modules();
> > print_irqtrace_events(current);
> > if (regs)
>
> It was Jiri who made this mess by replacing WARN(), which has a very
> distinct format, with this custom stuff.
Right, this was in 55537871e ("kernel/watchdog.c: perform all-CPU
backtrace in case of hard lockup").
To be really honest, I don't really have 100% reliable explanation why I
did it this way; I *think* it was because otherwise, in case WARN is
preserved, we'll either
- have different output format for current (warning) CPU and all the
others (trigger_allbutself_cpu_backtrace())
or
- have duplicated backtrace for current CPU, one coming from WARN and one
coming from the all-cpu backtrace
> I think we should go back to the WARN() thing.
I don't really have strong opinion on this, but someone should pick the
poison from the two options above; I tried in the mentioned commit, but
apparently not to general satisfaction :)
Thanks,
--
Jiri Kosina
SUSE Labs
On Sat, 19 Mar 2016, Jiri Kosina wrote:
> > It was Jiri who made this mess by replacing WARN(), which has a very
> > distinct format, with this custom stuff.
>
> Right, this was in 55537871e ("kernel/watchdog.c: perform all-CPU
> backtrace in case of hard lockup").
>
> To be really honest, I don't really have 100% reliable explanation why I
> did it this way; I *think* it was because otherwise, in case WARN is
> preserved, we'll either
>
> - have different output format for current (warning) CPU and all the
> others (trigger_allbutself_cpu_backtrace())
>
> or
>
> - have duplicated backtrace for current CPU, one coming from WARN and one
> coming from the all-cpu backtrace
>
> > I think we should go back to the WARN() thing.
>
> I don't really have strong opinion on this, but someone should pick the
> poison from the two options above; I tried in the mentioned commit, but
> apparently not to general satisfaction :)
Hmm, or we can actually make the whole thing behave the way that was
introduced by 55537871e only in case of
sysctl_hardlockup_all_cpu_backtrace being set, and use WARN() otherwise.
I'll send out a patch for this during / after the weekend if noone else
beats me to do it.
--
Jiri Kosina
SUSE Labs