2019-02-20 01:49:33

by Olof Johansson

Subject: [PATCH] x86/nmi: ratelimit unknown nmi logs

Getting notified of unknown NMIs is obviously important, but getting
notified on every single one, especially on larger systems with slow
(serial) console causes more harm than good when it's a known noisy
non-relevant event.

So, let's ratelimit to avoid locking up the system.

Signed-off-by: Olof Johansson <[email protected]>
---
arch/x86/kernel/nmi.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 18bc9b51ac9b9..44050cbfee136 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -292,14 +292,14 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)

 	__this_cpu_add(nmi_stats.unknown, 1);

-	pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
+	pr_emerg_ratelimited("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
 		 reason, smp_processor_id());

-	pr_emerg("Do you have a strange power saving mode enabled?\n");
+	pr_emerg_ratelimited("Do you have a strange power saving mode enabled?\n");
 	if (unknown_nmi_panic || panic_on_unrecovered_nmi)
 		nmi_panic(regs, "NMI: Not continuing");

-	pr_emerg("Dazed and confused, but trying to continue\n");
+	pr_emerg_ratelimited("Dazed and confused, but trying to continue\n");
 }
 NOKPROBE_SYMBOL(unknown_nmi_error);

--
2.11.0
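
For readers unfamiliar with what the `_ratelimited` variants do, here is a minimal userspace sketch of interval/burst rate limiting. It is not the kernel's implementation: `struct ratelimit`, `ratelimit_ok()` and `simulate()` are hypothetical names invented for illustration, and time is passed in as a plain tick counter instead of being read from jiffies. The idea it models is that at most `burst` messages are let through per `interval`, and the rest are counted and dropped. The kernel's defaults are a burst of 10 messages per 5-second interval.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a per-call-site ratelimit state. */
struct ratelimit {
	long interval;	/* window length, in arbitrary ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages let through in the current window */
	int missed;	/* messages suppressed in the current window */
	bool started;	/* false until the first call opens a window */
};

/* Returns true if the caller may emit a message at tick `now`. */
static bool ratelimit_ok(struct ratelimit *rs, long now)
{
	assert(rs && rs->burst >= 0);

	/* Open a fresh window on first use or once the interval has elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}

/*
 * Feeds `events` back-to-back events, one per tick starting at `start`,
 * and returns how many of them would have been logged.
 */
static int simulate(struct ratelimit *rs, long start, int events)
{
	int emitted = 0;

	for (int i = 0; i < events; i++)
		if (ratelimit_ok(rs, start + i))
			emitted++;
	return emitted;
}
```

With `interval = 5000` and `burst = 10`, a storm of 1000 events inside one window yields 10 logged lines and 990 suppressed ones, which is the effect the patch is after on a slow serial console. In the kernel, each `pr_emerg_ratelimited()` call site gets its own static ratelimit state, so the three messages in `unknown_nmi_error()` are throttled independently of one another.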



2019-02-20 09:01:24

by Peter Zijlstra

Subject: Re: [PATCH] x86/nmi: ratelimit unknown nmi logs

On Tue, Feb 19, 2019 at 05:48:36PM -0800, Olof Johansson wrote:
> Getting notified of unknown NMIs is obviously important, but getting
> notified on every single one, especially on larger systems with slow
> (serial) console causes more harm than good when it's a known noisy
> non-relevant event.
>
> So, let's ratelimit to avoid locking up the system.

What kind of bonghit broken crap system is that?

That is; this _really_ should not happen, and this is a bandaid, not
fixing the cause.

2019-02-20 18:01:17

by Olof Johansson

Subject: Re: [PATCH] x86/nmi: ratelimit unknown nmi logs

On Wed, Feb 20, 2019 at 12:59 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Feb 19, 2019 at 05:48:36PM -0800, Olof Johansson wrote:
> > Getting notified of unknown NMIs is obviously important, but getting
> > notified on every single one, especially on larger systems with slow
> > (serial) console causes more harm than good when it's a known noisy
> > non-relevant event.
> >
> > So, let's ratelimit to avoid locking up the system.
>
> What kind of bonghit broken crap system is that?
>
> That is; this _really_ should not happen, and this is a bandaid, not
> fixing the cause.

Oh, I agree -- this shouldn't happen, and it's being debugged and fixed.

So, I'm not looking at this as a bandaid to the real problem, but
there's also no reason to DoS the system with printk when it does
occur. If you want to configure the system to panic on unknown NMI
there are already hooks for it.

I'm obviously happy to carry local patches for this, since it's a
temporary problem. But yet again, I don't see a reason to have the
kernel run off the rails for this condition.


-Olof

2019-02-26 11:55:59

by Peter Zijlstra

Subject: Re: [PATCH] x86/nmi: ratelimit unknown nmi logs

On Wed, Feb 20, 2019 at 10:00:28AM -0800, Olof Johansson wrote:
> On Wed, Feb 20, 2019 at 12:59 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Feb 19, 2019 at 05:48:36PM -0800, Olof Johansson wrote:
> > > Getting notified of unknown NMIs is obviously important, but getting
> > > notified on every single one, especially on larger systems with slow
> > > (serial) console causes more harm than good when it's a known noisy
> > > non-relevant event.
> > >
> > > So, let's ratelimit to avoid locking up the system.
> >
> > What kind of bonghit broken crap system is that?

Still interested to know what system and why this happens.

> > That is; this _really_ should not happen, and this is a bandaid, not
> > fixing the cause.
>
> Oh, I agree -- this shouldn't happen, and it's being debugged and fixed.
>
> So, I'm not looking at this as a bandaid to the real problem, but
> there's also no reason to DoS the system with printk when it does
> occur. If you want to configure the system to panic on unknown NMI
> there are already hooks for it.
>
> I'm obviously happy to carry local patches for this, since it's a
> temporary problem. But yet again, I don't see a reason to have the
> kernel run off the rails for this condition.

Fair enough I suppose. Personally I don't care either way; you could
just boot without the slow serial in order to install a new kernel.