2013-03-04 21:51:12

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> So it looks Bjorn has taken most of them and the e1000e one will go
> through the e1000e maintainers. I'll test after the merge window is
> done.

Issue still persists on 3.9-rc1 :-( :

Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


2013-03-05 00:17:14

by Bjorn Helgaas

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

[+cc e1000-devel, Jeff, Bruce]

On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <[email protected]> wrote:
> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
>> So it looks Bjorn has taken most of them and the e1000e one will go
>> through the e1000e maintainers. I'll test after the merge window is
>> done.
>
> Issue still persists on 3.9-rc1 :-( :
>
> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue

The e1000e changes didn't get merged, did they? I don't see the
following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
3.9-rc1:

e1000e: fix pci-device enable-counter balance
e1000e: fix runtime power management transitions
e1000e: fix accessing to suspended device

2013-03-05 09:42:24

by Jiri Slaby

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> [+cc e1000-devel, Jeff, Bruce]
>
> On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <[email protected]> wrote:
>> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
>>> So it looks Bjorn has taken most of them and the e1000e one will go
>>> through the e1000e maintainers. I'll test after the merge window is
>>> done.
>>
>> Issue still persists on 3.9-rc1 :-( :
>>
>> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
>> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
>> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
>> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
>> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
>
> The e1000e changes didn't get merged, did they? I don't see the
> following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> 3.9-rc1:
>
> e1000e: fix pci-device enable-counter balance
> e1000e: fix runtime power management transitions
> e1000e: fix accessing to suspended device

You're right. They are not even in -next :(.

--
js
suse labs

2013-03-05 09:58:27

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, Mar 05, 2013 at 10:42:17AM +0100, Jiri Slaby wrote:
> On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> > [+cc e1000-devel, Jeff, Bruce]
> >
> > On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <[email protected]> wrote:
> >> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> >>> So it looks Bjorn has taken most of them and the e1000e one will go
> >>> through the e1000e maintainers. I'll test after the merge window is
> >>> done.
> >>
> >> Issue still persists on 3.9-rc1 :-( :
> >>
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
> >
> > The e1000e changes didn't get merged, did they? I don't see the
> > following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> > 3.9-rc1:
> >
> > e1000e: fix pci-device enable-counter balance
> > e1000e: fix runtime power management transitions
> > e1000e: fix accessing to suspended device
>
> You're right. They are not even in -next :(.

Oh, and there's another issue with this driver I reported yesterday:
http://marc.info/?l=linux-kernel&m=136243374114892&w=2:

"Trying to free already-free IRQ 20"

which happens during suspend so it seems also related.

Rafael, what's the state of those patches here:
https://lkml.org/lkml/2013/2/4/185, are they ready to be tested or do you
still have issues with them?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-03-05 10:01:08

by Jiri Slaby

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On 03/05/2013 10:58 AM, Borislav Petkov wrote:
> Rafael, what's the state of those patches here:
> https://lkml.org/lkml/2013/2/4/185, are they ready to be tested or do you
> still have issues with them?

Note there is a resend version:
https://lkml.org/lkml/2013/2/25/3

with a note from Jeff Kirsher:
I have added this patch to my e1000e patch queue.

thanks,
--
js
suse labs

2013-03-05 10:01:23

by Jeff Kirsher

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, 2013-03-05 at 10:42 +0100, Jiri Slaby wrote:
> On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> > [+cc e1000-devel, Jeff, Bruce]
> >
> > On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <[email protected]> wrote:
> >> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> >>> So it looks Bjorn has taken most of them and the e1000e one will go
> >>> through the e1000e maintainers. I'll test after the merge window is
> >>> done.
> >>
> >> Issue still persists on 3.9-rc1 :-( :
> >>
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
> >
> > The e1000e changes didn't get merged, did they? I don't see the
> > following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> > 3.9-rc1:
> >
> > e1000e: fix pci-device enable-counter balance
> > e1000e: fix runtime power management transitions
> > e1000e: fix accessing to suspended device
>
> You're right. They are not even in -next :(.
>

I have them in my queue for net, so I should be pushing them later this
week once validation has a chance to look at them.



2013-03-05 10:02:52

by Jeff Kirsher

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, 2013-03-05 at 10:58 +0100, Borislav Petkov wrote:
> On Tue, Mar 05, 2013 at 10:42:17AM +0100, Jiri Slaby wrote:
> > On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> > > [+cc e1000-devel, Jeff, Bruce]
> > >
> > > On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <[email protected]> wrote:
> > >> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> > >>> So it looks Bjorn has taken most of them and the e1000e one will go
> > >>> through the e1000e maintainers. I'll test after the merge window is
> > >>> done.
> > >>
> > >> Issue still persists on 3.9-rc1 :-( :
> > >>
> > >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> > >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> > >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> > >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> > >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
> > >
> > > The e1000e changes didn't get merged, did they? I don't see the
> > > following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> > > 3.9-rc1:
> > >
> > > e1000e: fix pci-device enable-counter balance
> > > e1000e: fix runtime power management transitions
> > > e1000e: fix accessing to suspended device
> >
> > You're right. They are not even in -next :(.
>
> Oh, and there's another issue with this driver I reported yesterday:
> http://marc.info/?l=linux-kernel&m=136243374114892&w=2:
>
> "Trying to free already-free IRQ 20"
>
> which happens during suspend so it seems also related.
>
> Rafael, what's the state of those patches here:
> > https://lkml.org/lkml/2013/2/4/185, are they ready to be tested or do you
> > still have issues with them?

They are in my queue of e1000e patches for net and are being tested
currently. I should be able to push them upstream this week.



2013-03-05 10:04:08

by Jiri Slaby

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On 03/05/2013 11:01 AM, Jeff Kirsher wrote:
> On Tue, 2013-03-05 at 10:42 +0100, Jiri Slaby wrote:
>>> The e1000e changes didn't get merged, did they? I don't see
>>> the following changes mentioned at
>>> https://lkml.org/lkml/2013/2/4/185 in 3.9-rc1:
>>>
>>> e1000e: fix pci-device enable-counter balance
>>> e1000e: fix runtime power management transitions
>>> e1000e: fix accessing to suspended device
>>
>> You're right. They are not even in -next :(.
>>
>
> I have them in my queue for net, so I should be pushing them later
> this week once validation has a chance to look at them.

Yeah, I've just noticed that here
https://lkml.org/lkml/2013/2/25/3

Thanks a lot.

--
js
suse labs

2013-03-05 10:14:41

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> They are in my queue of e1000e patches for net and are being tested
> currently. I should be able to push them upstream this week.

Right, if you'd like me to run them here too, let me know.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-03-05 10:29:05

by Jeff Kirsher

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
>
> On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > They are in my queue of e1000e patches for net and are being tested
> > currently. I should be able to push them upstream this week.
>
> Right, if you'd like me to run them here too, let me know.

Any additional testing is very much appreciated, so feel free to test
the patches with what hardware you have.

Thanks!



2013-03-05 11:27:41

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, Mar 05, 2013 at 02:29:01AM -0800, Jeff Kirsher wrote:
> On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
> >
> > On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > > They are in my queue of e1000e patches for net and are being tested
> > > currently. I should be able to push them upstream this week.
> >
> > Right, if you'd like me to run them here too, let me know.
>
> Any additional testing is very much appreciated, so feel free to test
> the patches with what hardware you have.

Yep, it looks good, machine suspends ok again. I'll watch it in the next
couple of days.

The only problem that remains is this:

[ 103.137024] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
[ 103.161032] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[ 103.462328] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[ 103.462342] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
[ 108.472847] Uhhuh. NMI received for unknown reason 3c on CPU 0. <---
[ 108.472850] Do you have a strange power saving mode enabled?
[ 108.472851] Dazed and confused, but trying to continue

AFAIR, Rafael said it had something to do with the suspend kernel not
picking up settings done to the main kernel on time. Or something to
that effect, my memory is hazy.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
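
Note on the reason codes: the two reports in this thread show "reason 2c" and "reason 3c". As far as I understand it, on x86 the kernel prints the byte it reads from the legacy NMI status/control port (0x61), and it falls through to the "unknown reason" message when neither the SERR# nor the IOCHK# status bit is set and no registered handler claimed the NMI. The sketch below decodes such a byte. The bit names follow the usual PC chipset description of port 0x61 and are an assumption on my part, not something stated in the thread; the helper names are made up for illustration.

```python
# Commonly documented layout of port 0x61 (NMI status and control).
# Assumption: this machine follows the standard PC-compatible layout.
NMI_REASON_SERR = 0x80   # SERR# reported as NMI source
NMI_REASON_IOCHK = 0x40  # IOCHK# reported as NMI source

PORT61_BITS = {
    0x80: "SERR# NMI source status",
    0x40: "IOCHK# NMI source status",
    0x20: "timer counter 2 output",
    0x10: "refresh cycle toggle",
    0x08: "IOCHK# NMI disabled",
    0x04: "SERR# NMI disabled",
    0x02: "speaker data enable",
    0x01: "timer counter 2 gate",
}


def decode_nmi_reason(reason):
    """Return the names of all bits set in the reason byte, high bit first."""
    return [name for bit, name in PORT61_BITS.items() if reason & bit]


def is_known_reason(reason):
    """True if one of the two NMI source status bits is set."""
    return bool(reason & (NMI_REASON_SERR | NMI_REASON_IOCHK))


if __name__ == "__main__":
    for reason in (0x2C, 0x3C):
        print(f"{reason:#04x}: known={is_known_reason(reason)}",
              decode_nmi_reason(reason))
```

Under that layout, 0x2c and 0x3c differ only in bit 4, which simply toggles over time, so the two messages would describe the same situation: an NMI arrived and the port names no source for it.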

2013-03-05 11:33:47

by Jeff Kirsher

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, 2013-03-05 at 12:27 +0100, Borislav Petkov wrote:
> On Tue, Mar 05, 2013 at 02:29:01AM -0800, Jeff Kirsher wrote:
> > On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
> > >
> > > On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > > > They are in my queue of e1000e patches for net and are being tested
> > > > currently. I should be able to push them upstream this week.
> > >
> > > Right, if you'd like me to run them here too, let me know.
> >
> > Any additional testing is very much appreciated, so feel free to test
> > the patches with what hardware you have.
>
> Yep, it looks good, machine suspends ok again. I'll watch it in the next
> couple of days.
>
> The only problem that remains is this:
>
> [ 103.137024] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
> [ 103.161032] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
> [ 103.462328] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [ 103.462342] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [ 108.472847] Uhhuh. NMI received for unknown reason 3c on CPU 0. <---
> [ 108.472850] Do you have a strange power saving mode enabled?
> [ 108.472851] Dazed and confused, but trying to continue
>
> AFAIR, Rafael said it had something to do with the suspend kernel not
> picking up settings done to the main kernel on time. Or something to
> that effect, my memory is hazy.
>

Would you like me to add your Tested-by: to the patches?



2013-03-05 11:42:44

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tue, Mar 05, 2013 at 03:33:45AM -0800, Jeff Kirsher wrote:
> Would you like me to add your Tested-by: to the patches?

Sure, if you'd like to:

Tested-by: Borislav Petkov <[email protected]>

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-03-06 00:06:30

by Rafael J. Wysocki

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Tuesday, March 05, 2013 12:27:37 PM Borislav Petkov wrote:
> On Tue, Mar 05, 2013 at 02:29:01AM -0800, Jeff Kirsher wrote:
> > On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
> > >
> > > On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > > > They are in my queue of e1000e patches for net and are being tested
> > > > currently. I should be able to push them upstream this week.
> > >
> > > Right, if you'd like me to run them here too, let me know.
> >
> > Any additional testing is very much appreciated, so feel free to test
> > the patches with what hardware you have.
>
> Yep, it looks good, machine suspends ok again. I'll watch it in the next
> couple of days.
>
> The only problem that remains is this:
>
> [ 103.137024] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
> [ 103.161032] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
> [ 103.462328] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [ 103.462342] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [ 108.472847] Uhhuh. NMI received for unknown reason 3c on CPU 0. <---
> [ 108.472850] Do you have a strange power saving mode enabled?
> [ 108.472851] Dazed and confused, but trying to continue
>
> AFAIR, Rafael said it had something to do with the suspend kernel not
> picking up settings done to the main kernel on time. Or something to
> that effect, my memory is hazy.

I suspected that during resume from hibernation the boot kernel (the one that
loaded the image) did something to hardware and the restored kernel didn't
handle that change properly. It is hard to say what piece of hardware that
was, however (it might or might not be the NIC, it may be pure coincidence
that the NMI messages appear in the log at this point).

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2013-03-06 00:19:36

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Wed, Mar 06, 2013 at 01:13:23AM +0100, Rafael J. Wysocki wrote:
> I suspected that during resume from hibernation the boot kernel (the
> one that loaded the image) did something to hardware and the restored
> kernel didn't handle that change properly. It is hard to say what
> piece of hardware that was, however (it might or might not be the NIC,
> it may be pure coincidence that the NMI messages appear in the log at
> this point).

Agreed with the second part. About the first part, who communicates what
to whom, come to think of it, it might not be related to any devices at
all.

Here's why I think so:

So one of the things I did to trigger this is boot the machine, run
powertop and set all the knobs in the "Tunables" tab to "Good". One of
the tunables is turn-off-nmi-watchdog something which turns off the
watchdog which is using the perf infrastructure which generates NMIs
when the counter overflows.

Now, imagine I do that in the "normal" kernel, then suspend,
...<something happens or does not happen>, then resume back into the
normal kernel and it somehow "forgets" the fact that we disabled the NMI
watchdog before the suspend cycle. And boom, it gets a single spurious
NMI.

Does it make sense? I dunno - I'm just connecting the dots here between
the observation points which are most likely.

Anyway, it's getting late, good night. :)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
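
Note: one way to test the "kernel forgets the setting" theory above is to record the watchdog state before suspending and compare it after resume. The sketch below assumes the powertop tunable corresponds to the /proc/sys/kernel/nmi_watchdog sysctl (1 = enabled, 0 = disabled); that mapping is my assumption, and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: this sysctl is what the powertop "NMI watchdog" tunable toggles.
NMI_WATCHDOG = Path("/proc/sys/kernel/nmi_watchdog")


def parse_watchdog(text):
    """Map the sysctl contents to True (enabled), False (disabled) or None."""
    value = text.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def read_watchdog(path=NMI_WATCHDOG):
    """Read the current watchdog state; None if the file cannot be read."""
    try:
        return parse_watchdog(path.read_text())
    except OSError:
        return None


def state_changed(before, after):
    """True only if both readings are valid and they differ."""
    return before is not None and after is not None and before != after


if __name__ == "__main__":
    before = read_watchdog()
    input("Suspend and resume the machine now, then press Enter... ")
    after = read_watchdog()
    print(f"before={before} after={after} changed={state_changed(before, after)}")
```

If the sysctl still reads 0 after resume and an NMI shows up anyway, that would point at the hardware counter being re-armed behind the sysctl's back rather than at the setting itself being lost.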

2013-03-08 16:47:58

by Borislav Petkov

Subject: Re: Uhhuh. NMI received for unknown reason 2c on CPU 0.

On Wed, Mar 06, 2013 at 01:19:32AM +0100, Borislav Petkov wrote:
> On Wed, Mar 06, 2013 at 01:13:23AM +0100, Rafael J. Wysocki wrote:
> > I suspected that during resume from hibernation the boot kernel (the
> > one that loaded the image) did something to hardware and the restored
> > kernel didn't handle that change properly. It is hard to say what
> > piece of hardware that was, however (it might or might not be the NIC,
> > it may be pure coincidence that the NMI messages appear in the log at
> > this point).
>
> Agreed with the second part. About the first part, who communicates what
> to whom, come to think of it, it might not be related to any devices at
> all.
>
> Here's why I think so:
>
> So one of the things I did to trigger this is boot the machine, run
> powertop and set all the knobs in the "Tunables" tab to "Good". One of
> the tunables is turn-off-nmi-watchdog something which turns off the
> watchdog which is using the perf infrastructure which generates NMIs
> when the counter overflows.
>
> Now, imagine I do that in the "normal" kernel, then suspend,
> ...<something happens or does not happen>, then resume back into the
> normal kernel and it somehow "forgets" the fact that we disabled the NMI
> watchdog before the suspend cycle. And boom, it gets a single spurious
> NMI.
>
> Does it make sense? I dunno - I'm just connecting the dots here between
> the observation points which are most likely.
>
> Anyway, it's getting late, good night. :)

Exactly as I thought: so I'm running the machine with NMI watchdog
enabled, i.e. powertop says:


PowerTOP v2.0 Overview Idle stats Frequency stats Device stats Tunables

>> Bad NMI watchdog should be turned off
Good VM writeback timeout
....

and no more spurious NMIs.

I'd say the plot thickens: disabling NMIs and suspending to disk right
afterwards doesn't seem to really disable the watchdog. Or the disable
gets delayed leading to one last spurious NMI when resuming... I
probably need to go stare at the code though...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
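
Note: to see whether the watchdog really stops after being disabled, one can compare the per-CPU NMI counters in /proc/interrupts before and after the suspend cycle. A minimal sketch, assuming the usual "NMI:" row format of that file; the sample text in the usage part is made up for illustration and is not taken from the machine in this thread.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU counters from the 'NMI:' row of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # The counters are followed by a textual description.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


def nmi_delta(before, after):
    """Per-CPU difference between two readings taken from nmi_counts()."""
    return [b - a for a, b in zip(before, after)]


if __name__ == "__main__":
    # Made-up snapshots standing in for two reads of /proc/interrupts.
    snap_before = " NMI:          3          0   Non-maskable interrupts\n"
    snap_after = " NMI:          4          0   Non-maskable interrupts\n"
    print(nmi_delta(nmi_counts(snap_before), nmi_counts(snap_after)))
```

A delta of exactly one on CPU 0 across a resume, with the watchdog disabled, would match the "one last spurious NMI" described above.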