2021-03-16 16:51:50

by Jiri Olsa

Subject: unknown NMI on AMD Rome

hi,
when running 'perf top' on AMD Rome (/proc/cpuinfo below)
with fedora 33 kernel 5.10.22-200.fc33.x86_64

we got unknown NMI messages:

[ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
[ 226.700162] Do you have a strange power saving mode enabled?
[ 226.700163] Dazed and confused, but trying to continue
[ 226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
[ 226.769566] Do you have a strange power saving mode enabled?
[ 226.769567] Dazed and confused, but trying to continue
[ 226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
[ 226.769773] Do you have a strange power saving mode enabled?
[ 226.769774] Dazed and confused, but trying to continue
[ 226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
[ 226.812846] Do you have a strange power saving mode enabled?
[ 226.812847] Dazed and confused, but trying to continue
[ 226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
[ 226.893785] Do you have a strange power saving mode enabled?
[ 226.893786] Dazed and confused, but trying to continue
[ 226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
[ 226.900141] Do you have a strange power saving mode enabled?
[ 226.900143] Dazed and confused, but trying to continue
[ 226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
[ 226.908765] Do you have a strange power saving mode enabled?
[ 226.908766] Dazed and confused, but trying to continue
[ 227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
[ 227.751298] Do you have a strange power saving mode enabled?
[ 227.751299] Dazed and confused, but trying to continue
[ 227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
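
To see how the messages above are spread over reason codes and CPUs, a small tally helps. This is only a sketch: the function name `tally_unknown_nmis` is made up for illustration, and the regular expression assumes the message wording is exactly as quoted in this thread.

```python
import re
from collections import Counter

# Assumption: the kernel message is worded exactly as in the log quoted above.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-f]+) on CPU (\d+)")


def tally_unknown_nmis(lines):
    """Count unknown-NMI messages per reason code and per CPU.

    `lines` is any iterable of log lines, e.g. an open file or sys.stdin.
    Returns a pair of Counters: (by_reason, by_cpu).
    """
    by_reason = Counter()
    by_cpu = Counter()
    for line in lines:
        match = NMI_RE.search(line)
        if match:
            by_reason[match.group(1)] += 1
            by_cpu[int(match.group(2))] += 1
    return by_reason, by_cpu
```

Feeding it saved dmesg output (for example `tally_unknown_nmis(open("dmesg.txt"))`, where the file name is hypothetical) shows quickly whether the NMIs cluster on particular CPUs.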

also when discussing this with Borislav, he managed to reproduce easily
on his AMD Rome machine

any idea?

thanks,
jirka


---
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7742 64-Core Processor
stepping : 0
microcode : 0x8301034
cpu MHz : 1497.024
cache size : 512 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 64
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4491.76
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]


2021-03-16 20:49:35

by Adam Borowski

Subject: Re: unknown NMI on AMD Rome

On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> hi,
> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> with fedora 33 kernel 5.10.22-200.fc33.x86_64
>
> we got unknown NMI messages:
>
> [ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> [ 226.700162] Do you have a strange power saving mode enabled?
> [ 226.700163] Dazed and confused, but trying to continue
>
> also when discussing this with Borislav, he managed to reproduce easily
> on his AMD Rome machine

Likewise, 3c on Pinnacle Ridge.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀ -- <willmore> on #linux-sunxi
⠈⠳⣄⠀⠀⠀⠀

2021-03-16 20:59:17

by Alexander Monakov

Subject: Re: unknown NMI on AMD Rome

On Tue, 16 Mar 2021, Adam Borowski wrote:

> On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> > hi,
> > when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> > with fedora 33 kernel 5.10.22-200.fc33.x86_64
> >
> > we got unknown NMI messages:
> >
> > [ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> > [ 226.700162] Do you have a strange power saving mode enabled?
> > [ 226.700163] Dazed and confused, but trying to continue
> >
> > also when discussing this with Borislav, he managed to reproduce easily
> > on his AMD Rome machine
>
> Likewise, 3c on Pinnacle Ridge.

I've also seen it on Renoir, and it appears related to PMU interrupt racing
against C-state entry/exit. Disabling C2 and C3 via 'cpupower' is enough to
avoid those NMIs in my case.

IIRC there were a few patches related to this area from AMD in the past.

Alexander
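
For readers without 'cpupower' at hand, the same effect can be approximated through the cpuidle sysfs files. The sketch below is an assumption-laden illustration, not what was run in this thread: the helper name `set_idle_states` is hypothetical, and the state names ("C2", "C3") depend on the idle driver, so check the `name` files on the machine first. Writing to the real sysfs tree needs root.

```python
import glob
import os


def set_idle_states(names, disable=True, root="/sys/devices/system/cpu"):
    """Disable (or re-enable) every cpuidle state whose name is in `names`.

    Walks <root>/cpuN/cpuidle/stateM, reads each state's 'name' file and
    writes "1" (disable) or "0" (enable) to its 'disable' file.
    Returns the list of state directories that were changed.
    """
    changed = []
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        if name in names:
            with open(os.path.join(state_dir, "disable"), "w") as f:
                f.write("1" if disable else "0")
            changed.append(state_dir)
    return changed
```

Calling `set_idle_states({"C2", "C3"})` before profiling and `set_idle_states({"C2", "C3"}, disable=False)` afterwards restores the previous behaviour, assuming nothing else changed those files in between.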

2021-03-16 21:24:24

by Kim Phillips

Subject: Re: unknown NMI on AMD Rome

On 3/16/21 2:53 PM, Peter Zijlstra wrote:
> On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
>> hi,
>> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
>> with fedora 33 kernel 5.10.22-200.fc33.x86_64
>>
>> we got unknown NMI messages:
>>
>> [ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
>> [ 226.700162] Do you have a strange power saving mode enabled?
>> [ 226.700163] Dazed and confused, but trying to continue
>> [ 226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
>> [ 226.769566] Do you have a strange power saving mode enabled?
>> [ 226.769567] Dazed and confused, but trying to continue
>> [ 226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
>> [ 226.769773] Do you have a strange power saving mode enabled?
>> [ 226.769774] Dazed and confused, but trying to continue
>> [ 226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
>> [ 226.812846] Do you have a strange power saving mode enabled?
>> [ 226.812847] Dazed and confused, but trying to continue
>> [ 226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
>> [ 226.893785] Do you have a strange power saving mode enabled?
>> [ 226.893786] Dazed and confused, but trying to continue
>> [ 226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
>> [ 226.900141] Do you have a strange power saving mode enabled?
>> [ 226.900143] Dazed and confused, but trying to continue
>> [ 226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
>> [ 226.908765] Do you have a strange power saving mode enabled?
>> [ 226.908766] Dazed and confused, but trying to continue
>> [ 227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
>> [ 227.751298] Do you have a strange power saving mode enabled?
>> [ 227.751299] Dazed and confused, but trying to continue
>> [ 227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
>>
>> also when discussing this with Borislav, he managed to reproduce easily
>> on his AMD Rome machine
>>
>> any idea?
>
> Kim is the AMD point person for this I think..

Since perf top requests precise sampling, and therefore uses IBS,
this looks like it's hitting erratum #1215:

https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Kim

2021-03-16 21:26:09

by Peter Zijlstra

Subject: Re: unknown NMI on AMD Rome

On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> hi,
> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> with fedora 33 kernel 5.10.22-200.fc33.x86_64
>
> we got unknown NMI messages:
>
> [ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> [ 226.700162] Do you have a strange power saving mode enabled?
> [ 226.700163] Dazed and confused, but trying to continue
> [ 226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
> [ 226.769566] Do you have a strange power saving mode enabled?
> [ 226.769567] Dazed and confused, but trying to continue
> [ 226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
> [ 226.769773] Do you have a strange power saving mode enabled?
> [ 226.769774] Dazed and confused, but trying to continue
> [ 226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
> [ 226.812846] Do you have a strange power saving mode enabled?
> [ 226.812847] Dazed and confused, but trying to continue
> [ 226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
> [ 226.893785] Do you have a strange power saving mode enabled?
> [ 226.893786] Dazed and confused, but trying to continue
> [ 226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
> [ 226.900141] Do you have a strange power saving mode enabled?
> [ 226.900143] Dazed and confused, but trying to continue
> [ 226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
> [ 226.908765] Do you have a strange power saving mode enabled?
> [ 226.908766] Dazed and confused, but trying to continue
> [ 227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
> [ 227.751298] Do you have a strange power saving mode enabled?
> [ 227.751299] Dazed and confused, but trying to continue
> [ 227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
>
> also when discussing this with Borislav, he managed to reproduce easily
> on his AMD Rome machine
>
> any idea?

Kim is the AMD point person for this I think..

>
> thanks,
> jirka
>
>
> ---
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 23
> model : 49
> model name : AMD EPYC 7742 64-Core Processor
> stepping : 0
> microcode : 0x8301034
> cpu MHz : 1497.024
> cache size : 512 KB
> physical id : 0
> siblings : 64
> core id : 0
> cpu cores : 64
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 16
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
> bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
> bogomips : 4491.76
> TLB size : 3072 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 43 bits physical, 48 bits virtual
> power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
>

2021-03-17 08:50:25

by Ingo Molnar

Subject: Re: unknown NMI on AMD Rome


* Kim Phillips wrote:

> On 3/16/21 2:53 PM, Peter Zijlstra wrote:
> > On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> >> hi,
> >> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> >> with fedora 33 kernel 5.10.22-200.fc33.x86_64
> >>
> >> we got unknown NMI messages:
> >>
> >> [ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> >> [ 226.700162] Do you have a strange power saving mode enabled?
> >> [ 226.700163] Dazed and confused, but trying to continue
> >> [ 226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
> >> [ 226.769566] Do you have a strange power saving mode enabled?
> >> [ 226.769567] Dazed and confused, but trying to continue
> >> [ 226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
> >> [ 226.769773] Do you have a strange power saving mode enabled?
> >> [ 226.769774] Dazed and confused, but trying to continue
> >> [ 226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
> >> [ 226.812846] Do you have a strange power saving mode enabled?
> >> [ 226.812847] Dazed and confused, but trying to continue
> >> [ 226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
> >> [ 226.893785] Do you have a strange power saving mode enabled?
> >> [ 226.893786] Dazed and confused, but trying to continue
> >> [ 226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
> >> [ 226.900141] Do you have a strange power saving mode enabled?
> >> [ 226.900143] Dazed and confused, but trying to continue
> >> [ 226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
> >> [ 226.908765] Do you have a strange power saving mode enabled?
> >> [ 226.908766] Dazed and confused, but trying to continue
> >> [ 227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
> >> [ 227.751298] Do you have a strange power saving mode enabled?
> >> [ 227.751299] Dazed and confused, but trying to continue
> >> [ 227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
> >>
> >> also when discussing this with Borislav, he managed to reproduce easily
> >> on his AMD Rome machine
> >>
> >> any idea?
> >
> > Kim is the AMD point person for this I think..
>
> Since perf top invokes precision and therefore IBS,
> this looks like it's hitting erratum #1215:
>
> https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

So:


1215 IBS (Instruction Based Sampling) Counter Valid Value
May be Incorrect After Exit From Core C6 (CC6) State

Description

If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
issued, but an invalid value of the valid bit may be restored when the core exits CC6.

Potential Effect on System

The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe an
valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
Linux systems.

Suggested Workaround: None
Fix Planned: No fix planned

lovely.

Thanks,

Ingo

2021-03-17 10:15:52

by Peter Zijlstra

Subject: Re: unknown NMI on AMD Rome

On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
>
> So:
>
>
> 1215 IBS (Instruction Based Sampling) Counter Valid Value
> May be Incorrect After Exit From Core C6 (CC6) State
>
> Description
>
> If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
> Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
> issued, but an invalid value of the valid bit may be restored when the core exits CC6.
> Potential Effect on System
>
> The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe an
> valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
> Linux systems.
>
> Suggested Workaround: None
> Fix Planned: No fix planned

Should be simple enough to disable CC6 while IBS is in use. Kim, can you
please make that happen?
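
Until something like that exists in the kernel, a userspace approximation is to hold a PM QoS CPU latency request open for the duration of the profiling run. The sketch below assumes the documented `/dev/cpu_dma_latency` interface (a 32-bit value in microseconds, active while the file stays open); the context manager name `cpu_latency_limit` is hypothetical, and whether a given bound actually keeps the cores out of CC6 on these parts has not been verified in this thread.

```python
import contextlib
import os
import struct


@contextlib.contextmanager
def cpu_latency_limit(max_latency_us=0, path="/dev/cpu_dma_latency"):
    """Hold a PM QoS CPU latency request while the 'with' block runs.

    The request is made by writing a native 32-bit integer to `path`.
    The kernel drops the request again when the descriptor is closed.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, struct.pack("i", max_latency_us))
        yield
    finally:
        os.close(fd)
```

Wrapping the profiler, e.g. running `subprocess.run(["perf", "top"])` inside `with cpu_latency_limit(0):`, keeps the request scoped to the run so deep idle states come back afterwards. Opening the device normally requires root.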

2021-03-17 13:34:06

by Alexander Monakov

Subject: Re: unknown NMI on AMD Rome

On Wed, 17 Mar 2021, Peter Zijlstra wrote:

> On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> >
> > So:
> >
> >
> > 1215 IBS (Instruction Based Sampling) Counter Valid Value
> > May be Incorrect After Exit From Core C6 (CC6) State
> >
> > Description
> >
> > If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
> > Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
> > issued, but an invalid value of the valid bit may be restored when the core exits CC6.
> > Potential Effect on System
> >
> > The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe an
> > valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
> > Linux systems.
> >
> > Suggested Workaround: None
> > Fix Planned: No fix planned
>
> Should be simple enough to disable CC6 while IBS is in use. Kim, can you
> please make that happen?

Wouldn't that "magically" significantly speed up workloads running under
'perf top', in case they don't saturate the CPUs? Scheduling gets
much snappier if the target CPU doesn't need to wake up from deep sleep :)

Alternatively, would you consider adding the erratum reference to the
printk message when IBS is in use, and rate-limit it so it doesn't
flood dmesg? Then the user will know what's going on, and may
choose to temporarily disable C-states using the 'cpupower' tool.

Alexander

2021-03-17 17:12:36

by Arnaldo Carvalho de Melo

Subject: Re: unknown NMI on AMD Rome

Em Wed, Mar 17, 2021 at 04:32:17PM +0300, Alexander Monakov escreveu:
> On Wed, 17 Mar 2021, Peter Zijlstra wrote:
> > On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > > > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

> > > 1215 IBS (Instruction Based Sampling) Counter Valid Value
> > > May be Incorrect After Exit From Core C6 (CC6) State

> > > Description

> > > If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
> > > Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
> > > issued, but an invalid value of the valid bit may be restored when the core exits CC6.
> > > Potential Effect on System

> > > The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe an
> > > valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
> > > Linux systems.

> > > Suggested Workaround: None
> > > Fix Planned: No fix planned

> > Should be simple enough to disable CC6 while IBS is in use. Kim, can you
> > please make that happen?

> Wouldn't that "magically" significantly speed up workloads running under
> 'perf top', in case they don't saturate the CPUs? Scheduling gets
> much snappier if the target CPU doesn't need to wake up from deep sleep :)

> Alternatively, would you consider adding the errata reference to the
> printk message when IBS is in use, and rate-limit it so it doesn't
> flood dmesg? Then the user will know what's going on, and may
> choose to temporarily disable C-states using the 'cpupower' tool.

It would also be interesting to make 'perf top' detect this somehow
(by looking at some cpu id, etc.) and either not use IBS when C-states
are in use or warn the user about the situation, i.e. that cycles:P
can't be used on this machine if C-states are enabled.

- Arnaldo
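
A rough idea of what such a check could look like from userspace, based only on the /proc/cpuinfo fields posted at the top of this thread. This is a sketch under assumptions: the function names are hypothetical, and matching on "AuthenticAMD", family 23 and the `ibs` flag is a guess at a useful filter drawn from the reports here, not the erratum's actual list of affected parts.

```python
def parse_first_cpu(cpuinfo_text):
    """Return the 'key : value' fields of the first processor entry."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def may_hit_ibs_nmi(cpuinfo_text):
    """Heuristic: AMD family 23 CPU that advertises IBS.

    Assumption: family 23 is used because the reports in this thread
    (Rome, Pinnacle Ridge, Renoir) share it; it is not an official list.
    """
    cpu = parse_first_cpu(cpuinfo_text)
    return (
        cpu.get("vendor_id") == "AuthenticAMD"
        and cpu.get("cpu family") == "23"
        and "ibs" in cpu.get("flags", "").split()
    )
```

A tool could call `may_hit_ibs_nmi(open("/proc/cpuinfo").read())` and, when it returns True and deep idle states are enabled, print a warning instead of silently sampling with IBS.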