2013-09-23 21:25:09

by Mike Travis

Subject: [PATCH 0/7] x86/UV/KDB/NMI: Updates for NMI/KDB handler for SGI UV


V3: Reduce the number of changes to the KGDB/KDB code to simplify the
special handling of the system NMI. Remove the 'disable UV NMI' function.

V2: Split KDB updates from NMI updates. Broke up the big patch to
uv_nmi.c into smaller patches. Updated to the latest Linux
kernel version.

The current UV NMI handler has not been updated for the changes in the
system NMI handler and the perf operations. The UV NMI handler reads
an MMR in the UV Hub to check whether the NMI event was caused by
the external 'system NMI' that the operator can initiate on the System
Mgmt Controller.
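
(Roughly, the per-NMI check amounts to reading a hub MMR and testing a
"system NMI pending" bit. A minimal sketch; the MMR and bit names below
are illustrative, not the exact ones used in uv_nmi.c:)

/* Sketch only: UVH_NMI_MMR and UVH_NMI_MMR_PENDING are illustrative names. */
static int uv_nmi_test_mmr(void)
{
        unsigned long val = uv_read_local_mmr(UVH_NMI_MMR);

        /* Nonzero only when the System Mgmt Controller raised the NMI. */
        return (val & UVH_NMI_MMR_PENDING) != 0;
}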

The problem arises when the perf tools are running, causing millions of
perf events per second on very large CPU count systems. Previously this
was okay because the perf NMI handler ran at a higher priority on the
NMI call chain; if the NMI was a perf event, the remaining handlers on
the chain were not called.

Now the system NMI handler calls all the handlers on the NMI call
chain, including the UV NMI handler. This causes the UV NMI handler
to read the MMRs at the same millions-per-second rate, which can lead
to significant performance loss and possible system failures. It can
also cause thousands of 'Dazed and Confused' messages to be sent to the
system console. This effectively makes perf tools unusable on UV systems.
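
(For reference, a handler on the shared NMI_LOCAL chain registers and
reports back roughly as sketched below; uv_nmi_test_mmr() is the
illustrative MMR check from the earlier sketch, and every handler on the
chain now gets called for every NMI, perf-generated or not:)

#include <asm/nmi.h>

static int uv_nmi_test_mmr(void);       /* illustrative MMR check, see above */

/* Sketch: a UV-style handler sitting on the shared NMI_LOCAL chain. */
static int uv_handle_nmi(unsigned int cmd, struct pt_regs *regs)
{
        /* Runs for every NMI delivered, including the perf NMIs. */
        if (!uv_nmi_test_mmr())
                return NMI_DONE;        /* not a system NMI; not ours */

        /* ... handle the external system NMI (dump stacks, enter KDB, ...) ... */
        return NMI_HANDLED;
}

static int __init uv_register_nmi(void)
{
        return register_nmi_handler(NMI_LOCAL, uv_handle_nmi, 0, "uv");
}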

This patch set addresses this problem and allows the perf tools to run on
UV without impacting performance or causing system failures.

--


2013-09-24 07:52:17

by Ingo Molnar

Subject: Re: [PATCH 0/7] x86/UV/KDB/NMI: Updates for NMI/KDB handler for SGI UV


Hm, do you test-build your patches? This series produces the following
annoying warning:

arch/x86/platform/uv/uv_nmi.c: In function ‘uv_nmi_setup’:
arch/x86/platform/uv/uv_nmi.c:664:2: warning: the address of ‘uv_nmi_cpu_mask’ will always evaluate as ‘true’ [-Waddress]

This:

alloc_cpumask_var(&uv_nmi_cpu_mask, GFP_KERNEL);
BUG_ON(!uv_nmi_cpu_mask);


The way to check for allocation failures is to check the return value
of alloc_cpumask_var():

BUG_ON(!alloc_cpumask_var(&uv_nmi_cpu_mask, GFP_KERNEL));

I've fixed this in the patch.

Thanks,

Ingo

2013-09-24 13:52:17

by Mike Travis

Subject: Re: [PATCH 0/7] x86/UV/KDB/NMI: Updates for NMI/KDB handler for SGI UV



On 9/24/2013 12:52 AM, Ingo Molnar wrote:
>
> Hm, do you test-build your patches?

Both build and test incessantly...

> This series produces the following
> annoying warning:
>
> arch/x86/platform/uv/uv_nmi.c: In function ‘uv_nmi_setup’:
> arch/x86/platform/uv/uv_nmi.c:664:2: warning: the address of ‘uv_nmi_cpu_mask’ will always evaluate as ‘true’ [-Waddress]

I didn't hit the above warning since I never tried building without
CONFIG_CPUMASK_OFFSTACK defined. I wonder if uv_nmi.c should even
be built when not on an enterprise-sized system?
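
(For context, why the warning only shows up with CONFIG_CPUMASK_OFFSTACK
unset -- simplified from <linux/cpumask.h>:)

#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;          /* real pointer, may be NULL */
bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags);
#else
typedef struct cpumask cpumask_var_t[1];        /* array: address never NULL */
static inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
        return true;                            /* nothing to allocate */
}
#endif

(With the offstack case disabled, uv_nmi_cpu_mask is an array, so
BUG_ON(!uv_nmi_cpu_mask) tests the address of an array and gcc emits
-Waddress; checking the return value of alloc_cpumask_var() works in
both configurations.)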

>
> This:
>
> alloc_cpumask_var(&uv_nmi_cpu_mask, GFP_KERNEL);
> BUG_ON(!uv_nmi_cpu_mask);
>
>
> the way to check for allocation failures is by checking the return value
> of alloc_cpumask_var():
>
> BUG_ON(!alloc_cpumask_var(&uv_nmi_cpu_mask, GFP_KERNEL));
>
> I've fixed this in the patch.

Thanks!! I should have remembered this since it was my code. (doh!)
>
> Thanks,
>
> Ingo
>

2013-09-24 14:59:16

by Ingo Molnar

Subject: Re: [PATCH 0/7] x86/UV/KDB/NMI: Updates for NMI/KDB handler for SGI UV


* Mike Travis <[email protected]> wrote:

> On 9/24/2013 12:52 AM, Ingo Molnar wrote:
> >
> > Hm, do you test-build your patches?
>
> Both build and test incessantly...
>
> > This series produces the following
> > annoying warning:
> >
> > arch/x86/platform/uv/uv_nmi.c: In function ‘uv_nmi_setup’:
> > arch/x86/platform/uv/uv_nmi.c:664:2: warning: the address of ‘uv_nmi_cpu_mask’ will always evaluate as ‘true’ [-Waddress]
>
> I didn't hit the above warning since I never tried building without
> CONFIG_CPUMASK_OFFSTACK defined. [...]

Ok, that explains it!

> [...] I wonder if uv_nmi.c should even be built when not on an
> enterprise-sized system?

I don't think so - the config variations help root out such bugs.

Thanks,

Ingo