2015-02-02 16:36:14

by Aravind Gopalakrishnan

Subject: [PATCH V2] x86, mce, amd: Enable interrupts by default if HW capable

We set up APIC vectors for threshold errors if interrupt_capable.
However, we don't set interrupt_enable by default.
Rework threshold_restart_bank() so that the first time we
set up lvt_offset, we also set IntType to APIC.

The user can still disable interrupts through sysfs.

While at it, check that the status is valid before logging the
error with mce_log(). On multi-node platforms, only the NBC has
valid status info, so decoding the status values on the non-NBC
cores produces noise in the kernel logs like so:

[ 440.509744] EDAC DEBUG: amd64_inject_write_store: section=0x80000000
word_bits=0x10020001
[ 466.570925] [Hardware Error]: Corrected error, no action required.
[ 466.570935] [Hardware Error]: CPU:25 (15:2:0) MC4_STATUS[-|CE|-|-|-
[ 466.570936] [Hardware Error]: Corrected error, no action required.
[ 466.570959] [Hardware Error]: CPU:26 (15:2:0) MC4_STATUS[-|CE|-|-|-
<...>
[ 466.571293] WARNING: CPU: 25 PID: 0 at drivers/edac/amd64_edac.c:2147
decode_bus_error+0x1ba/0x2a0()
[ 466.571301] WARNING: CPU: 26 PID: 0 at drivers/edac/amd64_edac.c:2147
decode_bus_error+0x1ba/0x2a0()
[ 466.571303] Something is rotten in the state of Denmark.

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Aravind Gopalakrishnan <[email protected]>
---
Changes in V2:
- The earlier removal of the bank == 4 check and of the
'interrupt_enable' attribute caused regressions. Fixed that.
- Moving the setting of threshold_limit and the comment style fixes
are not directly related to this patch, so drop them to cut out any
distractions.
- Add a fix for garbled dmesg output on multi-node platforms and modify
the commit message to reflect the change.
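
For readers who want to see the valid-bit check in isolation, here is a
minimal user-space sketch of the idea behind the amd_threshold_interrupt()
hunk below. It is not kernel code: struct sketch_mce, SKETCH_STATUS_VAL and
sketch_prepare_record() are hypothetical names made up for illustration.
The only hardware fact assumed is that bit 63 of MCi_STATUS is the Valid
bit, which the kernel calls MCI_STATUS_VAL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 of MCi_STATUS is the Valid bit (MCI_STATUS_VAL in the kernel). */
#define SKETCH_STATUS_VAL (1ULL << 63)

/* Hypothetical stand-in for struct mce, reduced to the fields used here. */
struct sketch_mce {
	uint64_t status;
	uint64_t misc;
	unsigned int bank;
};

/*
 * Store the status that was read from the bank. If it is not valid,
 * return false so the caller skips logging; otherwise fill in misc and
 * bank and return true.
 */
static bool sketch_prepare_record(struct sketch_mce *m, uint64_t status,
				  uint32_t high, uint32_t low,
				  unsigned int bank)
{
	m->status = status;
	if (!(m->status & SKETCH_STATUS_VAL))
		return false;

	/* Same packing as the patch: high word in bits 63:32, low in 31:0. */
	m->misc = ((uint64_t)high << 32) | low;
	m->bank = bank;
	return true;
}
```

A caller would only hand the record to its logging routine when the
function returns true, which mirrors the early return added before
mce_log() in the patch.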

arch/x86/kernel/cpu/mcheck/mce_amd.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index f1c3769..82c5144 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -250,6 +250,7 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 			if (!b.interrupt_capable)
 				goto init;

+			b.interrupt_enable = 1;
 			new = (high & MASK_LVTOFF_HI) >> 20;
 			offset = setup_APIC_mce(offset, new);

@@ -322,6 +323,8 @@ static void amd_threshold_interrupt(void)
 log:
 	mce_setup(&m);
 	rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
+	if (!(m.status & MCI_STATUS_VAL))
+		return;
 	m.misc = ((u64)high << 32) | low;
 	m.bank = bank;
 	mce_log(&m);
@@ -497,10 +500,12 @@ static int allocate_threshold_blocks(unsigned int cpu, unsigned int bank,
 	b->interrupt_capable = lvt_interrupt_supported(bank, high);
 	b->threshold_limit = THRESHOLD_MAX;

-	if (b->interrupt_capable)
+	if (b->interrupt_capable) {
 		threshold_ktype.default_attrs[2] = &interrupt_enable.attr;
-	else
+		b->interrupt_enable = 1;
+	} else {
 		threshold_ktype.default_attrs[2] = NULL;
+	}

 	INIT_LIST_HEAD(&b->miscj);

--
2.1.0


2015-02-06 16:39:10

by Borislav Petkov

Subject: Re: [PATCH V2] x86, mce, amd: Enable interrupts by default if HW capable

On Mon, Feb 02, 2015 at 11:02:41AM -0600, Aravind Gopalakrishnan wrote:
> We set up APIC vectors for threshold errors if interrupt_capable.
> However, we don't set interrupt_enable by default.
> Rework threshold_restart_bank() so that the first time we
> set up lvt_offset, we also set IntType to APIC.
>
> The user can still disable interrupts through sysfs.
>
> While at it, check that the status is valid before logging the
> error with mce_log(). On multi-node platforms, only the NBC has
> valid status info, so decoding the status values on the non-NBC
> cores produces noise in the kernel logs like so:
>
> [ 440.509744] EDAC DEBUG: amd64_inject_write_store: section=0x80000000
> word_bits=0x10020001
> [ 466.570925] [Hardware Error]: Corrected error, no action required.
> [ 466.570935] [Hardware Error]: CPU:25 (15:2:0) MC4_STATUS[-|CE|-|-|-
> [ 466.570936] [Hardware Error]: Corrected error, no action required.
> [ 466.570959] [Hardware Error]: CPU:26 (15:2:0) MC4_STATUS[-|CE|-|-|-
> <...>
> [ 466.571293] WARNING: CPU: 25 PID: 0 at drivers/edac/amd64_edac.c:2147
> decode_bus_error+0x1ba/0x2a0()
> [ 466.571301] WARNING: CPU: 26 PID: 0 at drivers/edac/amd64_edac.c:2147
> decode_bus_error+0x1ba/0x2a0()
> [ 466.571303] Something is rotten in the state of Denmark.
>
> Suggested-by: Borislav Petkov <[email protected]>
> Signed-off-by: Aravind Gopalakrishnan <[email protected]>

Queued for 3.21, thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.