2014-10-08 21:52:20

by Aravind Gopalakrishnan

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

>
> Ok, this return is still bugging me - we're logging the error which
> caused the counter overflow but we go and explicitly clear _STATUS so
> that machine_check_poll doesn't pick up the same error again.
>
> Even though, machine_check_poll is intended to log the thresholding
> error.
>
> Which actually makes me think that that machine_check_poll is actually
> completely useless there. IOW, how about that instead:
>
> ---
> From: Chen Yucong <[email protected] <mailto:[email protected]>>
> Date: Thu, 2 Oct 2014 14:48:19 +0200
> Subject: [PATCH] x86, MCE, AMD: Correct thresholding error logging
>
> mce_setup() does not gather the content of IA32_MCG_STATUS, so it
> should be read explicitly. Moreover, we need to clear IA32_MCx_STATUS
> to avoid that mce_log() logs the processed threshold event again
> at next time.
>
> But we do the logging ourselves and machine_check_poll() is completely
> useless there. So kill it.
>
> Signed-off-by: Chen Yucong <[email protected] <mailto:[email protected]>>
> Signed-off-by: Borislav Petkov <[email protected] <mailto:[email protected]>>
> ---
> arch/x86/kernel/cpu/mcheck/mce_amd.c | 30 +++++++++++++++---------------
> 1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c
> b/arch/x86/kernel/cpu/mcheck/mce_amd.c
> index 1c54d3d61a4d..9ce64955559d 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
> @@ -270,14 +270,13 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
> static void amd_threshold_interrupt(void)
> {
> u32 low = 0, high = 0, address = 0;
> + int cpu = smp_processor_id();
> unsigned int bank, block;
> struct mce m;
>
> - mce_setup(&m);
> -
> /* assume first bank caused it */
> for (bank = 0; bank < mca_cfg.banks; ++bank) {
> - if (!(per_cpu(bank_map, m.cpu) & (1 << bank)))
> + if (!(per_cpu(bank_map, cpu) & (1 << bank)))
> continue;
> for (block = 0; block < NR_BLOCKS; ++block) {
> if (block == 0) {
> @@ -309,20 +308,21 @@ static void amd_threshold_interrupt(void)
> * Log the machine check that caused the threshold
> * event.
> */
> - machine_check_poll(MCP_TIMESTAMP,
> - &__get_cpu_var(mce_poll_banks));
> -
> - if (high & MASK_OVERFLOW_HI) {
> - rdmsrl(address, m.misc);
> - rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
> - m.bank = K8_MCE_THRESHOLD_BASE
> - + bank * NR_BLOCKS
> - + block;
> - mce_log(&m);
> - return;
> - }
> + if (high & MASK_OVERFLOW_HI)
> + goto log;
> }
> }
> + return;
> +
> +log:
> + mce_setup(&m);
> + rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> + rdmsrl(address, m.misc);
> + rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
> + m.bank = K8_MCE_THRESHOLD_BASE + bank * NR_BLOCKS + block;


I am not understanding why m.bank is assigned this value..

It only causes incorrect decoding-
[ 608.832916] DEBUG: raise_amd_threshold_event
[ 608.832926] [Hardware Error]: Corrected error, no action required.
[ 608.833143] [Hardware Error]: CPU:26 (15:2:0)
MC165_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
[ 608.833551] [Hardware Error]: MC165_ADDR: 0x0000000000000000
[ 608.833777] [Hardware Error]: cache level: RESV, tx: INSN
[ 608.834034] amd_inject module loaded ...


(Obviously, as in amd_decode_mce() we switch (m->bank) for decoding the
status and there is no bank 165)

OTOH, if m.bank = bank;
Then we get correct decoding info-
[ 58.021978] DEBUG: raise_amd_threshold_event
[ 58.021992] [Hardware Error]: Corrected error, no action required.
[ 58.022155] [Hardware Error]: CPU:0 (15:60:0)
MC4_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
[ 58.022393] [Hardware Error]: MC4_ADDR: 0x0000000000000000
[ 58.022531] [Hardware Error]: MC4 Error (node 0): DRAM ECC error
detected on the NB.
<snip..it's throws WARN as "Something is rotten in the state of Denmark".>
<.. but that's fine. we are just fake-injecting errors here.. :) >
[ 58.022933] [Hardware Error]: cache level: RESV, tx: INSN
[ 58.023084] amd_inject module loaded ...

Thanks,
-Aravind.

> + mce_log(&m);
> +
> + wrmsrl(MSR_IA32_MCx_STATUS(bank), 0);
> }
>
> /*
> --
> 2.0.0
>
> --
> Regards/Gruss,
> Boris.


2014-10-08 22:58:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Wed, Oct 08, 2014 at 04:52:06PM -0500, Aravind Gopalakrishnan wrote:
> I am not understanding why m.bank is assigned this value..

That's a very good question, see below for some history.

>
> It only causes incorrect decoding-
> [ 608.832916] DEBUG: raise_amd_threshold_event
> [ 608.832926] [Hardware Error]: Corrected error, no action required.
> [ 608.833143] [Hardware Error]: CPU:26 (15:2:0)
> MC165_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
> [ 608.833551] [Hardware Error]: MC165_ADDR: 0x0000000000000000
> [ 608.833777] [Hardware Error]: cache level: RESV, tx: INSN
> [ 608.834034] amd_inject module loaded ...
>
>
> (Obviously, as in amd_decode_mce() we switch (m->bank) for decoding the
> status and there is no bank 165)
>
> OTOH, if m.bank = bank;
> Then we get correct decoding info-

Yes, and I think we should do that only if we're using the *last* error
to report the overflow with: we're reporting a thresholding counter
overflow and the bank on which it was detected on should, of course, be
part of the report.

The "funny" bank is some sort of a software defined banks thing which
got added in 2005 (see the patch I dug out below) and it was supposed
to be used (I'm guessing here) for reporting thermal events using MCA
(dumb idea, if you ask me) so since thermal events don't really have
a bank, they decided to have some sort of a software-defined MCA bank
which doesn't correspond to any hardware bank.

Then Jacob decided to use it for some reason too:

95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")

maybe because thresholding errors don't have a bank associated with them
but if I'm not missing anything, they do!

Oh oh, ok, it just dawned on me! I think I know what it *might* have
been: they wanted to report the overflowing with a special error
signature which uses a software-defined bank. Ok, that actually makes
sense: when you see an error for a sw-defined bank, you're reporting an
thresholding counter overflow.

Which means that we shouldn't be populating m.status either, i.e. what
we did earlier:

rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);

because this is a special error type.

Hmm, it is too late here to think straight, more tomorrow. But Aravind,
that was a very good question, you actually made me dig into git history
:-)

Good night.


>From d2b6331397e634477b76f6fec119b7caf3ac564e Mon Sep 17 00:00:00 2001
From: Zwane Mwaikambo <[email protected]>
Date: Mon, 3 Jan 2005 04:42:52 -0800
Subject: [PATCH] [PATCH] Intel thermal monitor for x86_64

Patch adds support for notification of overheating conditions on intel
x86_64 processors. Tested on EM64T, test booted on AMD64.

Hardware courtesy of Intel Corporation

Signed-off-by: Zwane Mwaikambo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
---
arch/x86_64/Kconfig | 7 +++
arch/x86_64/kernel/Makefile | 1 +
arch/x86_64/kernel/entry.S | 3 ++
arch/x86_64/kernel/i8259.c | 2 +
arch/x86_64/kernel/mce.c | 14 +++++-
arch/x86_64/kernel/mce_intel.c | 99 ++++++++++++++++++++++++++++++++++++++++++
arch/x86_64/kernel/traps.c | 4 ++
include/asm-x86_64/mce.h | 13 ++++++
8 files changed, 142 insertions(+), 1 deletion(-)
create mode 100644 arch/x86_64/kernel/mce_intel.c

diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 4ffa04271050..bc317049ebed 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -338,6 +338,13 @@ config X86_MCE
machine check error logs. See
ftp://ftp.x86-64.org/pub/linux/tools/mcelog

+config X86_MCE_INTEL
+ bool "Intel MCE features"
+ depends on X86_MCE && X86_LOCAL_APIC
+ default y
+ help
+ Additional support for intel specific MCE features such as
+ the thermal monitor.
endmenu

#
diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile
index 2c0f3af82e5e..96a5111e96c6 100644
--- a/arch/x86_64/kernel/Makefile
+++ b/arch/x86_64/kernel/Makefile
@@ -10,6 +10,7 @@ obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o \
setup64.o bootflag.o e820.o reboot.o warmreboot.o quirks.o

obj-$(CONFIG_X86_MCE) += mce.o
+obj-$(CONFIG_X86_MCE_INTEL) += mce_intel.o
obj-$(CONFIG_MTRR) += ../../i386/kernel/cpu/mtrr/
obj-$(CONFIG_ACPI_BOOT) += acpi/
obj-$(CONFIG_X86_MSR) += msr.o
diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S
index d8d906a7d8e1..ca050e729a85 100644
--- a/arch/x86_64/kernel/entry.S
+++ b/arch/x86_64/kernel/entry.S
@@ -538,6 +538,9 @@ retint_kernel:
CFI_ENDPROC
.endm

+ENTRY(thermal_interrupt)
+ apicinterrupt THERMAL_APIC_VECTOR,smp_thermal_interrupt
+
#ifdef CONFIG_SMP
ENTRY(reschedule_interrupt)
apicinterrupt RESCHEDULE_VECTOR,smp_reschedule_interrupt
diff --git a/arch/x86_64/kernel/i8259.c b/arch/x86_64/kernel/i8259.c
index 7929a2e534a6..04e6fdab46b6 100644
--- a/arch/x86_64/kernel/i8259.c
+++ b/arch/x86_64/kernel/i8259.c
@@ -476,6 +476,7 @@ void error_interrupt(void);
void reschedule_interrupt(void);
void call_function_interrupt(void);
void invalidate_interrupt(void);
+void thermal_interrupt(void);

static void setup_timer(void)
{
@@ -550,6 +551,7 @@ void __init init_IRQ(void)
/* IPI for generic function call */
set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
#endif
+ set_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt);

#ifdef CONFIG_X86_LOCAL_APIC
/* self generated IPI for local APIC timer */
diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index 5da150baf25e..6e717e470460 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -43,7 +43,7 @@ struct mce_log mcelog = {
MCE_LOG_LEN,
};

-static void mce_log(struct mce *mce)
+void mce_log(struct mce *mce)
{
unsigned next, entry;
mce->finished = 0;
@@ -305,6 +305,17 @@ static void __init mce_cpu_quirks(struct cpuinfo_x86 *c)
}
}

+static void __init mce_cpu_features(struct cpuinfo_x86 *c)
+{
+ switch (c->x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_feature_init(c);
+ break;
+ default:
+ break;
+ }
+}
+
/*
* Called for each booted CPU to set up machine checks.
* Must be called with preempt off.
@@ -321,6 +332,7 @@ void __init mcheck_init(struct cpuinfo_x86 *c)
return;

mce_init(NULL);
+ mce_cpu_features(c);
}

/*
diff --git a/arch/x86_64/kernel/mce_intel.c b/arch/x86_64/kernel/mce_intel.c
new file mode 100644
index 000000000000..4db9a640069f
--- /dev/null
+++ b/arch/x86_64/kernel/mce_intel.c
@@ -0,0 +1,99 @@
+/*
+ * Intel specific MCE features.
+ * Copyright 2004 Zwane Mwaikambo <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <asm/processor.h>
+#include <asm/msr.h>
+#include <asm/mce.h>
+#include <asm/hw_irq.h>
+
+static DEFINE_PER_CPU(unsigned long, next_check);
+
+asmlinkage void smp_thermal_interrupt(void)
+{
+ struct mce m;
+
+ ack_APIC_irq();
+
+ irq_enter();
+ if (time_before(jiffies, __get_cpu_var(next_check)))
+ goto done;
+
+ __get_cpu_var(next_check) = jiffies + HZ*300;
+ memset(&m, 0, sizeof(m));
+ m.cpu = smp_processor_id();
+ m.bank = MCE_THERMAL_BANK;
+ rdtscll(m.tsc);
+ rdmsrl(MSR_IA32_THERM_STATUS, m.status);
+ if (m.status & 0x1) {
+ printk(KERN_EMERG
+ "CPU%d: Temperature above threshold, cpu clock throttled\n", m.cpu);
+ add_taint(TAINT_MACHINE_CHECK);
+ } else {
+ printk(KERN_EMERG "CPU%d: Temperature/speed normal\n", m.cpu);
+ }
+
+ mce_log(&m);
+done:
+ irq_exit();
+}
+
+static void __init intel_init_thermal(struct cpuinfo_x86 *c)
+{
+ u32 l, h;
+ int tm2 = 0;
+ unsigned int cpu = smp_processor_id();
+
+ if (!cpu_has(c, X86_FEATURE_ACPI))
+ return;
+
+ if (!cpu_has(c, X86_FEATURE_ACC))
+ return;
+
+ /* first check if TM1 is already enabled by the BIOS, in which
+ * case there might be some SMM goo which handles it, so we can't even
+ * put a handler since it might be delivered via SMI already.
+ */
+ rdmsr(MSR_IA32_MISC_ENABLE, l, h);
+ h = apic_read(APIC_LVTTHMR);
+ if ((l & (1 << 3)) && (h & APIC_DM_SMI)) {
+ printk(KERN_DEBUG
+ "CPU%d: Thermal monitoring handled by SMI\n", cpu);
+ return;
+ }
+
+ if (cpu_has(c, X86_FEATURE_TM2) && (l & (1 << 13)))
+ tm2 = 1;
+
+ if (h & APIC_VECTOR_MASK) {
+ printk(KERN_DEBUG
+ "CPU%d: Thermal LVT vector (%#x) already "
+ "installed\n", cpu, (h & APIC_VECTOR_MASK));
+ return;
+ }
+
+ h = THERMAL_APIC_VECTOR;
+ h |= (APIC_DM_FIXED | APIC_LVT_MASKED);
+ apic_write_around(APIC_LVTTHMR, h);
+
+ rdmsr(MSR_IA32_THERM_INTERRUPT, l, h);
+ wrmsr(MSR_IA32_THERM_INTERRUPT, l | 0x03, h);
+
+ rdmsr(MSR_IA32_MISC_ENABLE, l, h);
+ wrmsr(MSR_IA32_MISC_ENABLE, l | (1 << 3), h);
+
+ l = apic_read(APIC_LVTTHMR);
+ apic_write_around(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);
+ printk(KERN_INFO "CPU%d: Thermal monitoring enabled (%s)\n",
+ cpu, tm2 ? "TM2" : "TM1");
+ return;
+}
+
+void __init mce_intel_feature_init(struct cpuinfo_x86 *c)
+{
+ intel_init_thermal(c);
+}
diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c
index 50e9621b0273..3ebfc9117d2a 100644
--- a/arch/x86_64/kernel/traps.c
+++ b/arch/x86_64/kernel/traps.c
@@ -882,6 +882,10 @@ asmlinkage void do_spurious_interrupt_bug(struct pt_regs * regs)
{
}

+asmlinkage void __attribute__((weak)) smp_thermal_interrupt(void)
+{
+}
+
/*
* 'math_state_restore()' saves the current math information in the
* old math state array, and gets the new ones from the current task
diff --git a/include/asm-x86_64/mce.h b/include/asm-x86_64/mce.h
index 1c84fa8758c3..869249db6795 100644
--- a/include/asm-x86_64/mce.h
+++ b/include/asm-x86_64/mce.h
@@ -64,4 +64,17 @@ struct mce_log {
#define MCE_GET_LOG_LEN _IOR('M', 2, int)
#define MCE_GETCLEAR_FLAGS _IOR('M', 3, int)

+/* Software defined banks */
+#define MCE_EXTENDED_BANK 128
+#define MCE_THERMAL_BANK MCE_EXTENDED_BANK + 0
+
+void mce_log(struct mce *m);
+#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_feature_init(struct cpuinfo_x86 *c);
+#else
+static inline void mce_intel_feature_init(struct cpuinfo_x86 *c)
+{
+}
+#endif
+
#endif
--
2.0.0


--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-10-09 16:33:26

by Aravind Gopalakrishnan

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Thu, Oct 09, 2014 at 12:57:50AM +0200, Borislav Petkov wrote:
> On Wed, Oct 08, 2014 at 04:52:06PM -0500, Aravind Gopalakrishnan wrote:
> > I am not understanding why m.bank is assigned this value..
>
> That's a very good question, see below for some history.
>
> >
> > It only causes incorrect decoding-
> > [ 608.832916] DEBUG: raise_amd_threshold_event
> > [ 608.832926] [Hardware Error]: Corrected error, no action required.
> > [ 608.833143] [Hardware Error]: CPU:26 (15:2:0)
> > MC165_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
> > [ 608.833551] [Hardware Error]: MC165_ADDR: 0x0000000000000000
> > [ 608.833777] [Hardware Error]: cache level: RESV, tx: INSN
> > [ 608.834034] amd_inject module loaded ...
> >
> >
> > (Obviously, as in amd_decode_mce() we switch (m->bank) for decoding the
> > status and there is no bank 165)
> >
> > OTOH, if m.bank = bank;
> > Then we get correct decoding info-
>
> Yes, and I think we should do that only if we're using the *last* error
> to report the overflow with: we're reporting a thresholding counter
> overflow and the bank on which it was detected on should, of course, be
> part of the report.
>

How do you mean "last error"?
The interrupt is only fired upon overflow..

> The "funny" bank is some sort of a software defined banks thing which
> got added in 2005 (see the patch I dug out below) and it was supposed
> to be used (I'm guessing here) for reporting thermal events using MCA
> (dumb idea, if you ask me) so since thermal events don't really have
> a bank, they decided to have some sort of a software-defined MCA bank
> which doesn't correspond to any hardware bank.
>
> Then Jacob decided to use it for some reason too:
>
> 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
>
> maybe because thresholding errors don't have a bank associated with them
> but if I'm not missing anything, they do!
>

Right. The thresholding registers are nothing but _MISC(x) where x is a
bank value.

> Oh oh, ok, it just dawned on me! I think I know what it *might* have
> been: they wanted to report the overflowing with a special error
> signature which uses a software-defined bank. Ok, that actually makes
> sense: when you see an error for a sw-defined bank, you're reporting an
> thresholding counter overflow.
>
> Which means that we shouldn't be populating m.status either, i.e. what
> we did earlier:
>
> rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
>
> because this is a special error type.
>

How is it a "special error type"? It's still the same CE error that
we get notified with. Only difference being - now it's crossed a
specific 'threshold_limit'

So- I am not getting the rationale behind a S/W defined bank for reporting
this.

CE error if collected through polling gives proper decoding
info. So, why should this be any different for the same CE error for
which an interrupt is generated on crossing a threshold?

Thanks,
-Aravind

> Hmm, it is too late here to think straight, more tomorrow. But Aravind,
> that was a very good question, you actually made me dig into git history
> :-)
>
> Good night.
>

2014-10-09 17:35:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Thu, Oct 09, 2014 at 11:53:39AM -0500, Aravind Gopalakrishnan wrote:
> How do you mean "last error"?
> The interrupt is only fired upon overflow..

And? Think about it, what is causing the overflow? A CE, right?

There was even a call to machine_check_poll() there which we removed,
but for another reason. In any case, you should have the error signature
in the MCA banks of the last error causing the overflow, right? This is
what I mean with last error.

However(!),...

> CE error if collected through polling gives proper decoding info. So,
> why should this be any different for the same CE error for which an
> interrupt is generated on crossing a threshold?

... we're currently using a special signature to signal the overflow
with the K8_MCE_THRESHOLD_BASE thing. You simply report a special bank
and this way you can tell userspace that this is an overflow error. I
think that was the reason behind the software-defined banks.

Now, we can also drop that and simply log a normal error but make sure
MASK_OVERFLOW_HI is passed onto userspace so that it can see that the
error is an overflow error. I.e., something like this:

mce_setup(&m);
// rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus); - not sure about this one - we're not looking at MCGSTATUS for CEs
// rdmsrl(address, m.misc); - this MSR can be saved too as we're reading
// the MISC register already.
rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
m.bank = bank;
mce_log(&m);

so in the end it'll be something like this:

mce_setup(&m);
m.misc = (high << 32) | low;
rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
m.bank = bank;
mce_log(&m);

so I'm still on the fence about what we want to do and am expecting
arguments. I like the last one more because it is simpler and tools
don't need to know about the software-defined banks.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-10-09 19:01:23

by Aravind Gopalakrishnan

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On 10/9/2014 12:35 PM, Borislav Petkov wrote:
> On Thu, Oct 09, 2014 at 11:53:39AM -0500, Aravind Gopalakrishnan wrote:
>> How do you mean "last error"?
>> The interrupt is only fired upon overflow..
> And? Think about it, what is causing the overflow? A CE, right?
>
> There was even a call to machine_check_poll() there which we removed,
> but for another reason. In any case, you should have the error signature
> in the MCA banks of the last error causing the overflow, right?

Right. I was not arguing that we shouldn't. Just wasn't clear on what
you meant.
Anyway, Thanks for clarifying.

> This is
> what I mean with last error.
>
> However(!),...
>
>> CE error if collected through polling gives proper decoding info. So,
>> why should this be any different for the same CE error for which an
>> interrupt is generated on crossing a threshold?
> ... we're currently using a special signature to signal the overflow
> with the K8_MCE_THRESHOLD_BASE thing. You simply report a special bank
> and this way you can tell userspace that this is an overflow error. I
> think that was the reason behind the software-defined banks.
>
> Now, we can also drop that and simply log a normal error but make sure
> MASK_OVERFLOW_HI is passed onto userspace so that it can see that the
> error is an overflow error. I.e., something like this:
>
> mce_setup(&m);
> // rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus); - not sure about this one - we're not looking at MCGSTATUS for CEs
That's right. Might as well remove it.

> // rdmsrl(address, m.misc); - this MSR can be saved too as we're reading
> // the MISC register already.
> rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
> m.bank = bank;
> mce_log(&m);
>
> so in the end it'll be something like this:
>
> mce_setup(&m);
> m.misc = (high << 32) | low;
> rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
> m.bank = bank;
> mce_log(&m);
>
> so I'm still on the fence about what we want to do and am expecting
> arguments.

I actually agree with this approach. So no argument:)
> I like the last one more because it is simpler and tools
> don't need to know about the software-defined banks.
>

Thanks
-Aravind.

2014-10-21 20:29:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Thu, Oct 09, 2014 at 02:01:06PM -0500, Aravind Gopalakrishnan wrote:
> I actually agree with this approach. So no argument:)

Ok, thanks, here's a patch.

Btw, I'm pushing the whole queue to a ras-for-3.19 branch at
https://git.kernel.org/cgit/linux/kernel/git/bp/bp.git if you'd like to
take a look and see whether we haven't forgotten anything before I send
it to tip guys.

Thanks.

---
From: Borislav Petkov <[email protected]>
Subject: [PATCH] x86, MCE, AMD: Drop software-defined bank in error thresholding

Aravind had the good question about why we're assigning a
software-defined bank when reporting error thresholding errors instead
of simply using the bank which reports the last error causing the
overflow.

Digging through git history, it pointed to

95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")

which added that functionality. The problem with this, however, is that
tools don't know about software-defined banks and get puzzled. So drop
that K8_MCE_THRESHOLD_BASE and simply use the hw bank reporting the
thresholding interrupt.

Save us a couple of MSR reads while at it.

Reported-by: Aravind Gopalakrishnan <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/mce.h | 1 -
arch/x86/kernel/cpu/mcheck/mce_amd.c | 5 ++---
2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 958b90f761e5..276392f121fb 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -78,7 +78,6 @@
/* Software defined banks */
#define MCE_EXTENDED_BANK 128
#define MCE_THERMAL_BANK (MCE_EXTENDED_BANK + 0)
-#define K8_MCE_THRESHOLD_BASE (MCE_EXTENDED_BANK + 1)

#define MCE_LOG_LEN 32
#define MCE_LOG_SIGNATURE "MACHINECHECK"
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 9af7bd74828b..6606523ff1c1 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -318,10 +318,9 @@ static void amd_threshold_interrupt(void)

log:
mce_setup(&m);
- rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
- rdmsrl(address, m.misc);
rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
- m.bank = K8_MCE_THRESHOLD_BASE + bank * NR_BLOCKS + block;
+ m.misc = ((u64)high << 32) | low;
+ m.bank = bank;
mce_log(&m);

wrmsrl(MSR_IA32_MCx_STATUS(bank), 0);
--
2.0.0


--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-10-22 01:52:17

by Chen Yucong

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Tue, 2014-10-21 at 22:28 +0200, Borislav Petkov wrote:
> On Thu, Oct 09, 2014 at 02:01:06PM -0500, Aravind Gopalakrishnan wrote:
> > I actually agree with this approach. So no argument:)
>
> Ok, thanks, here's a patch.
>
> Btw, I'm pushing the whole queue to a ras-for-3.19 branch at
> https://git.kernel.org/cgit/linux/kernel/git/bp/bp.git if you'd like to
> take a look and see whether we haven't forgotten anything before I send
> it to tip guys.
>
Hi Boris,

Can you check the following link? The link contains my reply about
"x86, MCE, AMD: Move invariant code out from loop body". The reply
was sent to you on October 7, but until now, there aren't any comments
from you!

https://lkml.org/lkml/2014/10/7/84

Thanks!
cyc

2014-10-22 08:16:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Wed, Oct 22, 2014 at 09:51:18AM +0800, Chen Yucong wrote:
> Can you check the following link? The link contains my reply about
> "x86, MCE, AMD: Move invariant code out from loop body". The reply was
> sent to you on October 7, but until now, there aren't any comments
> from you!

https://git.kernel.org/cgit/linux/kernel/git/bp/bp.git/commit/?h=ras-for-3.19&id=69b957583580bf40624553c64d802fefb54199cb

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-10-22 08:54:15

by Chen Yucong

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Wed, 2014-10-22 at 10:16 +0200, Borislav Petkov wrote:
> On Wed, Oct 22, 2014 at 09:51:18AM +0800, Chen Yucong wrote:
> > Can you check the following link? The link contains my reply about
> > "x86, MCE, AMD: Move invariant code out from loop body". The reply was
> > sent to you on October 7, but until now, there aren't any comments
> > from you!
>
> https://git.kernel.org/cgit/linux/kernel/git/bp/bp.git/commit/?h=ras-for-3.19&id=69b957583580bf40624553c64d802fefb54199cb

I have checked this link! I mean that there is another reply that you
may not have noticed.

thx!
cyc

2014-10-22 09:31:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

Hi Aravind,

question: what's the story with MC?_MISC[IntP], is that bit still there?
Because I don't see it in my BKDGs here.

The background of the story is

https://lkml.org/lkml/2014/10/7/84

There's this thing we did at the time

f227d4306cf3 ("x86, MCE, AMD: Make APIC LVT thresholding interrupt optional")

which, AFAICR, is about some F15h versions having a counter but *not*
generating a thresholding interrupt. Can you confirm that is still
the case and we can have a counter but no interrupt gets generated on
overflow?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-10-29 16:00:05

by Aravind Gopalakrishnan

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On 10/22/2014 4:30 AM, Borislav Petkov wrote:
> Hi Aravind,
>
> question: what's the story with MC?_MISC[IntP], is that bit still there?
> Because I don't see it in my BKDGs here.

Yep, It exists.
Maybe you are referring to Fam15h M0h BKDG? I think the bit was
introduced only from F15h M30h onwards.
The bit does *not* exist for bank=4, But-
if (bank ==4)
return true;

takes care of that.

> The background of the story is
>
> https://lkml.org/lkml/2014/10/7/84
>
> There's this thing we did at the time
>
> f227d4306cf3 ("x86, MCE, AMD: Make APIC LVT thresholding interrupt optional")
>
> which, AFAICR, is about some F15h versions having a counter but *not*
> generating a thresholding interrupt. Can you confirm that is still
> the case and we can have a counter but no interrupt gets generated on
> overflow?
>

So yes, moving the assignment inside the if condition should work just fine.

I see the patch on your 'ras-for-3.19' branch does not have this, so
I'll make this modification
to the branch before I test it.

Thanks,
-Aravind.

2014-10-30 19:04:25

by Aravind Gopalakrishnan

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On 10/29/2014 10:59 AM, Aravind Gopalakrishnan wrote:
> On 10/22/2014 4:30 AM, Borislav Petkov wrote:
>> Hi Aravind,
>>
>> question: what's the story with MC?_MISC[IntP], is that bit still there?
>> Because I don't see it in my BKDGs here.
>
> Yep, It exists.
> Maybe you are referring to Fam15h M0h BKDG? I think the bit was
> introduced only from F15h M30h onwards.
> The bit does *not* exist for bank=4, But-
> if (bank ==4)
> return true;
>
> takes care of that.
>
>> The background of the story is
>>
>> https://lkml.org/lkml/2014/10/7/84
>>
>> There's this thing we did at the time
>>
>> f227d4306cf3 ("x86, MCE, AMD: Make APIC LVT thresholding interrupt
>> optional")
>>
>> which, AFAICR, is about some F15h versions having a counter but *not*
>> generating a thresholding interrupt. Can you confirm that is still
>> the case and we can have a counter but no interrupt gets generated on
>> overflow?
>>
>
> So yes, moving the assignment inside the if condition should work just
> fine.
>
> I see the patch on your 'ras-for-3.19' branch does not have this, so
> I'll make this modification
> to the branch before I test it.
>

Hi Boris,
I have tested the branch with this bit:

if (b.interrupt_capable) {
... ...
if (mce_threshold_vector != amd_threshold_interrupt)
mce_threshold_vector = amd_threshold_interrupt;
}

and it works fine.

Thanks,
-Aravind.

2014-10-30 21:40:04

by Borislav Petkov

[permalink] [raw]
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it

On Thu, Oct 30, 2014 at 02:04:17PM -0500, Aravind Gopalakrishnan wrote:
> Hi Boris,
> I have tested the branch with this bit:
>
> if (b.interrupt_capable) {
> ... ...
> if (mce_threshold_vector != amd_threshold_interrupt)
> mce_threshold_vector = amd_threshold_interrupt;
> }
>
> and it works fine.

Good, I'll adjust Chen's patch tomorrow then and send up.

Thanks a lot for testing!

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--