We saw a strange issue with local APIC timer. Some random CPU doesn't
receive any local APIC timer interrupt, which causes different issues.
The cpu uses TSC-Deadline mode for local APIC timer and APIC is in xAPIC
mode. When this happens, manually writing TSC_DEADLINE MSR can trigger
interrupt again and the system goes normal.
Currently we only see this issue in E5-2660 v2 and E5-2680 v2 CPU.
Compiler version seems mattering too, it's quite easy to reproduce the
issue with v4.7 gcc.
Since the local APIC timer interrupt number is 0, we either lose the
first interrupt or TSC_DEADLINE MSR isn't set correctly. After some
debugging, we believe it's the serialize issue described in Intel SDM.
In xAPIC mode, write to APIC LVTT and write to TSC_DEADLINE isn't
serialized. Debug shows read TSC_DEADLINE MSR followed the very first
MSR write returns 0 in the buggy cpu.
The patch uses the algorithm Intel SDM described. The issue only happens
in xAPIC mode, but it doesn't bother to check the APIC mode I guess.
Without this patch, we see the issue after ~5 reboots. With it, we
don't see it after 24hr reboot test.
Cc: Suresh Siddha <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected] v3.7+
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/kernel/apic/apic.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..b7890b3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
apic_write(APIC_LVTT, lvtt_value);
if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+ u64 msr;
+
+ /*
+ * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+ * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+ * This uses the algorithm described in Intel SDM to serialize
+ * the two writes
+ * */
+ while (1) {
+ wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
+ rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
+ if (msr)
+ break;
+ }
+ wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
+
printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
return;
}
--
1.8.5.6
On Fri, 31 Jul 2015, Shaohua Li wrote:
> @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> apic_write(APIC_LVTT, lvtt_value);
>
> if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> + u64 msr;
> +
> + /*
> + * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> + * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> + * This uses the algorithm described in Intel SDM to serialize
> + * the two writes
> + * */
> + while (1) {
> + wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> + rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> + if (msr)
> + break;
> + }
> + wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
I think this is exceptionally silly. A proper fence after the
apic_write() should have the same effect.
Thanks,
tglx
On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> On Fri, 31 Jul 2015, Shaohua Li wrote:
> > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > apic_write(APIC_LVTT, lvtt_value);
> >
> > if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > + u64 msr;
> > +
> > + /*
> > + * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > + * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > + * This uses the algorithm described in Intel SDM to serialize
> > + * the two writes
> > + * */
> > + while (1) {
> > + wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > + rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > + if (msr)
> > + break;
> > + }
> > + wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
>
>
> I think this is exceptionally silly. A proper fence after the
> apic_write() should have the same effect.
Not sure what happens in the hardware, I could have a try of fence, but
I'd prefer using the algorithm Intel described. This is not a fast path,
the loop will exit immediately regardless the issue occurs anyway.
Thanks,
Shaohua
On Sun, 2 Aug 2015, Shaohua Li wrote:
> On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > apic_write(APIC_LVTT, lvtt_value);
> > >
> > > if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > + u64 msr;
> > > +
> > > + /*
> > > + * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > + * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > + * This uses the algorithm described in Intel SDM to serialize
> > > + * the two writes
> > > + * */
> > > + while (1) {
> > > + wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > + rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > + if (msr)
> > > + break;
> > > + }
> > > + wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> >
> >
> > I think this is exceptionally silly. A proper fence after the
> > apic_write() should have the same effect.
>
> Not sure what happens in the hardware, I could have a try of fence, but
> I'd prefer using the algorithm Intel described. This is not a fast path,
s/algorithm/voodoo/
> the loop will exit immediately regardless the issue occurs anyway.
Well, the SDM also says:
"To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the
APIC registers. Thus, system software should not use “WRMSR to APIC
registers in x2APIC mode” as a serializing instruction. Read and write
accesses to the APIC registers will occur in program order. A WRMSR to
an APIC register may complete before all preceding stores are globally
visible; software can prevent this by inserting a serializing
instruction, an SFENCE, or an MFENCE before the WRMSR."
And that's what happens here. The write to the LVT has not yet hit the
APIC, so the WRMSR has no effect.
Thanks,
tglx
On Sun, Aug 02, 2015 at 09:41:08PM +0200, Thomas Gleixner wrote:
> On Sun, 2 Aug 2015, Shaohua Li wrote:
>
> > On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > > apic_write(APIC_LVTT, lvtt_value);
> > > >
> > > > if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > > + u64 msr;
> > > > +
> > > > + /*
> > > > + * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > > + * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > > + * This uses the algorithm described in Intel SDM to serialize
> > > > + * the two writes
> > > > + * */
> > > > + while (1) {
> > > > + wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > > + rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > > + if (msr)
> > > > + break;
> > > > + }
> > > > + wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > >
> > >
> > > I think this is exceptionally silly. A proper fence after the
> > > apic_write() should have the same effect.
> >
> > Not sure what happens in the hardware, I could have a try of fence, but
> > I'd prefer using the algorithm Intel described. This is not a fast path,
>
> s/algorithm/voodoo/
>
> > the loop will exit immediately regardless the issue occurs anyway.
>
> Well, the SDM also says:
>
> "To allow for efficient access to the APIC registers in x2APIC mode,
> the serializing semantics of WRMSR are relaxed when writing to the
> APIC registers. Thus, system software should not use “WRMSR to APIC
> registers in x2APIC mode” as a serializing instruction. Read and write
> accesses to the APIC registers will occur in program order. A WRMSR to
> an APIC register may complete before all preceding stores are globally
> visible; software can prevent this by inserting a serializing
> instruction, an SFENCE, or an MFENCE before the WRMSR."
>
> And that's what happens here. The write to the LVT has not yet hit the
> APIC, so the WRMSR has no effect.
What you quoted is for x2APIC, I didn't see similar description for
xAPIC.
Tested mfence here, it does work. But I'm not convinced it's the right
thing. the xAPIC access is memory mapped IO, mfence is nothing related
to it. Anyway, cc-ed more intel people, hope they can share some
insights.
Thanks,
Shaohua
* Shaohua Li <[email protected]> wrote:
> On Sun, Aug 02, 2015 at 09:41:08PM +0200, Thomas Gleixner wrote:
> > On Sun, 2 Aug 2015, Shaohua Li wrote:
> >
> > > On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > > > apic_write(APIC_LVTT, lvtt_value);
> > > > >
> > > > > if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > > > + u64 msr;
> > > > > +
> > > > > + /*
> > > > > + * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > > > + * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > > > + * This uses the algorithm described in Intel SDM to serialize
> > > > > + * the two writes
> > > > > + * */
> > > > > + while (1) {
> > > > > + wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > > > + rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > > > + if (msr)
> > > > > + break;
> > > > > + }
> > > > > + wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > > >
> > > >
> > > > I think this is exceptionally silly. A proper fence after the
> > > > apic_write() should have the same effect.
> > >
> > > Not sure what happens in the hardware, I could have a try of fence, but
> > > I'd prefer using the algorithm Intel described. This is not a fast path,
> >
> > s/algorithm/voodoo/
> >
> > > the loop will exit immediately regardless the issue occurs anyway.
> >
> > Well, the SDM also says:
> >
> > "To allow for efficient access to the APIC registers in x2APIC mode,
> > the serializing semantics of WRMSR are relaxed when writing to the
> > APIC registers. Thus, system software should not use “WRMSR to APIC
> > registers in x2APIC mode” as a serializing instruction. Read and write
> > accesses to the APIC registers will occur in program order. A WRMSR to
> > an APIC register may complete before all preceding stores are globally
> > visible; software can prevent this by inserting a serializing
> > instruction, an SFENCE, or an MFENCE before the WRMSR."
> >
> > And that's what happens here. The write to the LVT has not yet hit the
> > APIC, so the WRMSR has no effect.
>
> What you quoted is for x2APIC, I didn't see similar description for
> xAPIC.
>
> Tested mfence here, it does work. But I'm not convinced it's the right thing.
> the xAPIC access is memory mapped IO, mfence is nothing related to it. [...]
Huh?
Can you cite the SDM that says that MFENCE will have no effect on instructions
that deal with accesses that are not to system RAM (i.e. are memory mapped IO)?
Thanks,
Ingo
On Wed, Aug 05, 2015 at 10:44:24AM +0200, Ingo Molnar wrote:
>
> * Shaohua Li <[email protected]> wrote:
>
> > On Sun, Aug 02, 2015 at 09:41:08PM +0200, Thomas Gleixner wrote:
> > > On Sun, 2 Aug 2015, Shaohua Li wrote:
> > >
> > > > On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > > > > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > > > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > > > > apic_write(APIC_LVTT, lvtt_value);
> > > > > >
> > > > > > if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > > > > + u64 msr;
> > > > > > +
> > > > > > + /*
> > > > > > + * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > > > > + * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > > > > + * This uses the algorithm described in Intel SDM to serialize
> > > > > > + * the two writes
> > > > > > + * */
> > > > > > + while (1) {
> > > > > > + wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > > > > + rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > > > > + if (msr)
> > > > > > + break;
> > > > > > + }
> > > > > > + wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > > > >
> > > > >
> > > > > I think this is exceptionally silly. A proper fence after the
> > > > > apic_write() should have the same effect.
> > > >
> > > > Not sure what happens in the hardware, I could have a try of fence, but
> > > > I'd prefer using the algorithm Intel described. This is not a fast path,
> > >
> > > s/algorithm/voodoo/
> > >
> > > > the loop will exit immediately regardless the issue occurs anyway.
> > >
> > > Well, the SDM also says:
> > >
> > > "To allow for efficient access to the APIC registers in x2APIC mode,
> > > the serializing semantics of WRMSR are relaxed when writing to the
> > > APIC registers. Thus, system software should not use “WRMSR to APIC
> > > registers in x2APIC mode” as a serializing instruction. Read and write
> > > accesses to the APIC registers will occur in program order. A WRMSR to
> > > an APIC register may complete before all preceding stores are globally
> > > visible; software can prevent this by inserting a serializing
> > > instruction, an SFENCE, or an MFENCE before the WRMSR."
> > >
> > > And that's what happens here. The write to the LVT has not yet hit the
> > > APIC, so the WRMSR has no effect.
> >
> > What you quoted is for x2APIC, I didn't see similar description for
> > xAPIC.
> >
> > Tested mfence here, it does work. But I'm not convinced it's the right thing.
> > the xAPIC access is memory mapped IO, mfence is nothing related to it. [...]
>
> Huh?
>
> Can you cite the SDM that says that MFENCE will have no effect on instructions
> that deal with accesses that are not to system RAM (i.e. are memory mapped IO)?
Hmm, I didn't mean mfence can't serialize the instructions. For a true
IO, a serialization can't guarantee device finishes the IO, we generally
read some safe IO registers to wait IO finish. I completely don't know
if this case fits here though.
Thanks,
Shaohua