2011-04-14 06:41:10

by Youquan Song

[permalink] [raw]
Subject: [PATCH v4 1/2] apic: Fix error interrupt report at all APs

Recently, customer report that once machine boot, there are many error interrupt
reported with exact number of all APs.

The root cause is Local APIC will generate error interrupt when it detect
the illegal vector (one in 0 ~ 15) in an interrupt message received or
interrupt generate from local vector table or self IPI. SDM3A.chapter 10.

AP LAPIC thermal sensor register will be reset to 0x10000, if thermal throttling
interrupt take over by BIOS, it need restore AP with the thermal sensor register
value of geting from BSP, otherwise cause system issue. If BIOS does not take
over the thermal interrupt, The restore value will be CPU rest value of 0x10000,
which means the interrupt vector is zero. After writing 0x10000 to thermal
sensor LVT, the processor will recieve the error interrupt report if the APIC
error interrupt is also set.

This patch add check the BIOS whether take over the thermal interrupt by look
at interrupt delivery mode not fixed mode(BIOS handle will be SMI mode) before
restore AP's thermal LVT. So the agony noise of error interrupt will dismiss
when boot on machine that BIOS does not handle thermal interrupt..


Signed-off-by: Youquan Song <[email protected]>
Acked-by: Suresh Siddha <[email protected]>
Acked-by: Yong Wang <[email protected]>
---
arch/x86/include/asm/apicdef.h | 1 +
arch/x86/kernel/cpu/mcheck/therm_throt.c | 12 +++++++-----
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
index d87988b..34595d5 100644
--- a/arch/x86/include/asm/apicdef.h
+++ b/arch/x86/include/asm/apicdef.h
@@ -78,6 +78,7 @@
#define APIC_DEST_LOGICAL 0x00800
#define APIC_DEST_PHYSICAL 0x00000
#define APIC_DM_FIXED 0x00000
+#define APIC_DM_FIXED_MASK 0x00700
#define APIC_DM_LOWEST 0x00100
#define APIC_DM_SMI 0x00200
#define APIC_DM_REMRD 0x00300
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 6f8c5e9..22c212a 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -446,18 +446,20 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
*/
rdmsr(MSR_IA32_MISC_ENABLE, l, h);

+ h = lvtthmr_init;
/*
* The initial value of thermal LVT entries on all APs always reads
* 0x10000 because APs are woken up by BSP issuing INIT-SIPI-SIPI
* sequence to them and LVT registers are reset to 0s except for
* the mask bits which are set to 1s when APs receive INIT IPI.
- * Always restore the value that BIOS has programmed on AP based on
- * BSP's info we saved since BIOS is always setting the same value
- * for all threads/cores
+ * If BIOS take over the thermal interrupt and set its interrupt
+ * delivery mode to SMI not fixed, it restore the value that BIOS has
+ * programmed on AP based on BSP's info we saved since BIOS is always
+ * setting the same value for all threads/cores.
*/
- apic_write(APIC_LVTTHMR, lvtthmr_init);
+ if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
+ apic_write(APIC_LVTTHMR, lvtthmr_init);

- h = lvtthmr_init;

if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
printk(KERN_DEBUG
--
1.6.4.2


2011-04-14 06:41:27

by Youquan Song

[permalink] [raw]
Subject: [PATCH v4 2/2] apic: Add print error interrupt reason

End user worry about the error interrupt information and intend to know what
kind of error interrupts are generated, so this patch add printing out the
detail debug information of error interrupt.
dynamic debug is not initiated when LAPIC initiation, so the pr_debug will not
output the error interrupt debug information when boot.
In this patch, we use apic_printk(APIC_DEBUG,), so if add kernel option
apic=debug will output the error interrupt during boot.

Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Joe Perches <[email protected]>
---
arch/x86/kernel/apic/apic.c | 38 +++++++++++++++++++++++++-------------
1 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index fabf01e..3ddf4ef 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1813,6 +1813,17 @@ void smp_spurious_interrupt(struct pt_regs *regs)
void smp_error_interrupt(struct pt_regs *regs)
{
u32 v, v1;
+ u32 i = 0;
+ static const char * const error_interrupt_reason[] = {
+ "Send CS error", /* APIC Error Bit 0 */
+ "Receive CS error", /* APIC Error Bit 1 */
+ "Send accept error", /* APIC Error Bit 2 */
+ "Receive accept error", /* APIC Error Bit 3 */
+ "Redirectable IPI", /* APIC Error Bit 4 */
+ "Send illegal vector", /* APIC Error Bit 5 */
+ "Received illegal vector", /* APIC Error Bit 6 */
+ "Illegal register address", /* APIC Error Bit 7 */
+ };

exit_idle();
irq_enter();
@@ -1823,19 +1834,20 @@ void smp_error_interrupt(struct pt_regs *regs)
ack_APIC_irq();
atomic_inc(&irq_err_count);

- /*
- * Here is what the APIC error bits mean:
- * 0: Send CS error
- * 1: Receive CS error
- * 2: Send accept error
- * 3: Receive accept error
- * 4: Reserved
- * 5: Send illegal vector
- * 6: Received illegal vector
- * 7: Illegal register address
- */
- pr_debug("APIC error on CPU%d: %02x(%02x)\n",
- smp_processor_id(), v , v1);
+ apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",
+ smp_processor_id(), v , v1);
+
+ v1 = v1 & 0xff;
+ while (v1) {
+ if (v1 & 0x1)
+ apic_printk(APIC_DEBUG, KERN_CONT " : %s",
+ error_interrupt_reason[i]);
+ i++;
+ v1 >>= 1;
+ };
+
+ apic_printk(APIC_DEBUG, KERN_CONT "\n");
+
irq_exit();
}

--
1.6.4.2

2011-04-14 07:54:13

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] apic: Add print error interrupt reason

On Thu, Apr 14, 2011 at 10:36 AM, Youquan Song <[email protected]> wrote:
> End user worry about the error interrupt information and intend to know what
> kind of error interrupts are generated, so this patch add printing out the
> detail debug information of error interrupt.
> dynamic debug is not initiated when LAPIC initiation, so the pr_debug will not
> output the error interrupt debug information when boot.
> In this patch, we use apic_printk(APIC_DEBUG,), so if add kernel option
> apic=debug will output the error interrupt during boot.
>
> Signed-off-by: Youquan Song <[email protected]>
> Signed-off-by: Joe Perches <[email protected]>
> ---
> ?arch/x86/kernel/apic/apic.c | ? 38 +++++++++++++++++++++++++-------------
> ?1 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index fabf01e..3ddf4ef 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1813,6 +1813,17 @@ void smp_spurious_interrupt(struct pt_regs *regs)
> ?void smp_error_interrupt(struct pt_regs *regs)
> ?{
> ? ? ? ?u32 v, v1;
> + ? ? ? u32 i = 0;
> + ? ? ? static const char * const error_interrupt_reason[] = {
> + ? ? ? ? ? ? ? "Send CS error", ? ? ? ? ? ? ? ?/* APIC Error Bit 0 */
> + ? ? ? ? ? ? ? "Receive CS error", ? ? ? ? ? ? /* APIC Error Bit 1 */
> + ? ? ? ? ? ? ? "Send accept error", ? ? ? ? ? ?/* APIC Error Bit 2 */
> + ? ? ? ? ? ? ? "Receive accept error", ? ? ? ? /* APIC Error Bit 3 */
> + ? ? ? ? ? ? ? "Redirectable IPI", ? ? ? ? ? ? /* APIC Error Bit 4 */
> + ? ? ? ? ? ? ? "Send illegal vector", ? ? ? ? ?/* APIC Error Bit 5 */
> + ? ? ? ? ? ? ? "Received illegal vector", ? ? ?/* APIC Error Bit 6 */
> + ? ? ? ? ? ? ? "Illegal register address", ? ? /* APIC Error Bit 7 */
> + ? ? ? };
>
...
> - ? ? ? pr_debug("APIC error on CPU%d: %02x(%02x)\n",
> - ? ? ? ? ? ? ? smp_processor_id(), v , v1);
> + ? ? ? apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",
> + ? ? ? ? ? ? ? ? ? smp_processor_id(), v , v1);
> +
> + ? ? ? v1 = v1 & 0xff;
> + ? ? ? while (v1) {
> + ? ? ? ? ? ? ? if (v1 & 0x1)
> + ? ? ? ? ? ? ? ? ? ? ? apic_printk(APIC_DEBUG, KERN_CONT " : %s",
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? error_interrupt_reason[i]);
> + ? ? ? ? ? ? ? i++;
> + ? ? ? ? ? ? ? v1 >>= 1;
> + ? ? ? };
> +

Hi looks good but please add some array-checking in case if we get
some damaged result
from hardware, ie check for i not being out of error_interrupt_reason. Thanks!

2011-04-14 07:57:41

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] apic: Add print error interrupt reason

heh, drop my prev mail. Patch is good! thanks ans sorry for noise.

On Thursday, April 14, 2011, Cyrill Gorcunov <[email protected]> wrote:
> On Thu, Apr 14, 2011 at 10:36 AM, Youquan Song <[email protected]> wrote:
>> End user worry about the error interrupt information and intend to know what
>> kind of error interrupts are generated, so this patch add printing out the
>> detail debug information of error interrupt.
>> dynamic debug is not initiated when LAPIC initiation, so the pr_debug will not
>> output the error interrupt debug information when boot.
>> In this patch, we use apic_printk(APIC_DEBUG,), so if add kernel option
>> apic=debug will output the error interrupt during boot.
>>
>> Signed-off-by: Youquan Song <[email protected]>
>> Signed-off-by: Joe Perches <[email protected]>
>> ---
>> ?arch/x86/kernel/apic/apic.c | ? 38 +++++++++++++++++++++++++-------------
>> ?1 files changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index fabf01e..3ddf4ef 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -1813,6 +1813,17 @@ void smp_spurious_interrupt(struct pt_regs *regs)
>> ?void smp_error_interrupt(struct pt_regs *regs)
>> ?{
>> ? ? ? ?u32 v, v1;
>> + ? ? ? u32 i = 0;
>> + ? ? ? static const char * const error_interrupt_reason[] = {
>> + ? ? ? ? ? ? ? "Send CS error", ? ? ? ? ? ? ? ?/* APIC Error Bit 0 */
>> + ? ? ? ? ? ? ? "Receive CS error", ? ? ? ? ? ? /* APIC Error Bit 1 */
>> + ? ? ? ? ? ? ? "Send accept error", ? ? ? ? ? ?/* APIC Error Bit 2 */
>> + ? ? ? ? ? ? ? "Receive accept error", ? ? ? ? /* APIC Error Bit 3 */
>> + ? ? ? ? ? ? ? "Redirectable IPI", ? ? ? ? ? ? /* APIC Error Bit 4 */
>> + ? ? ? ? ? ? ? "Send illegal vector", ? ? ? ? ?/* APIC Error Bit 5 */
>> + ? ? ? ? ? ? ? "Received illegal vector", ? ? ?/* APIC Error Bit 6 */
>> + ? ? ? ? ? ? ? "Illegal register address", ? ? /* APIC Error Bit 7 */
>> + ? ? ? };
>>
> ...
>> - ? ? ? pr_debug("APIC error on CPU%d: %02x(%02x)\n",
>> - ? ? ? ? ? ? ? smp_processor_id(), v , v1);
>> + ? ? ? apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",
>> + ? ? ? ? ? ? ? ? ? smp_processor_id(), v , v1);
>> +
>> + ? ? ? v1 = v1 & 0xff;
>> + ? ? ? while (v1) {
>> + ? ? ? ? ? ? ? if (v1 & 0x1)
>> + ? ? ? ? ? ? ? ? ? ? ? apic_printk(APIC_DEBUG, KERN_CONT " : %s",
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? error_interrupt_reason[i]);
>> + ? ? ? ? ? ? ? i++;
>> + ? ? ? ? ? ? ? v1 >>= 1;
>> + ? ? ? };
>> +
>
> Hi looks good but please add some array-checking in case if we get
> some damaged result
> from hardware, ie check for i not being out of error_interrupt_reason. Thanks!
>

2011-04-19 17:01:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] apic: Fix error interrupt report at all APs


* Youquan Song <[email protected]> wrote:

> Recently, customer report that once machine boot, there are many error interrupt
> reported with exact number of all APs.
>
> The root cause is Local APIC will generate error interrupt when it detect
> the illegal vector (one in 0 ~ 15) in an interrupt message received or
> interrupt generate from local vector table or self IPI. SDM3A.chapter 10.
>
> AP LAPIC thermal sensor register will be reset to 0x10000, if thermal throttling
> interrupt take over by BIOS, it need restore AP with the thermal sensor register
> value of geting from BSP, otherwise cause system issue. If BIOS does not take
> over the thermal interrupt, The restore value will be CPU rest value of 0x10000,
> which means the interrupt vector is zero. After writing 0x10000 to thermal
> sensor LVT, the processor will recieve the error interrupt report if the APIC
> error interrupt is also set.
>
> This patch add check the BIOS whether take over the thermal interrupt by look
> at interrupt delivery mode not fixed mode(BIOS handle will be SMI mode) before
> restore AP's thermal LVT. So the agony noise of error interrupt will dismiss
> when boot on machine that BIOS does not handle thermal interrupt..
>
>
> Signed-off-by: Youquan Song <[email protected]>
> Acked-by: Suresh Siddha <[email protected]>
> Acked-by: Yong Wang <[email protected]>
> ---
> arch/x86/include/asm/apicdef.h | 1 +
> arch/x86/kernel/cpu/mcheck/therm_throt.c | 12 +++++++-----
> 2 files changed, 8 insertions(+), 5 deletions(-)

I don't disagree with this change, but unfortunately the changelog is in
absolutely unreadable English. Please fix it or find someone who can fix it for
you.

I decoded and fixed the changelog of the 2/2 patch of your series so no need to
do it for that patch.

Thanks,

Ingo

2011-04-19 17:49:13

by Youquan Song

[permalink] [raw]
Subject: [tip:x86/apic] x86, apic: Print verbose error interrupt reason on apic=debug

Commit-ID: 2b398bd9f8f73be706b41adcbb240ce95793049a
Gitweb: http://git.kernel.org/tip/2b398bd9f8f73be706b41adcbb240ce95793049a
Author: Youquan Song <[email protected]>
AuthorDate: Thu, 14 Apr 2011 14:36:08 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 19 Apr 2011 19:03:30 +0200

x86, apic: Print verbose error interrupt reason on apic=debug

End users worry about the error interrupt printout we generate
currently:

pr_debug("APIC error on CPU%d: %02x(%02x)\n",
smp_processor_id(), v , v1);

... and would like to know the reason why error interrupts are generated.

This patch prints out more detailed debug information.

Another practical problem is that dynamic debug is not initialized yet
when the APIC initializes, so the pr_debug() will not output the error
interrupt debug information on bootup. In this patch, we use
apic_printk(APIC_DEBUG, ...), so the apic=debug boot option will print
verbose error interupts during bootup.

Signed-off-by: Youquan Song <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/apic.c | 41 ++++++++++++++++++++++++++---------------
1 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index fabf01e..ae14712 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1812,30 +1812,41 @@ void smp_spurious_interrupt(struct pt_regs *regs)
*/
void smp_error_interrupt(struct pt_regs *regs)
{
- u32 v, v1;
+ u32 v0, v1;
+ u32 i = 0;
+ static const char * const error_interrupt_reason[] = {
+ "Send CS error", /* APIC Error Bit 0 */
+ "Receive CS error", /* APIC Error Bit 1 */
+ "Send accept error", /* APIC Error Bit 2 */
+ "Receive accept error", /* APIC Error Bit 3 */
+ "Redirectable IPI", /* APIC Error Bit 4 */
+ "Send illegal vector", /* APIC Error Bit 5 */
+ "Received illegal vector", /* APIC Error Bit 6 */
+ "Illegal register address", /* APIC Error Bit 7 */
+ };

exit_idle();
irq_enter();
/* First tickle the hardware, only then report what went on. -- REW */
- v = apic_read(APIC_ESR);
+ v0 = apic_read(APIC_ESR);
apic_write(APIC_ESR, 0);
v1 = apic_read(APIC_ESR);
ack_APIC_irq();
atomic_inc(&irq_err_count);

- /*
- * Here is what the APIC error bits mean:
- * 0: Send CS error
- * 1: Receive CS error
- * 2: Send accept error
- * 3: Receive accept error
- * 4: Reserved
- * 5: Send illegal vector
- * 6: Received illegal vector
- * 7: Illegal register address
- */
- pr_debug("APIC error on CPU%d: %02x(%02x)\n",
- smp_processor_id(), v , v1);
+ apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x(%02x)",
+ smp_processor_id(), v0 , v1);
+
+ v1 = v1 & 0xff;
+ while (v1) {
+ if (v1 & 0x1)
+ apic_printk(APIC_DEBUG, KERN_CONT " : %s", error_interrupt_reason[i]);
+ i++;
+ v1 >>= 1;
+ };
+
+ apic_printk(APIC_DEBUG, KERN_CONT "\n");
+
irq_exit();
}

2011-04-21 15:10:20

by Youquan Song

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] apic: Fix error interrupt report at all APs

> I don't disagree with this change, but unfortunately the changelog is in
> absolutely unreadable English. Please fix it or find someone who can fix it for
> you.
>
> I decoded and fixed the changelog of the 2/2 patch of your series so no need to
> do it for that patch.

Thanks a lot Ingo!

Here is the fixed changelog for 1/2 patch :Fix error interrupt report at all APs

This patch fixes a bug reported from customer, who found many unreasonable error
interrupts reported on all APs during the system boot stage.

According to Chapter 10 of Intel Software Developer Manual Volume 3A, Local APIC
may signal an illegal vector error when an LVT entry is set as an illegal
vector value (0~15) under FIXED delivery mode (bits 8-11 is 0), regardless of
whether the mask bit is set or an interrupt actually happen. These errors are
seen as error interrupts.

The initial value of thermal LVT entries on all APs always reads 0x10000 because
APs are woken up by BSP issuing INIT-SIPI-SIPI sequence to them and LVT
registers are reset to 0s except for the mask bits which are set to 1s when APs
receive INIT IPI. When BIOS take over the thermal throttling interrupt, LVT
thermal deliver mode should be SMI and it is required to restore AP's LVT
thermal monitor register.

This issue happens when BIOS do not take over thermal throttling interrupt,
AP's LVT thermal monitor register will be restored to 0x10000 which means vector
0 and fixed deliver mode, so all APs will signal illegal vector error
interrupt. This patch check if interrupt delivery mode is not fixed mode before
restore AP's LVT thermal monitor register.

-Youquan

2011-04-21 15:28:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] apic: Fix error interrupt report at all APs


* Youquan Song <[email protected]> wrote:

> > I don't disagree with this change, but unfortunately the changelog is in
> > absolutely unreadable English. Please fix it or find someone who can fix it for
> > you.
> >
> > I decoded and fixed the changelog of the 2/2 patch of your series so no need to
> > do it for that patch.
>
> Thanks a lot Ingo!
>
> Here is the fixed changelog for 1/2 patch :Fix error interrupt report at all APs

Mind resending the updated patch?

Thanks,

Ingo

2011-04-21 16:36:10

by Youquan Song

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] apic: Fix error interrupt report at all APs

> Mind resending the updated patch?

I have resended it.

Thanks
-Youquan