2016-11-29 18:44:24

by Prarit Bhargava

[permalink] [raw]
Subject: [PATCH] ACPI / APEI: Fix NMI notification handling

When removing and adding cpu 0 on a system with GHES NMI the following stack
trace is seen when re-adding the cpu:

WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+
Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc5+ #59
Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.01.00.0
ffffffff81c03e78 ffffffff81337905 0000000000000000 0000000000000000
ffffffff81c03eb8 ffffffff8107d9c1 00000545810aac4a 0000000000000000
00000000000000f0 0000000000000000 000081cb6440f1d0 0000000000000001
Call Trace:
[<ffffffff81337905>] dump_stack+0x63/0x8e
[<ffffffff8107d9c1>] __warn+0xd1/0xf0
[<ffffffff8107daad>] warn_slowpath_null+0x1d/0x20
[<ffffffff810522b5>] setup_local_APIC+0x275/0x370
[<ffffffff810523be>] apic_ap_setup+0xe/0x20
[<ffffffff8104f5a8>] start_secondary+0x48/0x180
[<ffffffff81d89aa0>] ? set_init_arg+0x55/0x55
[<ffffffff81d89120>] ? early_idt_handler_array+0x120/0x120
[<ffffffff81d895d6>] ? x86_64_start_reservations+0x2a/0x2c
[<ffffffff81d89715>] ? x86_64_start_kernel+0x13d/0x14c
---[ end trace 7b6555b6343ef9ee ]---

During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an
NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the
ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR
(0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also
0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms
that something has set the IRQ_WORK_VECTOR line prior to the APIC being
initialized.

Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
incorrectly modified the behavior such that the handler returns
NMI_HANDLED only if an error was processed, and incorrectly runs the ghes
work queue for every NMI.

This patch modifies the ghes_proc_irq_work() to run as it did prior to
2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by
properly returning NMI_HANDLED and only calling the work queue if
NMI_HANDLED has been set.

Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
Signed-off-by: Prarit Bhargava <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Tyler Baicar <[email protected]>
Cc: Punit Agrawal <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: [email protected]
---
drivers/acpi/apei/ghes.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 0d099a24f776..39c45efbcb3d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -858,17 +858,18 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
if (sev >= GHES_SEV_PANIC)
__ghes_panic(ghes);

+ ret = NMI_HANDLED;
+
if (!(ghes->flags & GHES_TO_CLEAR))
continue;

__process_error(ghes);
ghes_clear_estatus(ghes);
-
- ret = NMI_HANDLED;
}

#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
- irq_work_queue(&ghes_proc_irq_work);
+ if (ret == NMI_HANDLED)
+ irq_work_queue(&ghes_proc_irq_work);
#endif
atomic_dec(&ghes_in_nmi);
return ret;
--
1.7.9.3


2016-11-29 19:54:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] ACPI / APEI: Fix NMI notification handling

On Tue, Nov 29, 2016 at 01:43:59PM -0500, Prarit Bhargava wrote:
> When removing and adding cpu 0 on a system with GHES NMI the following stack
> trace is seen when re-adding the cpu:
>
> WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+
> Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc5+ #59
> Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.01.00.0
> ffffffff81c03e78 ffffffff81337905 0000000000000000 0000000000000000
> ffffffff81c03eb8 ffffffff8107d9c1 00000545810aac4a 0000000000000000
> 00000000000000f0 0000000000000000 000081cb6440f1d0 0000000000000001
> Call Trace:
> [<ffffffff81337905>] dump_stack+0x63/0x8e
> [<ffffffff8107d9c1>] __warn+0xd1/0xf0
> [<ffffffff8107daad>] warn_slowpath_null+0x1d/0x20
> [<ffffffff810522b5>] setup_local_APIC+0x275/0x370
> [<ffffffff810523be>] apic_ap_setup+0xe/0x20
> [<ffffffff8104f5a8>] start_secondary+0x48/0x180
> [<ffffffff81d89aa0>] ? set_init_arg+0x55/0x55
> [<ffffffff81d89120>] ? early_idt_handler_array+0x120/0x120
> [<ffffffff81d895d6>] ? x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81d89715>] ? x86_64_start_kernel+0x13d/0x14c
> ---[ end trace 7b6555b6343ef9ee ]---

Please remove all hex numbers from the splat - they're useless in the
commit message.

> During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an
> NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the
> ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR
> (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also
> 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms
> that something has set the IRQ_WORK_VECTOR line prior to the APIC being
> initialized.
>
> Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
> incorrectly modified the behavior such that the handler returns
> NMI_HANDLED only if an error was processed, and incorrectly runs the ghes
> work queue for every NMI.
>
> This patch modifies the ghes_proc_irq_work() to run as it did prior to
> 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by
> properly returning NMI_HANDLED and only calling the work queue if
> NMI_HANDLED has been set.
>
> Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
> Signed-off-by: Prarit Bhargava <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Paul Gortmaker <[email protected]>
> Cc: Tyler Baicar <[email protected]>
> Cc: Punit Agrawal <[email protected]>
> Cc: Don Zickus <[email protected]>
> Cc: [email protected]
> ---
> drivers/acpi/apei/ghes.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 0d099a24f776..39c45efbcb3d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -858,17 +858,18 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> if (sev >= GHES_SEV_PANIC)
> __ghes_panic(ghes);
>
> + ret = NMI_HANDLED;
> +

Make that more explicit:

if (ghes_read_estatus(ghes, 1)) {
ghes_clear_estatus(ghes);
continue;
} else {
ret = NMI_HANDLED;
}


> if (!(ghes->flags & GHES_TO_CLEAR))
> continue;
>
> __process_error(ghes);
> ghes_clear_estatus(ghes);
> -
> - ret = NMI_HANDLED;
> }
>
> #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
> - irq_work_queue(&ghes_proc_irq_work);
> + if (ret == NMI_HANDLED)
> + irq_work_queue(&ghes_proc_irq_work);
> #endif
> atomic_dec(&ghes_in_nmi);
> return ret;
> --

Otherwise looks ok,
thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2016-11-30 13:19:55

by Prarit Bhargava

[permalink] [raw]
Subject: [PATCH v2] ACPI / APEI: Fix NMI notification handling

When removing and adding cpu 0 on a system with GHES NMI the following stack
trace is seen when re-adding the cpu:

WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+
Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc6+ #2
Call Trace:
dump_stack+0x63/0x8e
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
setup_local_APIC+0x275/0x370
apic_ap_setup+0xe/0x20
start_secondary+0x48/0x180
set_init_arg+0x55/0x55
early_idt_handler_array+0x120/0x120
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x13d/0x14c

During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an
NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the
ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR
(0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also
0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms
that something has set the IRQ_WORK_VECTOR line prior to the APIC being
initialized.

Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
incorrectly modified the behavior such that the handler returns
NMI_HANDLED only if an error was processed, and incorrectly runs the ghes
work queue for every NMI.

This patch modifies the ghes_proc_irq_work() to run as it did prior to
2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by
properly returning NMI_HANDLED and only calling the work queue if
NMI_HANDLED has been set.

v2: Borislav, setting of NMI_HANDLED moved & cleaned up changelog.

Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
Signed-off-by: Prarit Bhargava <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Tyler Baicar <[email protected]>
Cc: Punit Agrawal <[email protected]>
Cc: Don Zickus <[email protected]>
---
drivers/acpi/apei/ghes.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 0d099a24f776..e53bef6cf53c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -852,6 +852,8 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
if (ghes_read_estatus(ghes, 1)) {
ghes_clear_estatus(ghes);
continue;
+ } else {
+ ret = NMI_HANDLED;
}

sev = ghes_severity(ghes->estatus->error_severity);
@@ -863,12 +865,11 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)

__process_error(ghes);
ghes_clear_estatus(ghes);
-
- ret = NMI_HANDLED;
}

#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
- irq_work_queue(&ghes_proc_irq_work);
+ if (ret == NMI_HANDLED)
+ irq_work_queue(&ghes_proc_irq_work);
#endif
atomic_dec(&ghes_in_nmi);
return ret;
--
1.7.9.3

2016-12-01 20:07:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

On Wed, Nov 30, 2016 at 08:19:39AM -0500, Prarit Bhargava wrote:
> When removing and adding cpu 0 on a system with GHES NMI the following stack
> trace is seen when re-adding the cpu:
>
> WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+
> Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc6+ #2
> Call Trace:
> dump_stack+0x63/0x8e
> __warn+0xd1/0xf0
> warn_slowpath_null+0x1d/0x20
> setup_local_APIC+0x275/0x370
> apic_ap_setup+0xe/0x20
> start_secondary+0x48/0x180
> set_init_arg+0x55/0x55
> early_idt_handler_array+0x120/0x120
> x86_64_start_reservations+0x2a/0x2c
> x86_64_start_kernel+0x13d/0x14c
>
> During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an
> NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the
> ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR
> (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also
> 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms
> that something has set the IRQ_WORK_VECTOR line prior to the APIC being
> initialized.
>
> Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
> incorrectly modified the behavior such that the handler returns
> NMI_HANDLED only if an error was processed, and incorrectly runs the ghes
> work queue for every NMI.
>
> This patch modifies the ghes_proc_irq_work() to run as it did prior to
> 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by
> properly returning NMI_HANDLED and only calling the work queue if
> NMI_HANDLED has been set.
>
> v2: Borislav, setting of NMI_HANDLED moved & cleaned up changelog.
>
> Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
> Signed-off-by: Prarit Bhargava <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Paul Gortmaker <[email protected]>
> Cc: Tyler Baicar <[email protected]>
> Cc: Punit Agrawal <[email protected]>
> Cc: Don Zickus <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 21:21:59

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

On Thursday, December 01, 2016 09:07:39 PM Borislav Petkov wrote:
> On Wed, Nov 30, 2016 at 08:19:39AM -0500, Prarit Bhargava wrote:
> > When removing and adding cpu 0 on a system with GHES NMI the following stack
> > trace is seen when re-adding the cpu:
> >
> > WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+
> > Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra
> > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc6+ #2
> > Call Trace:
> > dump_stack+0x63/0x8e
> > __warn+0xd1/0xf0
> > warn_slowpath_null+0x1d/0x20
> > setup_local_APIC+0x275/0x370
> > apic_ap_setup+0xe/0x20
> > start_secondary+0x48/0x180
> > set_init_arg+0x55/0x55
> > early_idt_handler_array+0x120/0x120
> > x86_64_start_reservations+0x2a/0x2c
> > x86_64_start_kernel+0x13d/0x14c
> >
> > During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an
> > NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi() runs the
> > ghes_proc_irq_work work queue which ends up setting IRQ_WORK_VECTOR
> > (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also
> > 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000) which confirms
> > that something has set the IRQ_WORK_VECTOR line prior to the APIC being
> > initialized.
> >
> > Commit 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
> > incorrectly modified the behavior such that the handler returns
> > NMI_HANDLED only if an error was processed, and incorrectly runs the ghes
> > work queue for every NMI.
> >
> > This patch modifies the ghes_proc_irq_work() to run as it did prior to
> > 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler") by
> > properly returning NMI_HANDLED and only calling the work queue if
> > NMI_HANDLED has been set.
> >
> > v2: Borislav, setting of NMI_HANDLED moved & cleaned up changelog.
> >
> > Fixes: 2383844d4850 ("GHES: Elliminate double-loop in the NMI handler")
> > Signed-off-by: Prarit Bhargava <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Paul Gortmaker <[email protected]>
> > Cc: Tyler Baicar <[email protected]>
> > Cc: Punit Agrawal <[email protected]>
> > Cc: Don Zickus <[email protected]>
> > ---
> > drivers/acpi/apei/ghes.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
>
> Reviewed-by: Borislav Petkov <[email protected]>

I guess I should pick up this one, then?

Thanks,
Rafael

2016-12-01 22:37:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

On Thursday, December 01, 2016 10:51:49 PM Borislav Petkov wrote:
> On Thu, Dec 01, 2016 at 10:17:54PM +0100, Rafael J. Wysocki wrote:
> > I guess I should pick up this one, then?
>
> If you already have stuff touching this area, it probably would be more
> prudent if you took it. If not, I can take it through tip's ras branch,
> if you prefer.

Well, there's another ARM-related patch touching APEI.

I guess whoever takes this one should also take the other one and
honestly they can go in via any tree as far as I'm concerned, I'm just trying to
avoid merge clashes here. :-)

Thanks,
Rafael

2016-12-01 22:47:22

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote:
> Well, there's another ARM-related patch touching APEI.
>
> I guess whoever takes this one should also take the other one and
> honestly they can go in via any tree as far as I'm concerned, I'm just trying to
> avoid merge clashes here. :-)

Maybe have ARM-folk ACK the other one and take both through your ACPI
tree? They both do have ACPI in common :-)

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 23:15:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

On Thursday, December 01, 2016 11:47:17 PM Borislav Petkov wrote:
> On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote:
> > Well, there's another ARM-related patch touching APEI.
> >
> > I guess whoever takes this one should also take the other one and
> > honestly they can go in via any tree as far as I'm concerned, I'm just trying to
> > avoid merge clashes here. :-)
>
> Maybe have ARM-folk ACK the other one and take both through your ACPI
> tree? They both do have ACPI in common :-)

That one have been ACKed already.

OK, I'll take them both.

Thanks,
Rafael

2016-12-02 05:45:45

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

On 2016/12/2 7:12, Rafael J. Wysocki wrote:
> On Thursday, December 01, 2016 11:47:17 PM Borislav Petkov wrote:
>> On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote:
>>> Well, there's another ARM-related patch touching APEI.
>>>
>>> I guess whoever takes this one should also take the other one and
>>> honestly they can go in via any tree as far as I'm concerned, I'm just trying to
>>> avoid merge clashes here. :-)
>> Maybe have ARM-folk ACK the other one and take both through your ACPI
>> tree? They both do have ACPI in common :-)
> That one have been ACKed already.
>
> OK, I'll take them both.

Thank you very much :)

Hanjun

2016-12-02 11:39:34

by Fu Wei

[permalink] [raw]
Subject: Re: [PATCH v2] ACPI / APEI: Fix NMI notification handling

Hi Rafael,

On 2 December 2016 at 07:12, Rafael J. Wysocki <[email protected]> wrote:
> On Thursday, December 01, 2016 11:47:17 PM Borislav Petkov wrote:
>> On Thu, Dec 01, 2016 at 11:29:45PM +0100, Rafael J. Wysocki wrote:
>> > Well, there's another ARM-related patch touching APEI.
>> >
>> > I guess whoever takes this one should also take the other one and
>> > honestly they can go in via any tree as far as I'm concerned, I'm just trying to
>> > avoid merge clashes here. :-)
>>
>> Maybe have ARM-folk ACK the other one and take both through your ACPI
>> tree? They both do have ACPI in common :-)
>
> That one have been ACKed already.

I guess that is "[PATCH v15] acpi, apei, arm64: APEI initial support
for aarch64."

>
> OK, I'll take them both.

Great thanks, Rafael :-)

>
> Thanks,
> Rafael
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Best regards,

Fu Wei
Software Engineer
Red Hat