2018-02-09 12:13:50

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 0/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

A regression bug was introduced in below commit.
commit 522e66464467 ("x86/apic: Disable I/O APIC before shutdown of the local APIC")

It caused the action to fail which we try to restore boot irq mode
in reboot and kexec/kdump. Details can be seen in patch 0002.

Warning can always be seen during kdump kernel boot on qemu/kvm
platform. Our customer even saw casual kdump kernel hang once in
~30 attempts during stress testing of kdump on KVM machine.

This is v3 post, patches are rearranged and changed according to
Eric's suggestions.

Baoquan He (5):
x86/apic: Split out restore_boot_irq_mode from disable_IO_APIC
x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump
x86/apic: Remove useless disable_IO_APIC
x86/apic: Rename variable/function related to x86_io_apic_ops
x86/apic: Set up through-local-APIC on boot CPU if 'noapic' specified

arch/x86/include/asm/io_apic.h | 9 +++++----
arch/x86/include/asm/x86_init.h | 8 ++++----
arch/x86/kernel/apic/apic.c | 2 +-
arch/x86/kernel/apic/io_apic.c | 16 ++++------------
arch/x86/kernel/crash.c | 3 ++-
arch/x86/kernel/machine_kexec_32.c | 7 +++----
arch/x86/kernel/machine_kexec_64.c | 7 +++----
arch/x86/kernel/reboot.c | 3 ++-
arch/x86/kernel/x86_init.c | 6 +++---
arch/x86/xen/apic.c | 2 +-
drivers/iommu/irq_remapping.c | 4 ++--
11 files changed, 30 insertions(+), 37 deletions(-)

--
2.13.6



2018-02-09 12:11:28

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 1/5] x86/apic: Split out restore_boot_irq_mode from disable_IO_APIC

This is a preparation patch. Split out the code which restores boot
irq mode from disable_IO_APIC() and wrap into a new function
restore_boot_irq_mode(). No functional change.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/include/asm/io_apic.h | 1 +
arch/x86/kernel/apic/io_apic.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index a8834dd546cd..558d1a6a13ad 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
extern void setup_IO_APIC(void);
extern void enable_IO_APIC(void);
extern void disable_IO_APIC(void);
+extern void restore_boot_irq_mode(void);
extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
extern void print_IO_APICs(void);
#else /* !CONFIG_X86_IO_APIC */
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 8ad2e410974f..7b73b6b9b4b6 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1448,6 +1448,11 @@ void disable_IO_APIC(void)
*/
clear_IO_APIC();

+ restore_boot_irq_mode();
+}
+
+void restore_boot_irq_mode(void)
+{
if (!nr_legacy_irqs())
return;

--
2.13.6


2018-02-09 12:11:41

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 5/5] x86/apic: Set up through-local-APIC on boot CPU if 'noapic' specified

Currently kdump kernel becomes very slow if 'noapic' is specified.
Normal kernel won't.

Kernel parameter 'noapic' is used to disable IO-APIC in system for
testing or special purpose. Here the root cause is that in kdump
kernel LAPIC is disabled since commit 522e664644
("x86/apic: Disable I/O APIC before shutdown of the local APIC").
In this case We need set up through-local-APIC on boot CPU in
setup_local_APIC().

While In normal kernel the legacy irq mode is enabled in BIOS. If
it is virtual wire mode, the local-APIC has been enabled and set as
through-local-APIC.

Though we fix the regression introduced by criminal commit 522e664644,
for safety and clarity, better set up through-local-APIC explicitly,
but not rely on the default boot irq mode.

Do it now.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/kernel/apic/apic.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 25ddf02598d2..3fc259b4dd2d 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1570,7 +1570,7 @@ void setup_local_APIC(void)
* TODO: set up through-local-APIC from through-I/O-APIC? --macro
*/
value = apic_read(APIC_LVT0) & APIC_LVT_MASKED;
- if (!cpu && (pic_mode || !value)) {
+ if (!cpu && (pic_mode || !value || skip_ioapic_setup)) {
value = APIC_DM_EXTINT;
apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n", cpu);
} else {
--
2.13.6


2018-02-09 12:12:07

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 4/5] x86/apic: Rename variable/function related to x86_io_apic_ops

The names of x86_io_apic_ops and its two member variables, are
misleading. The .read member is to read IO_APIC reg, while .disable
which hook native_disable_io_apic/irq_remapping_disable_io_apic
is actually used to restore boot irq mode, not disable IO_APIC.

So rename x86_io_apic_ops to x86_apic_ops since it doesn't only
handle IO_APIC, also LAPIC.

And also rename its member variables and the related hooks.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/include/asm/io_apic.h | 6 +++---
arch/x86/include/asm/x86_init.h | 8 ++++----
arch/x86/kernel/apic/io_apic.c | 4 ++--
arch/x86/kernel/x86_init.c | 6 +++---
arch/x86/xen/apic.c | 2 +-
drivers/iommu/irq_remapping.c | 4 ++--
6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index 5e389145d808..06fec4426458 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -183,11 +183,11 @@ extern void disable_ioapic_support(void);

extern void __init io_apic_init_mappings(void);
extern unsigned int native_io_apic_read(unsigned int apic, unsigned int reg);
-extern void native_disable_io_apic(void);
+extern void native_restore_boot_irq_mode(void);

static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
{
- return x86_io_apic_ops.read(apic, reg);
+ return x86_apic_ops.io_apic_read(apic, reg);
}

extern void setup_IO_APIC(void);
@@ -229,7 +229,7 @@ static inline void mp_save_irq(struct mpc_intsrc *m) { }
static inline void disable_ioapic_support(void) { }
static inline void io_apic_init_mappings(void) { }
#define native_io_apic_read NULL
-#define native_disable_io_apic NULL
+#define native_restore_boot_irq_mode NULL

static inline void setup_IO_APIC(void) { }
static inline void enable_IO_APIC(void) { }
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index fc2f082ac635..88306054bd98 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -274,16 +274,16 @@ struct x86_msi_ops {
void (*restore_msi_irqs)(struct pci_dev *dev);
};

-struct x86_io_apic_ops {
- unsigned int (*read) (unsigned int apic, unsigned int reg);
- void (*disable)(void);
+struct x86_apic_ops {
+ unsigned int (*io_apic_read) (unsigned int apic, unsigned int reg);
+ void (*restore)(void);
};

extern struct x86_init_ops x86_init;
extern struct x86_cpuinit_ops x86_cpuinit;
extern struct x86_platform_ops x86_platform;
extern struct x86_msi_ops x86_msi;
-extern struct x86_io_apic_ops x86_io_apic_ops;
+extern struct x86_apic_ops x86_apic_ops;

extern void x86_early_init_platform_quirks(void);
extern void x86_init_noop(void);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 9d86b10c2121..68129f11e7db 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1410,7 +1410,7 @@ void __init enable_IO_APIC(void)
clear_IO_APIC();
}

-void native_disable_io_apic(void)
+void native_restore_boot_irq_mode(void)
{
/*
* If the i8259 is routed through an IOAPIC
@@ -1443,7 +1443,7 @@ void restore_boot_irq_mode(void)
if (!nr_legacy_irqs())
return;

- x86_io_apic_ops.disable();
+ x86_apic_ops.restore();
}

#ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 1151ccd72ce9..2bccd03bd654 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -146,7 +146,7 @@ void arch_restore_msi_irqs(struct pci_dev *dev)
}
#endif

-struct x86_io_apic_ops x86_io_apic_ops __ro_after_init = {
- .read = native_io_apic_read,
- .disable = native_disable_io_apic,
+struct x86_apic_ops x86_apic_ops __ro_after_init = {
+ .io_apic_read = native_io_apic_read,
+ .restore = native_restore_boot_irq_mode,
};
diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
index de58533d3664..2163888497d3 100644
--- a/arch/x86/xen/apic.c
+++ b/arch/x86/xen/apic.c
@@ -215,7 +215,7 @@ static void __init xen_apic_check(void)
}
void __init xen_init_apic(void)
{
- x86_io_apic_ops.read = xen_io_apic_read;
+ x86_apic_ops.io_apic_read = xen_io_apic_read;
/* On PV guests the APIC CPUID bit is disabled so none of the
* routines end up executing. */
if (!xen_initial_domain())
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 49721b4e1975..496deee3ae3a 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -27,7 +27,7 @@ int disable_irq_post = 0;
static int disable_irq_remap;
static struct irq_remap_ops *remap_ops;

-static void irq_remapping_disable_io_apic(void)
+static void irq_remapping_restore_boot_irq_mode(void)
{
/*
* With interrupt-remapping, for now we will use virtual wire A
@@ -42,7 +42,7 @@ static void irq_remapping_disable_io_apic(void)

static void __init irq_remapping_modify_x86_ops(void)
{
- x86_io_apic_ops.disable = irq_remapping_disable_io_apic;
+ x86_apic_ops.restore = irq_remapping_restore_boot_irq_mode;
}

static __init int setup_nointremap(char *str)
--
2.13.6


2018-02-09 12:12:36

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

This is a regression fix.

Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
calling after disable_IO_APIC(). This introdued a regression. The
root cause is that disable_IO_APIC() not only clears IO_APIC, also
restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
after disable_IO_APIC() will disable LAPIC and ruin the possible
virtual wire mode setting which the code has been trying to do all
along.

The consequence is, in KVM guest kernel always prints warning as below
during kexec/kdump kernel boots up. That happened in setup_local_APIC()
since 'do { xxx } while (queued && max_loops > 0)' loop does not function
well any more if pending irq exists in APIC IRR after LAPIC is disabled.
And people even saw casual kdump kernel hang once in ~30 attempts during
stress testing of kdump on KVM machine.

[ 0.001000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1467 setup_local_APIC+0x228/0x330
[ 0.001000] Modules linked in:
[ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #3
[ 0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 0.001000] RIP: 0010:setup_local_APIC+0x228/0x330
[ 0.001000] RSP: 0000:ffffffffb6e03eb8 EFLAGS: 00010286
[ 0.001000] RAX: 0000009edb4c4d84 RBX: 0000000000000000 RCX: 00000000b099d800
[ 0.001000] RDX: 0000009e00000000 RSI: 0000000000000000 RDI: 0000000000000810
[ 0.001000] RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000001
[ 0.001000] R10: ffff98ce6a801c00 R11: 0761076d072f0776 R12: 0000000000000001
[ 0.001000] R13: 00000000000000f0 R14: 0000000000004000 R15: ffffffffffffc6ff
[ 0.001000] FS: 0000000000000000(0000) GS:ffff98ce6bc00000(0000) knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: 00000000ffffffff CR3: 0000000022209000 CR4: 00000000000406b0
[ 0.001000] Call Trace:
[ 0.001000] apic_bsp_setup+0x56/0x74
[ 0.001000] x86_late_time_init+0x11/0x16
[ 0.001000] start_kernel+0x3c9/0x486
[ 0.001000] secondary_startup_64+0xa5/0xb0
[ 0.001000] Code: 00 85 c9 74 2d 0f 31 c1 e1 0a 48 c1 e2 20 41 89 cf 4c 03 7c 24 08 48 09 d0 49 29 c7 4c 89 3c 24 48 83 3c 24 00 0f 8f 8f fe ff
ff <0f> ff e9 10 ff ff ff 48 83 2c 24 01 eb e7 48 83 c4 18 5b 5d 41
[ 0.001000] ---[ end trace b88e71b9a6ebebdd ]---
[ 0.001000] masked ExtINT on CPU#0

To fix this, just break down disable_IO_APIC(), then call
clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
and call restore_boot_irq_mode() to restore boot irq mode before
reboot or kexec/kdump jump.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/include/asm/io_apic.h | 1 +
arch/x86/kernel/apic/io_apic.c | 2 +-
arch/x86/kernel/crash.c | 3 ++-
arch/x86/kernel/machine_kexec_32.c | 2 +-
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/kernel/reboot.c | 3 ++-
6 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index 558d1a6a13ad..0fa95bfacb39 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
extern void setup_IO_APIC(void);
extern void enable_IO_APIC(void);
extern void disable_IO_APIC(void);
+extern void clear_IO_APIC(void);
extern void restore_boot_irq_mode(void);
extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
extern void print_IO_APICs(void);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 7b73b6b9b4b6..2d7cd2db77f5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
mpc_ioapic_id(apic), pin);
}

-static void clear_IO_APIC (void)
+void clear_IO_APIC (void)
{
int apic, pin;

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 10e74d4778a1..1f6680427ff0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
#ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
- disable_IO_APIC();
+ clear_IO_APIC();
#endif
lapic_shutdown();
+ restore_boot_irq_mode();
#ifdef CONFIG_HPET_TIMER
hpet_disable();
#endif
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index edfede768688..f78bb4432bfb 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -199,7 +199,7 @@ void machine_kexec(struct kimage *image)
* one form or other. kexec jump path also need
* one.
*/
- disable_IO_APIC();
+ restore_boot_irq_mode();
#endif
}

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 1f790cf9d38f..cb0c2d0a4c99 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
* one form or other. kexec jump path also need
* one.
*/
- disable_IO_APIC();
+ restore_boot_irq_mode();
#endif
}

diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 2126b9d27c34..725624b6c0c0 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -666,7 +666,7 @@ void native_machine_shutdown(void)
* Even without the erratum, it still makes sense to quiet IO APIC
* before disabling Local APIC.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
#endif

#ifdef CONFIG_SMP
@@ -680,6 +680,7 @@ void native_machine_shutdown(void)
#endif

lapic_shutdown();
+ restore_boot_irq_mode();

#ifdef CONFIG_HPET_TIMER
hpet_disable();
--
2.13.6


2018-02-09 12:12:40

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 3/5] x86/apic: Remove useless disable_IO_APIC

No one uses it anymore.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/include/asm/io_apic.h | 1 -
arch/x86/kernel/apic/io_apic.c | 13 -------------
arch/x86/kernel/machine_kexec_32.c | 5 ++---
arch/x86/kernel/machine_kexec_64.c | 5 ++---
4 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index 0fa95bfacb39..5e389145d808 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -192,7 +192,6 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)

extern void setup_IO_APIC(void);
extern void enable_IO_APIC(void);
-extern void disable_IO_APIC(void);
extern void clear_IO_APIC(void);
extern void restore_boot_irq_mode(void);
extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 2d7cd2db77f5..9d86b10c2121 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1438,19 +1438,6 @@ void native_disable_io_apic(void)
disconnect_bsp_APIC(ioapic_i8259.pin != -1);
}

-/*
- * Not an __init, needed by the reboot code
- */
-void disable_IO_APIC(void)
-{
- /*
- * Clear the IO-APIC before rebooting:
- */
- clear_IO_APIC();
-
- restore_boot_irq_mode();
-}
-
void restore_boot_irq_mode(void)
{
if (!nr_legacy_irqs())
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index f78bb4432bfb..3172f22c1fc1 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -195,9 +195,8 @@ void machine_kexec(struct kimage *image)
/*
* We need to put APICs in legacy mode so that we can
* get timer interrupts in second kernel. kexec/kdump
- * paths already have calls to disable_IO_APIC() in
- * one form or other. kexec jump path also need
- * one.
+ * paths already have calls to restore_boot_irq_mode()
+ * in one form or other. kexec jump path also need one.
*/
restore_boot_irq_mode();
#endif
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index cb0c2d0a4c99..61a87e97aac6 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -293,9 +293,8 @@ void machine_kexec(struct kimage *image)
/*
* We need to put APICs in legacy mode so that we can
* get timer interrupts in second kernel. kexec/kdump
- * paths already have calls to disable_IO_APIC() in
- * one form or other. kexec jump path also need
- * one.
+ * paths already have calls to restore_boot_irq_mode()
+ * in one form or other. kexec jump path also need one.
*/
restore_boot_irq_mode();
#endif
--
2.13.6


2018-02-12 03:11:44

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Baoquan He <[email protected]> writes:

> This is a regression fix.
>
> Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
> I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
> calling after disable_IO_APIC(). This introdued a regression. The
> root cause is that disable_IO_APIC() not only clears IO_APIC, also
> restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
> after disable_IO_APIC() will disable LAPIC and ruin the possible
> virtual wire mode setting which the code has been trying to do all
> along.
>
> The consequence is, in KVM guest kernel always prints warning as below
> during kexec/kdump kernel boots up. That happened in setup_local_APIC()
> since 'do { xxx } while (queued && max_loops > 0)' loop does not function
> well any more if pending irq exists in APIC IRR after LAPIC is disabled.
> And people even saw casual kdump kernel hang once in ~30 attempts during
> stress testing of kdump on KVM machine.
>
> [ 0.001000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1467 setup_local_APIC+0x228/0x330
> [ 0.001000] Modules linked in:
> [ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #3
> [ 0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [ 0.001000] RIP: 0010:setup_local_APIC+0x228/0x330
> [ 0.001000] RSP: 0000:ffffffffb6e03eb8 EFLAGS: 00010286
> [ 0.001000] RAX: 0000009edb4c4d84 RBX: 0000000000000000 RCX: 00000000b099d800
> [ 0.001000] RDX: 0000009e00000000 RSI: 0000000000000000 RDI: 0000000000000810
> [ 0.001000] RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000001
> [ 0.001000] R10: ffff98ce6a801c00 R11: 0761076d072f0776 R12: 0000000000000001
> [ 0.001000] R13: 00000000000000f0 R14: 0000000000004000 R15: ffffffffffffc6ff
> [ 0.001000] FS: 0000000000000000(0000) GS:ffff98ce6bc00000(0000) knlGS:0000000000000000
> [ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.001000] CR2: 00000000ffffffff CR3: 0000000022209000 CR4: 00000000000406b0
> [ 0.001000] Call Trace:
> [ 0.001000] apic_bsp_setup+0x56/0x74
> [ 0.001000] x86_late_time_init+0x11/0x16
> [ 0.001000] start_kernel+0x3c9/0x486
> [ 0.001000] secondary_startup_64+0xa5/0xb0
> [ 0.001000] Code: 00 85 c9 74 2d 0f 31 c1 e1 0a 48 c1 e2 20 41 89 cf 4c 03 7c 24 08 48 09 d0 49 29 c7 4c 89 3c 24 48 83 3c 24 00 0f 8f 8f fe ff
> ff <0f> ff e9 10 ff ff ff 48 83 2c 24 01 eb e7 48 83 c4 18 5b 5d 41
> [ 0.001000] ---[ end trace b88e71b9a6ebebdd ]---
> [ 0.001000] masked ExtINT on CPU#0
>
> To fix this, just break down disable_IO_APIC(), then call
> clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
> and call restore_boot_irq_mode() to restore boot irq mode before
> reboot or kexec/kdump jump.

Two things here.
a) This is missing a fixes tag and a CC stable.
b) What makes your change to the KEXEC_JUMP code path safe?
Have the lapic and ioapic already been shut down?

The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
either need to be documented in the change long why they are safe
so that this change becomes obviously safe and correct.

Otherwise we risk and trivial and obvious looking change causing another
regression like changing the order of lapic_shutdown and disable_IOAPIC
did.

Eric


>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> arch/x86/include/asm/io_apic.h | 1 +
> arch/x86/kernel/apic/io_apic.c | 2 +-
> arch/x86/kernel/crash.c | 3 ++-
> arch/x86/kernel/machine_kexec_32.c | 2 +-
> arch/x86/kernel/machine_kexec_64.c | 2 +-
> arch/x86/kernel/reboot.c | 3 ++-
> 6 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
> index 558d1a6a13ad..0fa95bfacb39 100644
> --- a/arch/x86/include/asm/io_apic.h
> +++ b/arch/x86/include/asm/io_apic.h
> @@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
> extern void setup_IO_APIC(void);
> extern void enable_IO_APIC(void);
> extern void disable_IO_APIC(void);
> +extern void clear_IO_APIC(void);
> extern void restore_boot_irq_mode(void);
> extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
> extern void print_IO_APICs(void);
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 7b73b6b9b4b6..2d7cd2db77f5 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> mpc_ioapic_id(apic), pin);
> }
>
> -static void clear_IO_APIC (void)
> +void clear_IO_APIC (void)
> {
> int apic, pin;
>
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 10e74d4778a1..1f6680427ff0 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> #ifdef CONFIG_X86_IO_APIC
> /* Prevent crash_kexec() from deadlocking on ioapic_lock. */
> ioapic_zap_locks();
> - disable_IO_APIC();
> + clear_IO_APIC();
> #endif
> lapic_shutdown();
> + restore_boot_irq_mode();
> #ifdef CONFIG_HPET_TIMER
> hpet_disable();
> #endif
> diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
> index edfede768688..f78bb4432bfb 100644
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -199,7 +199,7 @@ void machine_kexec(struct kimage *image)
> * one form or other. kexec jump path also need
> * one.
> */
> - disable_IO_APIC();
> + restore_boot_irq_mode();
> #endif
> }
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 1f790cf9d38f..cb0c2d0a4c99 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
> * one form or other. kexec jump path also need
> * one.
> */
> - disable_IO_APIC();
> + restore_boot_irq_mode();
> #endif
> }
>
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 2126b9d27c34..725624b6c0c0 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -666,7 +666,7 @@ void native_machine_shutdown(void)
> * Even without the erratum, it still makes sense to quiet IO APIC
> * before disabling Local APIC.
> */
> - disable_IO_APIC();
> + clear_IO_APIC();
> #endif
>
> #ifdef CONFIG_SMP
> @@ -680,6 +680,7 @@ void native_machine_shutdown(void)
> #endif
>
> lapic_shutdown();
> + restore_boot_irq_mode();
>
> #ifdef CONFIG_HPET_TIMER
> hpet_disable();

2018-02-12 05:11:45

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Hi all,

One thing confused me.

The disconnect_bsp_APIC() may restore the interrupt delivery mode into
virtual wire mode. it uses the vector F as the spurious interrput, But,
IMO, using the vector 0xFF(SPURIOUS_APIC_VECTOR) may more suitable and
will give us more detail. Why the disconnect_bsp_APIC() use vector F
here?

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 25ddf02598d2..550deaad6a9a 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2130,7 +2130,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
value = apic_read(APIC_SPIV);
value &= ~APIC_VECTOR_MASK;
value |= APIC_SPIV_APIC_ENABLED;
- value |= 0xf;
+ value |= SPURIOUS_APIC_VECTOR;
apic_write(APIC_SPIV, value);

if (!virt_wire_setup) {

Thanks,
dou

At 02/09/2018 08:10 PM, Baoquan He wrote:
> A regression bug was introduced in below commit.
> commit 522e66464467 ("x86/apic: Disable I/O APIC before shutdown of the local APIC")
>
> It caused the action to fail which we try to restore boot irq mode
> in reboot and kexec/kdump. Details can be seen in patch 0002.
>
> Warning can always be seen during kdump kernel boot on qemu/kvm
> platform. Our customer even saw casual kdump kernel hang once in
> ~30 attempts during stress testing of kdump on KVM machine.
>
> This is v3 post, patches are rearranged and changed according to
> Eric's suggestions.
>
> Baoquan He (5):
> x86/apic: Split out restore_boot_irq_mode from disable_IO_APIC
> x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump
> x86/apic: Remove useless disable_IO_APIC
> x86/apic: Rename variable/function related to x86_io_apic_ops
> x86/apic: Set up through-local-APIC on boot CPU if 'noapic' specified
>
> arch/x86/include/asm/io_apic.h | 9 +++++----
> arch/x86/include/asm/x86_init.h | 8 ++++----
> arch/x86/kernel/apic/apic.c | 2 +-
> arch/x86/kernel/apic/io_apic.c | 16 ++++------------
> arch/x86/kernel/crash.c | 3 ++-
> arch/x86/kernel/machine_kexec_32.c | 7 +++----
> arch/x86/kernel/machine_kexec_64.c | 7 +++----
> arch/x86/kernel/reboot.c | 3 ++-
> arch/x86/kernel/x86_init.c | 6 +++---
> arch/x86/xen/apic.c | 2 +-
> drivers/iommu/irq_remapping.c | 4 ++--
> 11 files changed, 30 insertions(+), 37 deletions(-)
>



2018-02-12 06:02:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Dou Liyang <[email protected]> writes:

> Hi all,
>
> One thing confused me.
>
> The disconnect_bsp_APIC() may restore the interrupt delivery mode into
> virtual wire mode. it uses the vector F as the spurious interrput, But,
> IMO, using the vector 0xFF(SPURIOUS_APIC_VECTOR) may more suitable and
> will give us more detail. Why the disconnect_bsp_APIC() use vector F
> here?

I would say this needs a documentation search before changing this.

This code originates in:
208fb93162d5 ("[PATCH] kexec: x86_64: restore apic virtual wire mode on shutdown")

The example in the Multi-Processor Specification v1.4 shows setting
up the SPIV to vector 0x0f.

I don't know what is canonical and what will interact best with DOS,
and that erra of setup. The vector 0x0f seems an odd choice as
it is below 0x20 putting it in the range of vectors that are
reserved for processor exceptions.

The constant SPURIOUS_APIC_VECTOR is definitely not something we want
to be using at this point as that is a linux specific setting and used
when Linux is up and running. So it is completely inapplicable.

This is all about restoring how the apics were configured at boot time
so it may be appropriate to copy and store this value, if it was not
architectural.

At a practical level at this point I suspect we are ok as the setting
of the SPIV this way has not caused any known problems in the last
decade. If someone wants to dig through the archtectural documents
and the real world practice and find a better value and explain the
change I would not oppose it.

All I know for certain is using the constant SPURIOUS_APIC_VECTOR
is completely inappropriate (as that constant is about how linux uses
vectors) and thus the patch below is wrong.

> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 25ddf02598d2..550deaad6a9a 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -2130,7 +2130,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
> value = apic_read(APIC_SPIV);
> value &= ~APIC_VECTOR_MASK;
> value |= APIC_SPIV_APIC_ENABLED;
> - value |= 0xf;
> + value |= SPURIOUS_APIC_VECTOR;
> apic_write(APIC_SPIV, value);
>
> if (!virt_wire_setup) {
>

Eric

2018-02-12 10:19:27

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

On 02/11/18 at 11:11pm, Eric W. Biederman wrote:
> Dou Liyang <[email protected]> writes:
>
> > Hi all,
> >
> > One thing confused me.
> >
> > The disconnect_bsp_APIC() may restore the interrupt delivery mode into
> > virtual wire mode. it uses the vector F as the spurious interrput, But,
> > IMO, using the vector 0xFF(SPURIOUS_APIC_VECTOR) may more suitable and
> > will give us more detail. Why the disconnect_bsp_APIC() use vector F
> > here?
>
> I would say this needs a documentation search before changing this.
>
> This code originates in:
> 208fb93162d5 ("[PATCH] kexec: x86_64: restore apic virtual wire mode on shutdown")
>
> The example in the Multi-Processor Specification v1.4 shows setting
> up the SPIV to vector 0x0f.
>
> I don't know what is canonical and what will interact best with DOS,
> and that erra of setup. The vector 0x0f seems an odd choice as
> it is below 0x20 putting it in the range of vectors that are
> reserved for processor exceptions.
>
> The constant SPURIOUS_APIC_VECTOR is definitely not something we want
> to be using at this point as that is a linux specific setting and used
> when Linux is up and running. So it is completely inapplicable.
>
> This is all about restoring how the apics were configured at boot time
> so it may be appropriate to copy and store this value, if it was not
> architectural.
>
> At a practical level at this point I suspect we are ok as the setting
> of the SPIV this way has not caused any known problems in the last
> decade. If someone wants to dig through the archtectural documents
> and the real world practice and find a better value and explain the
> change I would not oppose it.
>
> All I know for certain is using the constant SPURIOUS_APIC_VECTOR
> is completely inappropriate (as that constant is about how linux uses
> vectors) and thus the patch below is wrong.

I dig a little deep into doc and code, there are some findings:

1) In Intel® 64 and IA-32 Architectures Software Developer’s Manual and
MP-Spec, both mentioned that Spurious-Interrupt Vector Register (SVR)
should be 0xxF. For P6 family and Pentium processors, bits 0 through 3
are hardwired to logical ones. Please see 10.9 SPURIOUS INTERRUPT of
intel manual vol-3A, part 1.

2) For vector 0xf, we do have a X86_TRAP_SPURIOUS which value is 0xf. In
intel manual mentioned as above, Table 6-1. Protected-Mode Exceptions and
Interrupts, vector 15 is "(Intel reserved. Do not use.)".

3) I made a debug patch as below to print out the value of SVR at the
very beginning of system init, on a bare metal system, its value is
0x10F, means the default value of BIOS setting is 0xF for spurious
interrupt vector.

3.1* I tested on qemu-2.9, the printed value is 0x1ff.
3.2* Tested with changing kdump kernel's spurious vector to 0xff, it
works.
3.3* Googled on internet, one guy said he ever checked bare metal
machine and vmware system, values are all 0x10f.

So agree with Eric that we should keep it as is since the value has been
there for so long time and no one met issue about it, and no
confirmative to support what value we should take, except of the example
in MP-Spec.

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 446c9ef8cfc3..739691bd3d77 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -942,6 +942,8 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)

void __init trap_init(void)
{
+ unsigned int v = apic_read(APIC_SPIV);
+ pr_info("... APIC SPIV: %08x\n", v);
/* Init cpu_entry_area before IST entries are set up */
setup_cpu_entry_areas();

>
> > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> > index 25ddf02598d2..550deaad6a9a 100644
> > --- a/arch/x86/kernel/apic/apic.c
> > +++ b/arch/x86/kernel/apic/apic.c
> > @@ -2130,7 +2130,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
> > value = apic_read(APIC_SPIV);
> > value &= ~APIC_VECTOR_MASK;
> > value |= APIC_SPIV_APIC_ENABLED;
> > - value |= 0xf;
> > + value |= SPURIOUS_APIC_VECTOR;
> > apic_write(APIC_SPIV, value);
> >
> > if (!virt_wire_setup) {
> >
>
> Eric

2018-02-12 11:56:21

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Hi Eric,

At 02/12/2018 01:11 PM, Eric W. Biederman wrote:
> Dou Liyang <[email protected]> writes:
>
>> Hi all,
>>
>> One thing confused me.
>>
>> The disconnect_bsp_APIC() may restore the interrupt delivery mode into
>> virtual wire mode. it uses the vector F as the spurious interrput, But,
>> IMO, using the vector 0xFF(SPURIOUS_APIC_VECTOR) may more suitable and
>> will give us more detail. Why the disconnect_bsp_APIC() use vector F
>> here?
>
> I would say this needs a documentation search before changing this.
>
> This code originates in:
> 208fb93162d5 ("[PATCH] kexec: x86_64: restore apic virtual wire mode on shutdown")
>
> The example in the Multi-Processor Specification v1.4 shows setting
> up the SPIV to vector 0x0f.
>
Thanks for your detailed explanation, I found the example A-1 in this
spec.

> I don't know what is canonical and what will interact best with DOS,
> and that erra of setup. The vector 0x0f seems an odd choice as
> it is below 0x20 putting it in the range of vectors that are
> reserved for processor exceptions.
>
> The constant SPURIOUS_APIC_VECTOR is definitely not something we want
> to be using at this point as that is a linux specific setting and used
> when Linux is up and running. So it is completely inapplicable.
>
> This is all about restoring how the apics were configured at boot time
> so it may be appropriate to copy and store this value, if it was not
> architectural.
>
> At a practical level at this point I suspect we are ok as the setting
> of the SPIV this way has not caused any known problems in the last
> decade. If someone wants to dig through the archtectural documents
> and the real world practice and find a better value and explain the
> change I would not oppose it.

Indeed, I understand.


Thanks,
dou

>
> All I know for certain is using the constant SPURIOUS_APIC_VECTOR
> is completely inappropriate (as that constant is about how linux uses
> vectors) and thus the patch below is wrong.
>
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index 25ddf02598d2..550deaad6a9a 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -2130,7 +2130,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>> value = apic_read(APIC_SPIV);
>> value &= ~APIC_VECTOR_MASK;
>> value |= APIC_SPIV_APIC_ENABLED;
>> - value |= 0xf;
>> + value |= SPURIOUS_APIC_VECTOR;
>> apic_write(APIC_SPIV, value);
>>
>> if (!virt_wire_setup) {
>>
>
> Eric
>
>
>



2018-02-12 12:31:14

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

On 02/11/18 at 09:08pm, Eric W. Biederman wrote:
> Baoquan He <[email protected]> writes:
>
> > This is a regression fix.
> >
> > Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
> > I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
> > calling after disable_IO_APIC(). This introdued a regression. The
> > root cause is that disable_IO_APIC() not only clears IO_APIC, also
> > restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
> > after disable_IO_APIC() will disable LAPIC and ruin the possible
> > virtual wire mode setting which the code has been trying to do all
> > along.
> >
> > The consequence is, in KVM guest kernel always prints warning as below
> > during kexec/kdump kernel boots up. That happened in setup_local_APIC()
> > since 'do { xxx } while (queued && max_loops > 0)' loop does not function
> > well any more if pending irq exists in APIC IRR after LAPIC is disabled.
> > And people even saw casual kdump kernel hang once in ~30 attempts during
> > stress testing of kdump on KVM machine.
> >
> > [ 0.001000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1467 setup_local_APIC+0x228/0x330
> > [ 0.001000] Modules linked in:
> > [ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #3
> > [ 0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [ 0.001000] RIP: 0010:setup_local_APIC+0x228/0x330
> > [ 0.001000] RSP: 0000:ffffffffb6e03eb8 EFLAGS: 00010286
> > [ 0.001000] RAX: 0000009edb4c4d84 RBX: 0000000000000000 RCX: 00000000b099d800
> > [ 0.001000] RDX: 0000009e00000000 RSI: 0000000000000000 RDI: 0000000000000810
> > [ 0.001000] RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000001
> > [ 0.001000] R10: ffff98ce6a801c00 R11: 0761076d072f0776 R12: 0000000000000001
> > [ 0.001000] R13: 00000000000000f0 R14: 0000000000004000 R15: ffffffffffffc6ff
> > [ 0.001000] FS: 0000000000000000(0000) GS:ffff98ce6bc00000(0000) knlGS:0000000000000000
> > [ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 0.001000] CR2: 00000000ffffffff CR3: 0000000022209000 CR4: 00000000000406b0
> > [ 0.001000] Call Trace:
> > [ 0.001000] apic_bsp_setup+0x56/0x74
> > [ 0.001000] x86_late_time_init+0x11/0x16
> > [ 0.001000] start_kernel+0x3c9/0x486
> > [ 0.001000] secondary_startup_64+0xa5/0xb0
> > [ 0.001000] Code: 00 85 c9 74 2d 0f 31 c1 e1 0a 48 c1 e2 20 41 89 cf 4c 03 7c 24 08 48 09 d0 49 29 c7 4c 89 3c 24 48 83 3c 24 00 0f 8f 8f fe ff
> > ff <0f> ff e9 10 ff ff ff 48 83 2c 24 01 eb e7 48 83 c4 18 5b 5d 41
> > [ 0.001000] ---[ end trace b88e71b9a6ebebdd ]---
> > [ 0.001000] masked ExtINT on CPU#0
> >
> > To fix this, just break down disable_IO_APIC(), then call
> > clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
> > and call restore_boot_irq_mode() to restore boot irq mode before
> > reboot or kexec/kdump jump.
>
> Two things here.
> a) This is missing a fixes tag and a CC stable.
> b) What makes your change to the KEXEC_JUMP code path safe?
> Have the lapic and ioapic already been shut down?
>
> The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
> either need to be documented in the change long why they are safe
> so that this change becomes obviously safe and correct.
>
> Otherwise we risk and trivial and obvious looking change causing another
> regression like changing the order of lapic_shutdown and disable_IOAPIC
> did.

Makes sense. Will change as you suggested and repost. Thanks a lot.

Thanks
Baoquan

> > ---
> > arch/x86/include/asm/io_apic.h | 1 +
> > arch/x86/kernel/apic/io_apic.c | 2 +-
> > arch/x86/kernel/crash.c | 3 ++-
> > arch/x86/kernel/machine_kexec_32.c | 2 +-
> > arch/x86/kernel/machine_kexec_64.c | 2 +-
> > arch/x86/kernel/reboot.c | 3 ++-
> > 6 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
> > index 558d1a6a13ad..0fa95bfacb39 100644
> > --- a/arch/x86/include/asm/io_apic.h
> > +++ b/arch/x86/include/asm/io_apic.h
> > @@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
> > extern void setup_IO_APIC(void);
> > extern void enable_IO_APIC(void);
> > extern void disable_IO_APIC(void);
> > +extern void clear_IO_APIC(void);
> > extern void restore_boot_irq_mode(void);
> > extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
> > extern void print_IO_APICs(void);
> > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> > index 7b73b6b9b4b6..2d7cd2db77f5 100644
> > --- a/arch/x86/kernel/apic/io_apic.c
> > +++ b/arch/x86/kernel/apic/io_apic.c
> > @@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> > mpc_ioapic_id(apic), pin);
> > }
> >
> > -static void clear_IO_APIC (void)
> > +void clear_IO_APIC (void)
> > {
> > int apic, pin;
> >
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index 10e74d4778a1..1f6680427ff0 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> > #ifdef CONFIG_X86_IO_APIC
> > /* Prevent crash_kexec() from deadlocking on ioapic_lock. */
> > ioapic_zap_locks();
> > - disable_IO_APIC();
> > + clear_IO_APIC();
> > #endif
> > lapic_shutdown();
> > + restore_boot_irq_mode();
> > #ifdef CONFIG_HPET_TIMER
> > hpet_disable();
> > #endif
> > diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
> > index edfede768688..f78bb4432bfb 100644
> > --- a/arch/x86/kernel/machine_kexec_32.c
> > +++ b/arch/x86/kernel/machine_kexec_32.c
> > @@ -199,7 +199,7 @@ void machine_kexec(struct kimage *image)
> > * one form or other. kexec jump path also need
> > * one.
> > */
> > - disable_IO_APIC();
> > + restore_boot_irq_mode();
> > #endif
> > }
> >
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 1f790cf9d38f..cb0c2d0a4c99 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
> > * one form or other. kexec jump path also need
> > * one.
> > */
> > - disable_IO_APIC();
> > + restore_boot_irq_mode();
> > #endif
> > }
> >
> > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > index 2126b9d27c34..725624b6c0c0 100644
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -666,7 +666,7 @@ void native_machine_shutdown(void)
> > * Even without the erratum, it still makes sense to quiet IO APIC
> > * before disabling Local APIC.
> > */
> > - disable_IO_APIC();
> > + clear_IO_APIC();
> > #endif
> >
> > #ifdef CONFIG_SMP
> > @@ -680,6 +680,7 @@ void native_machine_shutdown(void)
> > #endif
> >
> > lapic_shutdown();
> > + restore_boot_irq_mode();
> >
> > #ifdef CONFIG_HPET_TIMER
> > hpet_disable();

2018-02-13 02:45:10

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Hi Baoquan,

At 02/12/2018 11:08 AM, Eric W. Biederman wrote:
> Baoquan He <[email protected]> writes:
>
>> This is a regression fix.
>>
>> Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
>> I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
>> calling after disable_IO_APIC(). This introdued a regression. The
>> root cause is that disable_IO_APIC() not only clears IO_APIC, also
>> restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
>> after disable_IO_APIC() will disable LAPIC and ruin the possible
>> virtual wire mode setting which the code has been trying to do all
>> along.
>>
>> The consequence is, in KVM guest kernel always prints warning as below
>> during kexec/kdump kernel boots up. That happened in setup_local_APIC()
>> since 'do { xxx } while (queued && max_loops > 0)' loop does not function

I am not sure another thing here

AFAIK, according to the order of SMP machine shutdown, other CPUs will
be stopped firstly, then the last CPU disable its local apic.

--machine_shutdown
|----......
|----stop_other_cpus()
|----local_shutdown()

So, the pending interrupts exist only in BSP and only be ACKed by
BSP. is it right?(I validated this by print the value of APIC_IRR/ISR of
all CPUs, found only BSP had the non-zero value).

If it is right, We will do not need check the pending interrupt for each
cpus.

BTW, the pending interrupt check code is mixed with the local
APIC setup code, that looks messy. How about extracting the code which
related to the crash interrupt check, put it into a new function and
only invoked it when the CPU is BSP?

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 3fc259b4dd2d..0350528b320d 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1408,6 +1408,57 @@ static void lapic_setup_esr(void)
oldvalue, value);
}

+
+static void clear_crash_pending_intr(void)
+{
+ long long max_loops = cpu_khz ? cpu_khz : 1000000;
+ unsigned long long tsc = 0, ntsc;
+ unsigned int value, queued;
+ int i, j, acked = 0;
+
+ if (boot_cpu_has(X86_FEATURE_TSC))
+ tsc = rdtsc();
+ /*
+ * After a crash, we no longer service the interrupts and a pending
+ * interrupt from previous kernel might still have ISR bit set.
+ *
+ * Most probably by now CPU has serviced that pending interrupt and
+ * it might not have done the ack_APIC_irq() because it thought,
+ * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
+ * does not clear the ISR bit and cpu thinks it has already serivced
+ * the interrupt. Hence a vector might get locked. It was noticed
+ * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
+ */
+ do {
+ queued = 0;
+ for (i = APIC_ISR_NR - 1; i >= 0; i--)
+ queued |= apic_read(APIC_IRR + i*0x10);
+
+ for (i = APIC_ISR_NR - 1; i >= 0; i--) {
+ value = apic_read(APIC_ISR + i*0x10);
+ for (j = 31; j >= 0; j--) {
+ if (value & (1<<j)) {
+ ack_APIC_irq();
+ acked++;
+ }
+ }
+ }
+ if (acked > 256) {
+ printk(KERN_ERR "LAPIC pending interrupts after
%d EOI\n",
+ acked);
+ break;
+ }
+ if (queued) {
+ if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
+ ntsc = rdtsc();
+ max_loops = (cpu_khz << 10) - (ntsc - tsc);
+ } else
+ max_loops--;
+ }
+ } while (queued && max_loops > 0);
+ WARN_ON(max_loops <= 0);
+}
+
/**
* setup_local_APIC - setup the local APIC
*
@@ -1417,13 +1468,7 @@ static void lapic_setup_esr(void)
void setup_local_APIC(void)
{
int cpu = smp_processor_id();
- unsigned int value, queued;
- int i, j, acked = 0;
- unsigned long long tsc = 0, ntsc;
- long long max_loops = cpu_khz ? cpu_khz : 1000000;
-
- if (boot_cpu_has(X86_FEATURE_TSC))
- tsc = rdtsc();
- tsc = rdtsc();
+ unsigned int value;

if (disable_apic) {
disable_ioapic_support();
@@ -1475,45 +1520,8 @@ void setup_local_APIC(void)
value &= ~APIC_TPRI_MASK;
apic_write(APIC_TASKPRI, value);

- /*
- * After a crash, we no longer service the interrupts and a pending
- * interrupt from previous kernel might still have ISR bit set.
- *
- * Most probably by now CPU has serviced that pending interrupt and
- * it might not have done the ack_APIC_irq() because it thought,
- * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
- * does not clear the ISR bit and cpu thinks it has already serivced
- * the interrupt. Hence a vector might get locked. It was noticed
- * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
- */
- do {
- queued = 0;
- for (i = APIC_ISR_NR - 1; i >= 0; i--)
- queued |= apic_read(APIC_IRR + i*0x10);
-
- for (i = APIC_ISR_NR - 1; i >= 0; i--) {
- value = apic_read(APIC_ISR + i*0x10);
- for (j = 31; j >= 0; j--) {
- if (value & (1<<j)) {
- ack_APIC_irq();
- acked++;
- }
- }
- }
- if (acked > 256) {
- printk(KERN_ERR "LAPIC pending interrupts after
%d EOI\n",
- acked);
- break;
- }
- if (queued) {
- if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
- ntsc = rdtsc();
- max_loops = (cpu_khz << 10) - (ntsc - tsc);
- } else
- max_loops--;
- }
- } while (queued && max_loops > 0);
- WARN_ON(max_loops <= 0);
+ if (!cpu)
+ clear_crash_pending_intr();

/*
* Now that we are all set up, enable the APIC


Thanks,
dou.

>> well any more if pending irq exists in APIC IRR after LAPIC is disabled.
>> And people even saw casual kdump kernel hang once in ~30 attempts during
>> stress testing of kdump on KVM machine.
>>
>> [ 0.001000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1467 setup_local_APIC+0x228/0x330
>> [ 0.001000] Modules linked in:
>> [ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #3
>> [ 0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
>> [ 0.001000] RIP: 0010:setup_local_APIC+0x228/0x330
>> [ 0.001000] RSP: 0000:ffffffffb6e03eb8 EFLAGS: 00010286
>> [ 0.001000] RAX: 0000009edb4c4d84 RBX: 0000000000000000 RCX: 00000000b099d800
>> [ 0.001000] RDX: 0000009e00000000 RSI: 0000000000000000 RDI: 0000000000000810
>> [ 0.001000] RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000001
>> [ 0.001000] R10: ffff98ce6a801c00 R11: 0761076d072f0776 R12: 0000000000000001
>> [ 0.001000] R13: 00000000000000f0 R14: 0000000000004000 R15: ffffffffffffc6ff
>> [ 0.001000] FS: 0000000000000000(0000) GS:ffff98ce6bc00000(0000) knlGS:0000000000000000
>> [ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 0.001000] CR2: 00000000ffffffff CR3: 0000000022209000 CR4: 00000000000406b0
>> [ 0.001000] Call Trace:
>> [ 0.001000] apic_bsp_setup+0x56/0x74
>> [ 0.001000] x86_late_time_init+0x11/0x16
>> [ 0.001000] start_kernel+0x3c9/0x486
>> [ 0.001000] secondary_startup_64+0xa5/0xb0
>> [ 0.001000] Code: 00 85 c9 74 2d 0f 31 c1 e1 0a 48 c1 e2 20 41 89 cf 4c 03 7c 24 08 48 09 d0 49 29 c7 4c 89 3c 24 48 83 3c 24 00 0f 8f 8f fe ff
>> ff <0f> ff e9 10 ff ff ff 48 83 2c 24 01 eb e7 48 83 c4 18 5b 5d 41
>> [ 0.001000] ---[ end trace b88e71b9a6ebebdd ]---
>> [ 0.001000] masked ExtINT on CPU#0
>>
>> To fix this, just break down disable_IO_APIC(), then call
>> clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
>> and call restore_boot_irq_mode() to restore boot irq mode before
>> reboot or kexec/kdump jump.
>
> Two things here.
> a) This is missing a fixes tag and a CC stable.
> b) What makes your change to the KEXEC_JUMP code path safe?
> Have the lapic and ioapic already been shut down?
>
> The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
> either need to be documented in the change long why they are safe
> so that this change becomes obviously safe and correct.
>
> Otherwise we risk and trivial and obvious looking change causing another
> regression like changing the order of lapic_shutdown and disable_IOAPIC
> did.
>
> Eric
>
>
>>
>> Signed-off-by: Baoquan He <[email protected]>
>> ---
>> arch/x86/include/asm/io_apic.h | 1 +
>> arch/x86/kernel/apic/io_apic.c | 2 +-
>> arch/x86/kernel/crash.c | 3 ++-
>> arch/x86/kernel/machine_kexec_32.c | 2 +-
>> arch/x86/kernel/machine_kexec_64.c | 2 +-
>> arch/x86/kernel/reboot.c | 3 ++-
>> 6 files changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
>> index 558d1a6a13ad..0fa95bfacb39 100644
>> --- a/arch/x86/include/asm/io_apic.h
>> +++ b/arch/x86/include/asm/io_apic.h
>> @@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
>> extern void setup_IO_APIC(void);
>> extern void enable_IO_APIC(void);
>> extern void disable_IO_APIC(void);
>> +extern void clear_IO_APIC(void);
>> extern void restore_boot_irq_mode(void);
>> extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
>> extern void print_IO_APICs(void);
>> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
>> index 7b73b6b9b4b6..2d7cd2db77f5 100644
>> --- a/arch/x86/kernel/apic/io_apic.c
>> +++ b/arch/x86/kernel/apic/io_apic.c
>> @@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
>> mpc_ioapic_id(apic), pin);
>> }
>>
>> -static void clear_IO_APIC (void)
>> +void clear_IO_APIC (void)
>> {
>> int apic, pin;
>>
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index 10e74d4778a1..1f6680427ff0 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>> #ifdef CONFIG_X86_IO_APIC
>> /* Prevent crash_kexec() from deadlocking on ioapic_lock. */
>> ioapic_zap_locks();
>> - disable_IO_APIC();
>> + clear_IO_APIC();
>> #endif
>> lapic_shutdown();
>> + restore_boot_irq_mode();
>> #ifdef CONFIG_HPET_TIMER
>> hpet_disable();
>> #endif
>> diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
>> index edfede768688..f78bb4432bfb 100644
>> --- a/arch/x86/kernel/machine_kexec_32.c
>> +++ b/arch/x86/kernel/machine_kexec_32.c
>> @@ -199,7 +199,7 @@ void machine_kexec(struct kimage *image)
>> * one form or other. kexec jump path also need
>> * one.
>> */
>> - disable_IO_APIC();
>> + restore_boot_irq_mode();
>> #endif
>> }
>>
>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index 1f790cf9d38f..cb0c2d0a4c99 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
>> * one form or other. kexec jump path also need
>> * one.
>> */
>> - disable_IO_APIC();
>> + restore_boot_irq_mode();
>> #endif
>> }
>>
>> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
>> index 2126b9d27c34..725624b6c0c0 100644
>> --- a/arch/x86/kernel/reboot.c
>> +++ b/arch/x86/kernel/reboot.c
>> @@ -666,7 +666,7 @@ void native_machine_shutdown(void)
>> * Even without the erratum, it still makes sense to quiet IO APIC
>> * before disabling Local APIC.
>> */
>> - disable_IO_APIC();
>> + clear_IO_APIC();
>> #endif
>>
>> #ifdef CONFIG_SMP
>> @@ -680,6 +680,7 @@ void native_machine_shutdown(void)
>> #endif
>>
>> lapic_shutdown();
>> + restore_boot_irq_mode();
>>
>> #ifdef CONFIG_HPET_TIMER
>> hpet_disable();
>
>
>



2018-02-13 03:25:39

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

On 02/13/18 at 10:43am, Dou Liyang wrote:
> Hi Baoquan,
>
> At 02/12/2018 11:08 AM, Eric W. Biederman wrote:
> > Baoquan He <[email protected]> writes:
> >
> > > This is a regression fix.
> > >
> > > Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
> > > I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
> > > calling after disable_IO_APIC(). This introdued a regression. The
> > > root cause is that disable_IO_APIC() not only clears IO_APIC, also
> > > restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
> > > after disable_IO_APIC() will disable LAPIC and ruin the possible
> > > virtual wire mode setting which the code has been trying to do all
> > > along.
> > >
> > > The consequence is, in KVM guest kernel always prints warning as below
> > > during kexec/kdump kernel boots up. That happened in setup_local_APIC()
> > > since 'do { xxx } while (queued && max_loops > 0)' loop does not function
>
> I am not sure another thing here
>
> AFAIK, according to the order of SMP machine shutdown, other CPUs will
> be stopped firstly, then the last CPU disable its local apic.
>
> --machine_shutdown
> |----......
> |----stop_other_cpus()
> |----local_shutdown()
>
> So, the pending interrupts exist only in BSP and only be ACKed by
> BSP. is it right?(I validated this by print the value of APIC_IRR/ISR of
> all CPUs, found only BSP had the non-zero value).
>
> If it is right, We will do not need check the pending interrupt for each
> cpus.
>
> BTW, the pending interrupt check code is mixed with the local
> APIC setup code, that looks messy. How about extracting the code which
> related to the crash interrupt check, put it into a new function and
> only invoked it when the CPU is BSP?

Yes, looks a good code cleaning up. Virtual wire irq mode only go
through the boot cpu's lapic.

In fact, or put in apic_bsp_setup() since it's only related to boot
cpu to make it look cleaner if can do.

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 3fc259b4dd2d..cec2d0692ace 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2380,6 +2380,7 @@ void __init apic_bsp_setup(bool upmode)
connect_bsp_APIC();
if (upmode)
apic_bsp_up_setup();
+ clear_crash_pending_intr();
setup_local_APIC();

enable_IO_APIC();

>
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 3fc259b4dd2d..0350528b320d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1408,6 +1408,57 @@ static void lapic_setup_esr(void)
> oldvalue, value);
> }
>
> +
> +static void clear_crash_pending_intr(void)
> +{
> + long long max_loops = cpu_khz ? cpu_khz : 1000000;
> + unsigned long long tsc = 0, ntsc;
> + unsigned int value, queued;
> + int i, j, acked = 0;
Better do a check here.

if (disable_apic)
return;
> +
> + if (boot_cpu_has(X86_FEATURE_TSC))
> + tsc = rdtsc();
> + /*
> + * After a crash, we no longer service the interrupts and a pending
> + * interrupt from previous kernel might still have ISR bit set.
> + *
> + * Most probably by now CPU has serviced that pending interrupt and
> + * it might not have done the ack_APIC_irq() because it thought,
> + * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
> + * does not clear the ISR bit and cpu thinks it has already serivced
> + * the interrupt. Hence a vector might get locked. It was noticed
> + * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
> + */
> + do {
> + queued = 0;
> + for (i = APIC_ISR_NR - 1; i >= 0; i--)
> + queued |= apic_read(APIC_IRR + i*0x10);
> +
> + for (i = APIC_ISR_NR - 1; i >= 0; i--) {
> + value = apic_read(APIC_ISR + i*0x10);
> + for (j = 31; j >= 0; j--) {
> + if (value & (1<<j)) {
> + ack_APIC_irq();
> + acked++;
> + }
> + }
> + }
> + if (acked > 256) {
> + printk(KERN_ERR "LAPIC pending interrupts after %d
> EOI\n",
> + acked);
> + break;
> + }
> + if (queued) {
> + if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
> + ntsc = rdtsc();
> + max_loops = (cpu_khz << 10) - (ntsc - tsc);
> + } else
> + max_loops--;
> + }
> + } while (queued && max_loops > 0);
> + WARN_ON(max_loops <= 0);
> +}
> +
> /**
> * setup_local_APIC - setup the local APIC
> *
> @@ -1417,13 +1468,7 @@ static void lapic_setup_esr(void)
> void setup_local_APIC(void)
> {
> int cpu = smp_processor_id();
> - unsigned int value, queued;
> - int i, j, acked = 0;
> - unsigned long long tsc = 0, ntsc;
> - long long max_loops = cpu_khz ? cpu_khz : 1000000;
> -
> - if (boot_cpu_has(X86_FEATURE_TSC))
> - tsc = rdtsc();
> - tsc = rdtsc();
> + unsigned int value;
>
> if (disable_apic) {
> disable_ioapic_support();
> @@ -1475,45 +1520,8 @@ void setup_local_APIC(void)
> value &= ~APIC_TPRI_MASK;
> apic_write(APIC_TASKPRI, value);
>
> - /*
> - * After a crash, we no longer service the interrupts and a pending
> - * interrupt from previous kernel might still have ISR bit set.
> - *
> - * Most probably by now CPU has serviced that pending interrupt and
> - * it might not have done the ack_APIC_irq() because it thought,
> - * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
> - * does not clear the ISR bit and cpu thinks it has already serivced
> - * the interrupt. Hence a vector might get locked. It was noticed
> - * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
> - */
> - do {
> - queued = 0;
> - for (i = APIC_ISR_NR - 1; i >= 0; i--)
> - queued |= apic_read(APIC_IRR + i*0x10);
> -
> - for (i = APIC_ISR_NR - 1; i >= 0; i--) {
> - value = apic_read(APIC_ISR + i*0x10);
> - for (j = 31; j >= 0; j--) {
> - if (value & (1<<j)) {
> - ack_APIC_irq();
> - acked++;
> - }
> - }
> - }
> - if (acked > 256) {
> - printk(KERN_ERR "LAPIC pending interrupts after %d
> EOI\n",
> - acked);
> - break;
> - }
> - if (queued) {
> - if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
> - ntsc = rdtsc();
> - max_loops = (cpu_khz << 10) - (ntsc - tsc);
> - } else
> - max_loops--;
> - }
> - } while (queued && max_loops > 0);
> - WARN_ON(max_loops <= 0);
> + if (!cpu)
> + clear_crash_pending_intr();
>
> /*
> * Now that we are all set up, enable the APIC
>
>
> Thanks,
> dou.
>
> > > well any more if pending irq exists in APIC IRR after LAPIC is disabled.
> > > And people even saw casual kdump kernel hang once in ~30 attempts during
> > > stress testing of kdump on KVM machine.
> > >
> > > [ 0.001000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1467 setup_local_APIC+0x228/0x330
> > > [ 0.001000] Modules linked in:
> > > [ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #3
> > > [ 0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > > [ 0.001000] RIP: 0010:setup_local_APIC+0x228/0x330
> > > [ 0.001000] RSP: 0000:ffffffffb6e03eb8 EFLAGS: 00010286
> > > [ 0.001000] RAX: 0000009edb4c4d84 RBX: 0000000000000000 RCX: 00000000b099d800
> > > [ 0.001000] RDX: 0000009e00000000 RSI: 0000000000000000 RDI: 0000000000000810
> > > [ 0.001000] RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000001
> > > [ 0.001000] R10: ffff98ce6a801c00 R11: 0761076d072f0776 R12: 0000000000000001
> > > [ 0.001000] R13: 00000000000000f0 R14: 0000000000004000 R15: ffffffffffffc6ff
> > > [ 0.001000] FS: 0000000000000000(0000) GS:ffff98ce6bc00000(0000) knlGS:0000000000000000
> > > [ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 0.001000] CR2: 00000000ffffffff CR3: 0000000022209000 CR4: 00000000000406b0
> > > [ 0.001000] Call Trace:
> > > [ 0.001000] apic_bsp_setup+0x56/0x74
> > > [ 0.001000] x86_late_time_init+0x11/0x16
> > > [ 0.001000] start_kernel+0x3c9/0x486
> > > [ 0.001000] secondary_startup_64+0xa5/0xb0
> > > [ 0.001000] Code: 00 85 c9 74 2d 0f 31 c1 e1 0a 48 c1 e2 20 41 89 cf 4c 03 7c 24 08 48 09 d0 49 29 c7 4c 89 3c 24 48 83 3c 24 00 0f 8f 8f fe ff
> > > ff <0f> ff e9 10 ff ff ff 48 83 2c 24 01 eb e7 48 83 c4 18 5b 5d 41
> > > [ 0.001000] ---[ end trace b88e71b9a6ebebdd ]---
> > > [ 0.001000] masked ExtINT on CPU#0
> > >
> > > To fix this, just break down disable_IO_APIC(), then call
> > > clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
> > > and call restore_boot_irq_mode() to restore boot irq mode before
> > > reboot or kexec/kdump jump.
> >
> > Two things here.
> > a) This is missing a fixes tag and a CC stable.
> > b) What makes your change to the KEXEC_JUMP code path safe?
> > Have the lapic and ioapic already been shut down?
> >
> > The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
> > either need to be documented in the change long why they are safe
> > so that this change becomes obviously safe and correct.
> >
> > Otherwise we risk and trivial and obvious looking change causing another
> > regression like changing the order of lapic_shutdown and disable_IOAPIC
> > did.
> >
> > Eric
> >
> >
> > >
> > > Signed-off-by: Baoquan He <[email protected]>
> > > ---
> > > arch/x86/include/asm/io_apic.h | 1 +
> > > arch/x86/kernel/apic/io_apic.c | 2 +-
> > > arch/x86/kernel/crash.c | 3 ++-
> > > arch/x86/kernel/machine_kexec_32.c | 2 +-
> > > arch/x86/kernel/machine_kexec_64.c | 2 +-
> > > arch/x86/kernel/reboot.c | 3 ++-
> > > 6 files changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
> > > index 558d1a6a13ad..0fa95bfacb39 100644
> > > --- a/arch/x86/include/asm/io_apic.h
> > > +++ b/arch/x86/include/asm/io_apic.h
> > > @@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
> > > extern void setup_IO_APIC(void);
> > > extern void enable_IO_APIC(void);
> > > extern void disable_IO_APIC(void);
> > > +extern void clear_IO_APIC(void);
> > > extern void restore_boot_irq_mode(void);
> > > extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
> > > extern void print_IO_APICs(void);
> > > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> > > index 7b73b6b9b4b6..2d7cd2db77f5 100644
> > > --- a/arch/x86/kernel/apic/io_apic.c
> > > +++ b/arch/x86/kernel/apic/io_apic.c
> > > @@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> > > mpc_ioapic_id(apic), pin);
> > > }
> > > -static void clear_IO_APIC (void)
> > > +void clear_IO_APIC (void)
> > > {
> > > int apic, pin;
> > > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > > index 10e74d4778a1..1f6680427ff0 100644
> > > --- a/arch/x86/kernel/crash.c
> > > +++ b/arch/x86/kernel/crash.c
> > > @@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> > > #ifdef CONFIG_X86_IO_APIC
> > > /* Prevent crash_kexec() from deadlocking on ioapic_lock. */
> > > ioapic_zap_locks();
> > > - disable_IO_APIC();
> > > + clear_IO_APIC();
> > > #endif
> > > lapic_shutdown();
> > > + restore_boot_irq_mode();
> > > #ifdef CONFIG_HPET_TIMER
> > > hpet_disable();
> > > #endif
> > > diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
> > > index edfede768688..f78bb4432bfb 100644
> > > --- a/arch/x86/kernel/machine_kexec_32.c
> > > +++ b/arch/x86/kernel/machine_kexec_32.c
> > > @@ -199,7 +199,7 @@ void machine_kexec(struct kimage *image)
> > > * one form or other. kexec jump path also need
> > > * one.
> > > */
> > > - disable_IO_APIC();
> > > + restore_boot_irq_mode();
> > > #endif
> > > }
> > > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > > index 1f790cf9d38f..cb0c2d0a4c99 100644
> > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
> > > * one form or other. kexec jump path also need
> > > * one.
> > > */
> > > - disable_IO_APIC();
> > > + restore_boot_irq_mode();
> > > #endif
> > > }
> > > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > > index 2126b9d27c34..725624b6c0c0 100644
> > > --- a/arch/x86/kernel/reboot.c
> > > +++ b/arch/x86/kernel/reboot.c
> > > @@ -666,7 +666,7 @@ void native_machine_shutdown(void)
> > > * Even without the erratum, it still makes sense to quiet IO APIC
> > > * before disabling Local APIC.
> > > */
> > > - disable_IO_APIC();
> > > + clear_IO_APIC();
> > > #endif
> > > #ifdef CONFIG_SMP
> > > @@ -680,6 +680,7 @@ void native_machine_shutdown(void)
> > > #endif
> > > lapic_shutdown();
> > > + restore_boot_irq_mode();
> > > #ifdef CONFIG_HPET_TIMER
> > > hpet_disable();
> >
> >
> >
>
>

2018-02-13 07:45:17

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Hi Eric,

On 02/11/18 at 09:08pm, Eric W. Biederman wrote:
> Baoquan He <[email protected]> writes:
>
> > This is a regression fix.
> >
> > Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
> > I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
> > calling after disable_IO_APIC(). This introdued a regression. The
> > root cause is that disable_IO_APIC() not only clears IO_APIC, also
> > restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
> > after disable_IO_APIC() will disable LAPIC and ruin the possible
> > virtual wire mode setting which the code has been trying to do all
> > along.
> > To fix this, just break down disable_IO_APIC(), then call
> > clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
> > and call restore_boot_irq_mode() to restore boot irq mode before
> > reboot or kexec/kdump jump.
>
> Two things here.
> a) This is missing a fixes tag and a CC stable.
> b) What makes your change to the KEXEC_JUMP code path safe?
> Have the lapic and ioapic already been shut down?
>
> The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
> either need to be documented in the change long why they are safe
> so that this change becomes obviously safe and correct.

Re-read the code, I have to admit I didn't check the KEXEC_JUMP code
path carefully.

kernel_kexec() {
if (kexec_image->preserve_context) {
...
freeze_processes();
...
disable_nonboot_cpus();
...

else {
...
machine_shutdown();
...
}
machine_kexec(kexec_image);
...
}

--machine_shutdown()
--native_machine_shutdown()
--disable_IO_APIC()
--lapic_shutdown()

machine_kexec() {
...
if (image->preserve_context) {
disable_IO_APIC();
}
...
}

KEXEC_JUMP code path is different than kexec/kdump, it doesn't call
lapic_shutdown() before jump. So commit 522e66464467
("x86/apic: Disable I/O APIC before shutdown of the local APIC") didn't
impact it. And here I break down disable_IO_APIC() and change to only
call restore_boot_irq_mode() to make a possible danger. I am not an
expert on KEXEC_JUMP, and don't know how to test it, so will keep the
code implementation consistent as before. For now, I plan to change it
as below if you don't object. As you pointed out, I will describe this
in patch log.

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 1f790cf9d38f..cb0c2d0a4c99 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
* one form or other. kexec jump path also need
* one.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
+ restore_boot_irq_mode();
#endif
}




>
> Otherwise we risk and trivial and obvious looking change causing another
> regression like changing the order of lapic_shutdown and disable_IOAPIC
> did.
>
> Eric
>
>
> >
> > Signed-off-by: Baoquan He <[email protected]>
> > ---
> > arch/x86/include/asm/io_apic.h | 1 +
> > arch/x86/kernel/apic/io_apic.c | 2 +-
> > arch/x86/kernel/crash.c | 3 ++-
> > arch/x86/kernel/machine_kexec_32.c | 2 +-
> > arch/x86/kernel/machine_kexec_64.c | 2 +-
> > arch/x86/kernel/reboot.c | 3 ++-
> > 6 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
> > index 558d1a6a13ad..0fa95bfacb39 100644
> > --- a/arch/x86/include/asm/io_apic.h
> > +++ b/arch/x86/include/asm/io_apic.h
> > @@ -193,6 +193,7 @@ static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
> > extern void setup_IO_APIC(void);
> > extern void enable_IO_APIC(void);
> > extern void disable_IO_APIC(void);
> > +extern void clear_IO_APIC(void);
> > extern void restore_boot_irq_mode(void);
> > extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
> > extern void print_IO_APICs(void);
> > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> > index 7b73b6b9b4b6..2d7cd2db77f5 100644
> > --- a/arch/x86/kernel/apic/io_apic.c
> > +++ b/arch/x86/kernel/apic/io_apic.c
> > @@ -587,7 +587,7 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> > mpc_ioapic_id(apic), pin);
> > }
> >
> > -static void clear_IO_APIC (void)
> > +void clear_IO_APIC (void)
> > {
> > int apic, pin;
> >
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index 10e74d4778a1..1f6680427ff0 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -199,9 +199,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> > #ifdef CONFIG_X86_IO_APIC
> > /* Prevent crash_kexec() from deadlocking on ioapic_lock. */
> > ioapic_zap_locks();
> > - disable_IO_APIC();
> > + clear_IO_APIC();
> > #endif
> > lapic_shutdown();
> > + restore_boot_irq_mode();
> > #ifdef CONFIG_HPET_TIMER
> > hpet_disable();
> > #endif
> > diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
> > index edfede768688..f78bb4432bfb 100644
> > --- a/arch/x86/kernel/machine_kexec_32.c
> > +++ b/arch/x86/kernel/machine_kexec_32.c
> > @@ -199,7 +199,7 @@ void machine_kexec(struct kimage *image)
> > * one form or other. kexec jump path also need
> > * one.
> > */
> > - disable_IO_APIC();
> > + restore_boot_irq_mode();
> > #endif
> > }
> >
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 1f790cf9d38f..cb0c2d0a4c99 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
> > * one form or other. kexec jump path also need
> > * one.
> > */
> > - disable_IO_APIC();
> > + restore_boot_irq_mode();
> > #endif
> > }
> >
> > diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> > index 2126b9d27c34..725624b6c0c0 100644
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -666,7 +666,7 @@ void native_machine_shutdown(void)
> > * Even without the erratum, it still makes sense to quiet IO APIC
> > * before disabling Local APIC.
> > */
> > - disable_IO_APIC();
> > + clear_IO_APIC();
> > #endif
> >
> > #ifdef CONFIG_SMP
> > @@ -680,6 +680,7 @@ void native_machine_shutdown(void)
> > #endif
> >
> > lapic_shutdown();
> > + restore_boot_irq_mode();
> >
> > #ifdef CONFIG_HPET_TIMER
> > hpet_disable();

2018-02-13 17:42:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Dou Liyang <[email protected]> writes:

> Hi Baoquan,
>
> At 02/12/2018 11:08 AM, Eric W. Biederman wrote:
>> Baoquan He <[email protected]> writes:
>>
>>> This is a regression fix.
>>>
>>> Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
>>> I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
>>> calling after disable_IO_APIC(). This introdued a regression. The
>>> root cause is that disable_IO_APIC() not only clears IO_APIC, also
>>> restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
>>> after disable_IO_APIC() will disable LAPIC and ruin the possible
>>> virtual wire mode setting which the code has been trying to do all
>>> along.
>>>
>>> The consequence is, in KVM guest kernel always prints warning as below
>>> during kexec/kdump kernel boots up. That happened in setup_local_APIC()
>>> since 'do { xxx } while (queued && max_loops > 0)' loop does not function
>
> I am not sure another thing here
>
> AFAIK, according to the order of SMP machine shutdown, other CPUs will
> be stopped firstly, then the last CPU disable its local apic.
>
> --machine_shutdown
> |----......
> |----stop_other_cpus()
> |----local_shutdown()
>
> So, the pending interrupts exist only in BSP and only be ACKed by
> BSP. is it right?(I validated this by print the value of APIC_IRR/ISR of
> all CPUs, found only BSP had the non-zero value).

We don't know. In the kexec on panic case we try and shutdown the other
cpus but we have a timeout because we might fail.

Further you have to be careful with the concept of boot cpu. In the
normal kexec case shutdown on the boot cpu and leave it running. In the
kexec on panic case we shutdown on an arbitrary cpu.

> If it is right, We will do not need check the pending interrupt for each
> cpus.

It is also cheap if there are no pending interrupts as there is nothing
to do.

> BTW, the pending interrupt check code is mixed with the local
> APIC setup code, that looks messy. How about extracting the code which
> related to the crash interrupt check, put it into a new function and
> only invoked it when the CPU is BSP?

Moving it into it's own function makes sense. Let's not taint the
concept with ``crash''. We don't know that the only way this will
ever happen is from kexec on panic. We only know it was easy to
trigger the condition from kexec on panic.

There are a lot of cases I can think of that interrupts might fire while
interrupts are disabled because the kernel is booting. A normal kexec
is also possible.

Throw in MSI interrupts and transitioning from one state to another in
non-legacy apic mode and we might be more likely to get some irqs in
a pending state.

Your patch makes me nervous as it is not just code motion but a much
more substantial change.


As much as I agree that we need to fix the regression in the apic
shutdown code that is causing problems. What we really need to do
is to completely remove the apic shutdown code from the kexec on panic
code path. Over the long term that will provide us with a much more
robust path to getting crash dumps and the like.


Eric


> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 3fc259b4dd2d..0350528b320d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1408,6 +1408,57 @@ static void lapic_setup_esr(void)
> oldvalue, value);
> }
>
> +
> +static void clear_crash_pending_intr(void)
> +{
> + long long max_loops = cpu_khz ? cpu_khz : 1000000;
> + unsigned long long tsc = 0, ntsc;
> + unsigned int value, queued;
> + int i, j, acked = 0;
> +
> + if (boot_cpu_has(X86_FEATURE_TSC))
> + tsc = rdtsc();
> + /*
> + * After a crash, we no longer service the interrupts and a pending
> + * interrupt from previous kernel might still have ISR bit set.
> + *
> + * Most probably by now CPU has serviced that pending interrupt and
> + * it might not have done the ack_APIC_irq() because it thought,
> + * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
> + * does not clear the ISR bit and cpu thinks it has already serivced
> + * the interrupt. Hence a vector might get locked. It was noticed
> + * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
> + */
> + do {
> + queued = 0;
> + for (i = APIC_ISR_NR - 1; i >= 0; i--)
> + queued |= apic_read(APIC_IRR + i*0x10);
> +
> + for (i = APIC_ISR_NR - 1; i >= 0; i--) {
> + value = apic_read(APIC_ISR + i*0x10);
> + for (j = 31; j >= 0; j--) {
> + if (value & (1<<j)) {
> + ack_APIC_irq();
> + acked++;
> + }
> + }
> + }
> + if (acked > 256) {
> + printk(KERN_ERR "LAPIC pending interrupts after %d
> EOI\n",
> + acked);
> + break;
> + }
> + if (queued) {
> + if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
> + ntsc = rdtsc();
> + max_loops = (cpu_khz << 10) - (ntsc - tsc);
> + } else
> + max_loops--;
> + }
> + } while (queued && max_loops > 0);
> + WARN_ON(max_loops <= 0);
> +}
> +
> /**
> * setup_local_APIC - setup the local APIC
> *
> @@ -1417,13 +1468,7 @@ static void lapic_setup_esr(void)
> void setup_local_APIC(void)
> {
> int cpu = smp_processor_id();
> - unsigned int value, queued;
> - int i, j, acked = 0;
> - unsigned long long tsc = 0, ntsc;
> - long long max_loops = cpu_khz ? cpu_khz : 1000000;
> -
> - if (boot_cpu_has(X86_FEATURE_TSC))
> - tsc = rdtsc();
> - tsc = rdtsc();
> + unsigned int value;
>
> if (disable_apic) {
> disable_ioapic_support();
> @@ -1475,45 +1520,8 @@ void setup_local_APIC(void)
> value &= ~APIC_TPRI_MASK;
> apic_write(APIC_TASKPRI, value);
>
> - /*
> - * After a crash, we no longer service the interrupts and a pending
> - * interrupt from previous kernel might still have ISR bit set.
> - *
> - * Most probably by now CPU has serviced that pending interrupt and
> - * it might not have done the ack_APIC_irq() because it thought,
> - * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
> - * does not clear the ISR bit and cpu thinks it has already serivced
> - * the interrupt. Hence a vector might get locked. It was noticed
> - * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
> - */
> - do {
> - queued = 0;
> - for (i = APIC_ISR_NR - 1; i >= 0; i--)
> - queued |= apic_read(APIC_IRR + i*0x10);
> -
> - for (i = APIC_ISR_NR - 1; i >= 0; i--) {
> - value = apic_read(APIC_ISR + i*0x10);
> - for (j = 31; j >= 0; j--) {
> - if (value & (1<<j)) {
> - ack_APIC_irq();
> - acked++;
> - }
> - }
> - }
> - if (acked > 256) {
> - printk(KERN_ERR "LAPIC pending interrupts after %d
> EOI\n",
> - acked);
> - break;
> - }
> - if (queued) {
> - if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
> - ntsc = rdtsc();
> - max_loops = (cpu_khz << 10) - (ntsc - tsc);
> - } else
> - max_loops--;
> - }
> - } while (queued && max_loops > 0);
> - WARN_ON(max_loops <= 0);
> + if (!cpu)
> + clear_crash_pending_intr();
>
> /*
> * Now that we are all set up, enable the APIC
>
>
> Thanks,
> dou.

2018-02-13 17:46:34

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Baoquan He <[email protected]> writes:

> Hi Eric,
>
> On 02/11/18 at 09:08pm, Eric W. Biederman wrote:
>> Baoquan He <[email protected]> writes:
>>
>> > This is a regression fix.
>> >
>> > Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
>> > I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
>> > calling after disable_IO_APIC(). This introdued a regression. The
>> > root cause is that disable_IO_APIC() not only clears IO_APIC, also
>> > restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
>> > after disable_IO_APIC() will disable LAPIC and ruin the possible
>> > virtual wire mode setting which the code has been trying to do all
>> > along.
>> > To fix this, just break down disable_IO_APIC(), then call
>> > clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
>> > and call restore_boot_irq_mode() to restore boot irq mode before
>> > reboot or kexec/kdump jump.
>>
>> Two things here.
>> a) This is missing a fixes tag and a CC stable.
>> b) What makes your change to the KEXEC_JUMP code path safe?
>> Have the lapic and ioapic already been shut down?
>>
>> The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
>> either need to be documented in the change long why they are safe
>> so that this change becomes obviously safe and correct.
>
> Re-read the code, I have to admit I didn't check the KEXEC_JUMP code
> path carefully.
>
> kernel_kexec() {
> if (kexec_image->preserve_context) {
> ...
> freeze_processes();
> ...
> disable_nonboot_cpus();
> ...
>
> else {
> ...
> machine_shutdown();
> ...
> }
> machine_kexec(kexec_image);
> ...
> }
>
> --machine_shutdown()
> --native_machine_shutdown()
> --disable_IO_APIC()
> --lapic_shutdown()
>
> machine_kexec() {
> ...
> if (image->preserve_context) {
> disable_IO_APIC();
> }
> ...
> }
>
> KEXEC_JUMP code path is different than kexec/kdump, it doesn't call
> lapic_shutdown() before jump. So commit 522e66464467
> ("x86/apic: Disable I/O APIC before shutdown of the local APIC") didn't
> impact it. And here I break down disable_IO_APIC() and change to only
> call restore_boot_irq_mode() to make a possible danger. I am not an
> expert on KEXEC_JUMP, and don't know how to test it, so will keep the
> code implementation consistent as before. For now, I plan to change it
> as below if you don't object. As you pointed out, I will describe this
> in patch log.
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 1f790cf9d38f..cb0c2d0a4c99 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
> * one form or other. kexec jump path also need
> * one.
> */
> - disable_IO_APIC();
> + clear_IO_APIC();
> + restore_boot_irq_mode();
> #endif
> }
>
>

Let me give a very concrete suggestion:
Patch 1) Replace "disable_IO_APIC();" with "clear_IO_APIC(); restore_boot_irq_mode();"
Patch 2) Move restore_boot_irq_mode(); to fix the regression.

I think that will be a slightly shorter patch sequence than what you are
dealing with and one that is slightly easier to read.

We need to sort out KEXEC_JUMP but that is something for another time.

Eric

2018-02-14 02:46:49

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

On 02/13/18 at 11:44am, Eric W. Biederman wrote:
> Baoquan He <[email protected]> writes:
>
> > Hi Eric,
> >
> > On 02/11/18 at 09:08pm, Eric W. Biederman wrote:
> >> Baoquan He <[email protected]> writes:
> >>
> >> > This is a regression fix.
> >> >
> >> > Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
> >> > I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
> >> > calling after disable_IO_APIC(). This introdued a regression. The
> >> > root cause is that disable_IO_APIC() not only clears IO_APIC, also
> >> > restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
> >> > after disable_IO_APIC() will disable LAPIC and ruin the possible
> >> > virtual wire mode setting which the code has been trying to do all
> >> > along.
> >> > To fix this, just break down disable_IO_APIC(), then call
> >> > clear_IO_APIC() to stop IO_APIC where disable_IO_APIC() was called,
> >> > and call restore_boot_irq_mode() to restore boot irq mode before
> >> > reboot or kexec/kdump jump.
> >>
> >> Two things here.
> >> a) This is missing a fixes tag and a CC stable.
> >> b) What makes your change to the KEXEC_JUMP code path safe?
> >> Have the lapic and ioapic already been shut down?
> >>
> >> The KEXEC_JUMP changes to machine_kexec_32.c and machine_kexec_64.c
> >> either need to be documented in the change long why they are safe
> >> so that this change becomes obviously safe and correct.
> >
> > Re-read the code, I have to admit I didn't check the KEXEC_JUMP code
> > path carefully.
> >
> > kernel_kexec() {
> > if (kexec_image->preserve_context) {
> > ...
> > freeze_processes();
> > ...
> > disable_nonboot_cpus();
> > ...
> >
> > else {
> > ...
> > machine_shutdown();
> > ...
> > }
> > machine_kexec(kexec_image);
> > ...
> > }
> >
> > --machine_shutdown()
> > --native_machine_shutdown()
> > --disable_IO_APIC()
> > --lapic_shutdown()
> >
> > machine_kexec() {
> > ...
> > if (image->preserve_context) {
> > disable_IO_APIC();
> > }
> > ...
> > }
> >
> > KEXEC_JUMP code path is different than kexec/kdump, it doesn't call
> > lapic_shutdown() before jump. So commit 522e66464467
> > ("x86/apic: Disable I/O APIC before shutdown of the local APIC") didn't
> > impact it. And here I break down disable_IO_APIC() and change to only
> > call restore_boot_irq_mode() to make a possible danger. I am not an
> > expert on KEXEC_JUMP, and don't know how to test it, so will keep the
> > code implementation consistent as before. For now, I plan to change it
> > as below if you don't object. As you pointed out, I will describe this
> > in patch log.
> >
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 1f790cf9d38f..cb0c2d0a4c99 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -297,7 +297,7 @@ void machine_kexec(struct kimage *image)
> > * one form or other. kexec jump path also need
> > * one.
> > */
> > - disable_IO_APIC();
> > + clear_IO_APIC();
> > + restore_boot_irq_mode();
> > #endif
> > }
> >
> >
>
> Let me give a very concrete suggestion:
> Patch 1) Replace "disable_IO_APIC();" with "clear_IO_APIC(); restore_boot_irq_mode();"
> Patch 2) Move restore_boot_irq_mode(); to fix the regression.
>
> I think that will be a slightly shorter patch sequence than what you are
> dealing with and one that is slightly easier to read.

Sure, will do.

Thanks for reviewing and suggestion.

>
> We need to sort out KEXEC_JUMP but that is something for another time.

Agree. We ever contacted the author, intel dev, seems they don't
maintain it any more. And very few people are interested and report
the relevant issues. While recently someone is testing KEXEC_JUMP
functionality and updating related doc, will continue tracking it.

Thanks
Baoquan

2018-02-14 03:23:20

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

Hi Eric,

At 02/14/2018 01:40 AM, Eric W. Biederman wrote:
> Dou Liyang <[email protected]> writes:
>
>> Hi Baoquan,
>>
>> At 02/12/2018 11:08 AM, Eric W. Biederman wrote:
>>> Baoquan He <[email protected]> writes:
>>>
>>>> This is a regression fix.
>>>>
>>>> Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
>>>> I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
>>>> calling after disable_IO_APIC(). This introdued a regression. The
>>>> root cause is that disable_IO_APIC() not only clears IO_APIC, also
>>>> restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
>>>> after disable_IO_APIC() will disable LAPIC and ruin the possible
>>>> virtual wire mode setting which the code has been trying to do all
>>>> along.
>>>>
>>>> The consequence is, in KVM guest kernel always prints warning as below
>>>> during kexec/kdump kernel boots up. That happened in setup_local_APIC()
>>>> since 'do { xxx } while (queued && max_loops > 0)' loop does not function
>>
>> I am not sure another thing here
>>
>> AFAIK, according to the order of SMP machine shutdown, other CPUs will
>> be stopped firstly, then the last CPU disable its local apic.
>>
>> --machine_shutdown
>> |----......
>> |----stop_other_cpus()
>> |----local_shutdown()
>>
>> So, the pending interrupts exist only in BSP and only be ACKed by
>> BSP. is it right?(I validated this by print the value of APIC_IRR/ISR of
>> all CPUs, found only BSP had the non-zero value).
>
> We don't know. In the kexec on panic case we try and shutdown the other
> cpus but we have a timeout because we might fail.
>
> Further you have to be careful with the concept of boot cpu. In the
> normal kexec case shutdown on the boot cpu and leave it running. In the
> kexec on panic case we shutdown on an arbitrary cpu.
>
>> If it is right, We will do not need check the pending interrupt for each
>> cpus.
>
> It is also cheap if there are no pending interrupts as there is nothing
> to do.
>

Yes, It's cheap.

But, the local APICs in APs were disabled at that time, not sure if
writing to the end-of-interrupt (EOI) register could cause the local
APIC to clear the ISR successfully.

If the ISR was not cleared, doing that check is useless.

I can't produce any cases that the lapics in APs have pending
interrupts. do you have some suggestions?

>> BTW, the pending interrupt check code is mixed with the local
>> APIC setup code, that looks messy. How about extracting the code which
>> related to the crash interrupt check, put it into a new function and
>> only invoked it when the CPU is BSP?
>
> Moving it into it's own function makes sense. Let's not taint the
> concept with ``crash''. We don't know that the only way this will
> ever happen is from kexec on panic. We only know it was easy to
> trigger the condition from kexec on panic.
>

Yes, I see.

> There are a lot of cases I can think of that interrupts might fire while
> interrupts are disabled because the kernel is booting. A normal kexec
> is also possible.
>
> Throw in MSI interrupts and transitioning from one state to another in
> non-legacy apic mode and we might be more likely to get some irqs in
> a pending state.
>
> Your patch makes me nervous as it is not just code motion but a much
> more substantial change.
>
>
> As much as I agree that we need to fix the regression in the apic
> shutdown code that is causing problems. What we really need to do
> is to completely remove the apic shutdown code from the kexec on panic
> code path. Over the long term that will provide us with a much more

Wow, indeed, and it is related to many hypervisors on x86, For me, still
need more investigations and tests.

Thanks,
dou