This patch set is to allow kdump 2nd kernel to wake up multiple CPUs
even if 1st kernel crashs on some AP.
This version is for fixing building failure of tip tree reported by
some bot and also adding the other fixes I missed out from previous
experimental version I posted last year; see ChangeLog.
Sorry, this patch set have not include in-source documentation
requested by Borislav Petkov yet, but I'll post it later separately,
which would be better to focus on documentation reviewing.
ChangeLog
v1 => v2)
- Rebased on top of v3.12-rc5.
- Fix linking time error of boot_cpu_is_bsp_init() in case of
CONFIG_LOCAL_APIC disabled by adding empty static inline function
instead.
- Fix missing feature check by means of cpu_has_apic macro in
boot_cpu_is_bsp_init() before calling rdmsr_safe(MSR_IA32_APICBASE).
NOTE: I've checked local apic-present case only; I don't have any
x86 processor without local apic.
- Add __init annotation to boot_cpu_is_bsp_init().
Test
- built with and without CONFIG_LOCAL_APIC
- tested x86_64 in case of acpi and MP table
---
HATAYAMA Daisuke (2):
x86, apic: Add boot_cpu_is_bsp() to check if boot cpu is BSP
x86, apic: Disable BSP if boot cpu is AP
arch/x86/include/asm/mpspec.h | 9 ++++++++-
arch/x86/kernel/acpi/boot.c | 6 +++++-
arch/x86/kernel/apic/apic.c | 34 +++++++++++++++++++++++++++++++++-
arch/x86/kernel/devicetree.c | 1 +
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/kernel/setup.c | 2 ++
arch/x86/platform/sfi/sfi.c | 2 +-
7 files changed, 63 insertions(+), 6 deletions(-)
--
Thanks.
HATAYAMA, Daisuke
Currently, on x86 architecture, if crash happens on AP in the kdump
1st kernel, the 2nd kernel fails to wake up multiple CPUs. The typical
behaviour we actually see is immediate system reset or hang.
This comes from the hardware specification that the processor with BSP
flag is jumped at BIOS init code when receiving INIT; the behaviour we
then see depends on the init code.
This never happens if we use only one cpu in the 2nd kernel. So, we
have avoided the issue by the workaround that specifying maxcpus=1 or
nr_cpus=1 in kernel parameter of the 2nd kernel.
In order to address the issue, this patch disables BSP if boot cpu is
an AP, and thus we don't try to wake up the BSP by sending INIT.
Before this idea we discussed the following two ideas but we cannot
adopt them in each reasons:
1. Switch CPU from AP to BSP via IPI NMI at crash in the 1st kernel
This is done in the kdump crash path where logic is in inconsistent
state. Any part of memory can be corrupted, including
hardware-related table being accessed for example when paging is
performed or interruption is performed.
2. Unset BSP flag of the boot cpu in the 1st kernel
Unsetting BSP flag can affect some real world firmware badly. For
example, Ma verified that some HP systems fail to reboot under this
configuration. See:
http://lkml.indiana.edu/hypermail/linux/kernel/1308.1/03574.html
Due to the idea 1, we have to address the issue in the 2nd kernel on
AP. Then, it's impossible to know which CPU is BSP by rdmsr
instruction because the CPU is the one we are now trying to wake
up. From the same reason, it's also impossible to unset BSP flag of
the BSP by wrmsr instruction.
Next, due to the idea 2, BSP is halting in the 1st kernel while
keeping BSP flag set (or possibly could be running somewhere in
catastrophic state.) In generall, CPUs except for the boot cpu in the
2nd kernel -- the cpu under which crash happened --- can be thought of
as remaining in any inconsistent state in the 1st kernel. For APs,
it's possible to recover sane state by initiating INIT to them; see
3.7.3 Processor-specific INIT in MultiProcessor
specification. However, there's no way for BSP. Therefore, there's no
other way to disable BSP.
My motivation is to generate crash dump quickly on the system with
huge memory. We can assume such system also has a lot of N-cpus and
(N-1)-cpus are still available.
To identify which CPU is BSP, we lookup ACPI table or MP table. One
concern is that ACPI guidlines BIOS *should* list the BSP in the first
MADT LAPIC entry; not *must*. In this sense, this logic relis on BIOS
following ACPI's guideline. On the other hand, we don't need to worry
about this in MP table case because it has explit BSP flag.
To avoid any undesirable bahaviour caused by any broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpu=1 or nr_cpus=1, as is currently done in
default kdump configuration. (But of course, it's problematic in
maxcpu=1 case if trying to wake up other cpus later in user space.)
SFI and devicetree doesn't provide BSP information, so there's no
functionality change in their codes, only assigning false for all the
entries, keeping interface uniform.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
arch/x86/include/asm/mpspec.h | 2 +-
arch/x86/kernel/acpi/boot.c | 6 +++++-
arch/x86/kernel/apic/apic.c | 18 +++++++++++++++++-
arch/x86/kernel/devicetree.c | 1 +
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/platform/sfi/sfi.c | 2 +-
6 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 54d5f98..b5014cd 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -101,7 +101,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
#define default_get_smp_config x86_init_uint_noop
#endif
-void generic_processor_info(int apicid, int version);
+void generic_processor_info(int apicid, bool isbsp, int version);
#ifdef CONFIG_ACPI
extern void mp_register_ioapic(int id, u32 address, u32 gsi_base);
extern void mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 40c7660..7944d3f 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -192,6 +192,7 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
static void acpi_register_lapic(int id, u8 enabled)
{
unsigned int ver = 0;
+ bool isbsp = false;
if (id >= MAX_LOCAL_APIC) {
printk(KERN_INFO PREFIX "skipped apicid that is too big\n");
@@ -206,7 +207,10 @@ static void acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
- generic_processor_info(id, ver);
+ if (!num_processors && !disabled_cpus)
+ isbsp = true;
+
+ generic_processor_info(id, isbsp, ver);
}
static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 62ee365..2e4e036 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2113,13 +2113,29 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
-void generic_processor_info(int apicid, int version)
+void generic_processor_info(int apicid, bool isbsp, int version)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);
/*
+ * If boot cpu is AP, we now don't have any way to initialize
+ * BSP. To save memory consumed, we disable BSP this case and
+ * use (N - 1)-cpus.
+ */
+ if (isbsp && !boot_cpu_is_bsp) {
+ int thiscpu = num_processors + disabled_cpus;
+
+ pr_warning("ACPI: The boot cpu is not BSP. "
+ "The BSP Processor %d/0x%x ignored.\n",
+ thiscpu, apicid);
+
+ disabled_cpus++;
+ return;
+ }
+
+ /*
* If boot cpu has not been detected yet, then only allow upto
* nr_cpu_ids - 1 processors and keep one slot free for boot cpu
*/
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index 376dc78..0ed2da8 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -182,6 +182,7 @@ static void __init dtb_lapic_setup(void)
pic_mode = 1;
register_lapic_address(r.start);
generic_processor_info(boot_cpu_physical_apicid,
+ false,
GET_APIC_VERSION(apic_read(APIC_LVR)));
#endif
}
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index d2b5648..003b07fe 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -54,6 +54,7 @@ static void __init MP_processor_info(struct mpc_cpu *m)
{
int apicid;
char *bootup_cpu = "";
+ bool isbsp = false;
if (!(m->cpuflag & CPU_ENABLED)) {
disabled_cpus++;
@@ -64,11 +65,21 @@ static void __init MP_processor_info(struct mpc_cpu *m)
if (m->cpuflag & CPU_BOOTPROCESSOR) {
bootup_cpu = " (Bootup-CPU)";
- boot_cpu_physical_apicid = m->apicid;
+ /*
+ * boot cpu cannot be BSP if any crash happens on AP
+ * and kexec enters the 2nd kernel.
+ *
+ * Also, boot_cpu_physical_apicid can be initialized
+ * before reaching here; for example, in
+ * register_lapic_address().
+ */
+ if (boot_cpu_is_bsp && boot_cpu_physical_apicid == -1U)
+ boot_cpu_physical_apicid = m->apicid;
+ isbsp = true;
}
printk(KERN_INFO "Processor #%d%s\n", m->apicid, bootup_cpu);
- generic_processor_info(apicid, m->apicver);
+ generic_processor_info(apicid, isbsp, m->apicver);
}
#ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/platform/sfi/sfi.c b/arch/x86/platform/sfi/sfi.c
index bcd1a70..4f111c7 100644
--- a/arch/x86/platform/sfi/sfi.c
+++ b/arch/x86/platform/sfi/sfi.c
@@ -45,7 +45,7 @@ static void __init mp_sfi_register_lapic(u8 id)
pr_info("registering lapic[%d]\n", id);
- generic_processor_info(id, GET_APIC_VERSION(apic_read(APIC_LVR)));
+ generic_processor_info(id, false, GET_APIC_VERSION(apic_read(APIC_LVR)));
}
static int __init sfi_parse_cpus(struct sfi_table_header *table)
Kexec can enter the kdump 2nd kernel on AP if crash happens on AP. To
check if boot cpu is BSP, introduce a helper function
boot_cpu_is_bsp().
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
arch/x86/include/asm/mpspec.h | 7 +++++++
arch/x86/kernel/apic/apic.c | 16 ++++++++++++++++
arch/x86/kernel/setup.c | 2 ++
3 files changed, 25 insertions(+)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 626cf70..54d5f98 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,11 +47,18 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
extern unsigned int boot_cpu_physical_apicid;
+extern bool boot_cpu_is_bsp;
extern unsigned int max_physical_apicid;
extern int mpc_default_type;
extern unsigned long mp_lapic_addr;
#ifdef CONFIG_X86_LOCAL_APIC
+extern void boot_cpu_is_bsp_init(void);
+#else
+static inline void boot_cpu_is_bsp_init(void) { };
+#endif
+
+#ifdef CONFIG_X86_LOCAL_APIC
extern int smp_found_config;
#else
# define smp_found_config 0
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a7eb82d..62ee365 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -64,6 +64,12 @@ unsigned disabled_cpus;
unsigned int boot_cpu_physical_apicid = -1U;
/*
+ * Indicates whether the processor that is doing the boot up, is BSP
+ * processor or not.
+ */
+bool boot_cpu_is_bsp;
+
+/*
* The highest APIC ID seen during enumeration.
*/
unsigned int max_physical_apicid;
@@ -2589,3 +2595,13 @@ static int __init lapic_insert_resource(void)
* that is using request_resource
*/
late_initcall(lapic_insert_resource);
+
+void __init boot_cpu_is_bsp_init(void)
+{
+ if (cpu_has_apic) {
+ u32 l, h;
+
+ rdmsr_safe(MSR_IA32_APICBASE, &l, &h);
+ boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
+ }
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..a30bc06 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
early_quirks();
+ boot_cpu_is_bsp_init();
+
/*
* Read APIC and some other early information from ACPI tables.
*/
On Tue, Oct 15, 2013 at 02:43:22PM +0900, HATAYAMA Daisuke wrote:
> Kexec can enter the kdump 2nd kernel on AP if crash happens on AP. To
> check if boot cpu is BSP, introduce a helper function
> boot_cpu_is_bsp().
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
> arch/x86/include/asm/mpspec.h | 7 +++++++
> arch/x86/kernel/apic/apic.c | 16 ++++++++++++++++
> arch/x86/kernel/setup.c | 2 ++
> 3 files changed, 25 insertions(+)
>
> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
> index 626cf70..54d5f98 100644
> --- a/arch/x86/include/asm/mpspec.h
> +++ b/arch/x86/include/asm/mpspec.h
> @@ -47,11 +47,18 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
> extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
>
> extern unsigned int boot_cpu_physical_apicid;
> +extern bool boot_cpu_is_bsp;
> extern unsigned int max_physical_apicid;
> extern int mpc_default_type;
> extern unsigned long mp_lapic_addr;
>
> #ifdef CONFIG_X86_LOCAL_APIC
> +extern void boot_cpu_is_bsp_init(void);
> +#else
> +static inline void boot_cpu_is_bsp_init(void) { };
> +#endif
> +
> +#ifdef CONFIG_X86_LOCAL_APIC
> extern int smp_found_config;
> #else
> # define smp_found_config 0
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index a7eb82d..62ee365 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -64,6 +64,12 @@ unsigned disabled_cpus;
> unsigned int boot_cpu_physical_apicid = -1U;
>
> /*
[..]
> + * Indicates whether the processor that is doing the boot up, is BSP
> + * processor or not.
> + */
> +bool boot_cpu_is_bsp;
Should we set it to true by default? I think in most of the cases boot cpu
is going to be bsp too?
> +
> +/*
> * The highest APIC ID seen during enumeration.
> */
> unsigned int max_physical_apicid;
> @@ -2589,3 +2595,13 @@ static int __init lapic_insert_resource(void)
> * that is using request_resource
> */
> late_initcall(lapic_insert_resource);
> +
> +void __init boot_cpu_is_bsp_init(void)
> +{
> + if (cpu_has_apic) {
> + u32 l, h;
> +
> + rdmsr_safe(MSR_IA32_APICBASE, &l, &h);
> + boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
I came across following thread.
https://lkml.org/lkml/2012/4/18/370
Can we hit above read msr on old P5 class machines? Or is it safe to
call unconditionally.
Thanks
Vivek
On Tue, Oct 15, 2013 at 02:43:27PM +0900, HATAYAMA Daisuke wrote:
> Currently, on x86 architecture, if crash happens on AP in the kdump
> 1st kernel, the 2nd kernel fails to wake up multiple CPUs. The typical
> behaviour we actually see is immediate system reset or hang.
>
> This comes from the hardware specification that the processor with BSP
> flag is jumped at BIOS init code when receiving INIT; the behaviour we
> then see depends on the init code.
>
> This never happens if we use only one cpu in the 2nd kernel. So, we
> have avoided the issue by the workaround that specifying maxcpus=1 or
> nr_cpus=1 in kernel parameter of the 2nd kernel.
>
> In order to address the issue, this patch disables BSP if boot cpu is
> an AP, and thus we don't try to wake up the BSP by sending INIT.
>
> Before this idea we discussed the following two ideas but we cannot
> adopt them in each reasons:
>
> 1. Switch CPU from AP to BSP via IPI NMI at crash in the 1st kernel
>
> This is done in the kdump crash path where logic is in inconsistent
> state. Any part of memory can be corrupted, including
> hardware-related table being accessed for example when paging is
> performed or interruption is performed.
>
> 2. Unset BSP flag of the boot cpu in the 1st kernel
>
> Unsetting BSP flag can affect some real world firmware badly. For
> example, Ma verified that some HP systems fail to reboot under this
> configuration. See:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1308.1/03574.html
>
> Due to the idea 1, we have to address the issue in the 2nd kernel on
> AP. Then, it's impossible to know which CPU is BSP by rdmsr
> instruction because the CPU is the one we are now trying to wake
> up. From the same reason, it's also impossible to unset BSP flag of
> the BSP by wrmsr instruction.
>
> Next, due to the idea 2, BSP is halting in the 1st kernel while
> keeping BSP flag set (or possibly could be running somewhere in
> catastrophic state.) In generall, CPUs except for the boot cpu in the
> 2nd kernel -- the cpu under which crash happened --- can be thought of
> as remaining in any inconsistent state in the 1st kernel. For APs,
> it's possible to recover sane state by initiating INIT to them; see
> 3.7.3 Processor-specific INIT in MultiProcessor
> specification. However, there's no way for BSP. Therefore, there's no
> other way to disable BSP.
>
> My motivation is to generate crash dump quickly on the system with
> huge memory. We can assume such system also has a lot of N-cpus and
> (N-1)-cpus are still available.
>
> To identify which CPU is BSP, we lookup ACPI table or MP table. One
> concern is that ACPI guidlines BIOS *should* list the BSP in the first
> MADT LAPIC entry; not *must*. In this sense, this logic relis on BIOS
> following ACPI's guideline. On the other hand, we don't need to worry
> about this in MP table case because it has explit BSP flag.
>
> To avoid any undesirable bahaviour caused by any broken BIOS that
> doesn't conform to the guideline, it's enough to limit the number of
> cpus to 1 by specifying maxcpu=1 or nr_cpus=1, as is currently done in
> default kdump configuration. (But of course, it's problematic in
> maxcpu=1 case if trying to wake up other cpus later in user space.)
>
> SFI and devicetree doesn't provide BSP information, so there's no
> functionality change in their codes, only assigning false for all the
> entries, keeping interface uniform.
Hi Hatayama,
So we rely on ACPI reporting BSP properly. And SFI and device tree does
not provide BSP info. So for those cases situations where BSP is not
reported, situaiton is little dicy. We might try to bring up those cpus
and bring down the system.
I am wondering if there is any attribute of cpu which we can pass to
second kernel on command line. And tell second kernel not to bring up
that specific cpu. (Say exclude_cpu=<cpu_attr>)? If this works, then
if ACPI or other mechanism don't report BSP, we could possibly assume
that cpu 0 is BSP and ask second kernel to not try to boot it.
Thanks
Vivek
(2013/10/16 4:12), Vivek Goyal wrote:
> On Tue, Oct 15, 2013 at 02:43:22PM +0900, HATAYAMA Daisuke wrote:
>> Kexec can enter the kdump 2nd kernel on AP if crash happens on AP. To
>> check if boot cpu is BSP, introduce a helper function
>> boot_cpu_is_bsp().
>>
>> Signed-off-by: HATAYAMA Daisuke <[email protected]>
>> ---
>> arch/x86/include/asm/mpspec.h | 7 +++++++
>> arch/x86/kernel/apic/apic.c | 16 ++++++++++++++++
>> arch/x86/kernel/setup.c | 2 ++
>> 3 files changed, 25 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
>> index 626cf70..54d5f98 100644
>> --- a/arch/x86/include/asm/mpspec.h
>> +++ b/arch/x86/include/asm/mpspec.h
>> @@ -47,11 +47,18 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
>> extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
>>
>> extern unsigned int boot_cpu_physical_apicid;
>> +extern bool boot_cpu_is_bsp;
>> extern unsigned int max_physical_apicid;
>> extern int mpc_default_type;
>> extern unsigned long mp_lapic_addr;
>>
>> #ifdef CONFIG_X86_LOCAL_APIC
>> +extern void boot_cpu_is_bsp_init(void);
>> +#else
>> +static inline void boot_cpu_is_bsp_init(void) { };
>> +#endif
>> +
>> +#ifdef CONFIG_X86_LOCAL_APIC
>> extern int smp_found_config;
>> #else
>> # define smp_found_config 0
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index a7eb82d..62ee365 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -64,6 +64,12 @@ unsigned disabled_cpus;
>> unsigned int boot_cpu_physical_apicid = -1U;
>>
>> /*
>
> [..]
>> + * Indicates whether the processor that is doing the boot up, is BSP
>> + * processor or not.
>> + */
>> +bool boot_cpu_is_bsp;
>
> Should we set it to true by default? I think in most of the cases boot cpu
> is going to be bsp too?
>
Agreed. Most likely value should be default.
The reason why I wrote so would be that -- if there's reason -- I wanted to
write it uniform to other variables around it and wanted to avoid to let it
have static storage in binary file.
>> +
>> +/*
>> * The highest APIC ID seen during enumeration.
>> */
>> unsigned int max_physical_apicid;
>> @@ -2589,3 +2595,13 @@ static int __init lapic_insert_resource(void)
>> * that is using request_resource
>> */
>> late_initcall(lapic_insert_resource);
>> +
>> +void __init boot_cpu_is_bsp_init(void)
>> +{
>> + if (cpu_has_apic) {
>> + u32 l, h;
>> +
>> + rdmsr_safe(MSR_IA32_APICBASE, &l, &h);
>> + boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
>
> I came across following thread.
>
> https://lkml.org/lkml/2012/4/18/370
>
> Can we hit above read msr on old P5 class machines? Or is it safe to
> call unconditionally.
>
No, it's dangerous to cause #UD, and current implementation doesn't check
exception value returned by rdmsr_safe. It's meaningless to call rdmsr_safe.
At least, checking boot_cpu_data.x86 >= 6 satisfies support of IA32_APIC_BASE
MSR and this at the same time satisfies support of rdmsr instruction since the
instruction was introduced at Pentium processor. So,
if (boot_cpu_data.x86 >= 6 && cpu_has_apic()) {
u32 l, h;
rdmsr(MSR_IA32_APICBASE, &l, &h);
boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
}
--
Thanks.
HATAYAMA, Daisuke
(2013/10/16 4:30), Vivek Goyal wrote:
> On Tue, Oct 15, 2013 at 02:43:27PM +0900, HATAYAMA Daisuke wrote:
>> Currently, on x86 architecture, if crash happens on AP in the kdump
>> 1st kernel, the 2nd kernel fails to wake up multiple CPUs. The typical
>> behaviour we actually see is immediate system reset or hang.
>>
>> This comes from the hardware specification that the processor with BSP
>> flag is jumped at BIOS init code when receiving INIT; the behaviour we
>> then see depends on the init code.
>>
>> This never happens if we use only one cpu in the 2nd kernel. So, we
>> have avoided the issue by the workaround that specifying maxcpus=1 or
>> nr_cpus=1 in kernel parameter of the 2nd kernel.
>>
>> In order to address the issue, this patch disables BSP if boot cpu is
>> an AP, and thus we don't try to wake up the BSP by sending INIT.
>>
>> Before this idea we discussed the following two ideas but we cannot
>> adopt them in each reasons:
>>
>> 1. Switch CPU from AP to BSP via IPI NMI at crash in the 1st kernel
>>
>> This is done in the kdump crash path where logic is in inconsistent
>> state. Any part of memory can be corrupted, including
>> hardware-related table being accessed for example when paging is
>> performed or interruption is performed.
>>
>> 2. Unset BSP flag of the boot cpu in the 1st kernel
>>
>> Unsetting BSP flag can affect some real world firmware badly. For
>> example, Ma verified that some HP systems fail to reboot under this
>> configuration. See:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/1308.1/03574.html
>>
>> Due to the idea 1, we have to address the issue in the 2nd kernel on
>> AP. Then, it's impossible to know which CPU is BSP by rdmsr
>> instruction because the CPU is the one we are now trying to wake
>> up. From the same reason, it's also impossible to unset BSP flag of
>> the BSP by wrmsr instruction.
>>
>> Next, due to the idea 2, BSP is halting in the 1st kernel while
>> keeping BSP flag set (or possibly could be running somewhere in
>> catastrophic state.) In generall, CPUs except for the boot cpu in the
>> 2nd kernel -- the cpu under which crash happened --- can be thought of
>> as remaining in any inconsistent state in the 1st kernel. For APs,
>> it's possible to recover sane state by initiating INIT to them; see
>> 3.7.3 Processor-specific INIT in MultiProcessor
>> specification. However, there's no way for BSP. Therefore, there's no
>> other way to disable BSP.
>>
>> My motivation is to generate crash dump quickly on the system with
>> huge memory. We can assume such system also has a lot of N-cpus and
>> (N-1)-cpus are still available.
>>
>> To identify which CPU is BSP, we lookup ACPI table or MP table. One
>> concern is that ACPI guidlines BIOS *should* list the BSP in the first
>> MADT LAPIC entry; not *must*. In this sense, this logic relis on BIOS
>> following ACPI's guideline. On the other hand, we don't need to worry
>> about this in MP table case because it has explit BSP flag.
>>
>> To avoid any undesirable bahaviour caused by any broken BIOS that
>> doesn't conform to the guideline, it's enough to limit the number of
>> cpus to 1 by specifying maxcpu=1 or nr_cpus=1, as is currently done in
>> default kdump configuration. (But of course, it's problematic in
>> maxcpu=1 case if trying to wake up other cpus later in user space.)
>>
>> SFI and devicetree doesn't provide BSP information, so there's no
>> functionality change in their codes, only assigning false for all the
>> entries, keeping interface uniform.
>
> Hi Hatayama,
>
> So we rely on ACPI reporting BSP properly. And SFI and device tree does
> not provide BSP info. So for those cases situations where BSP is not
> reported, situaiton is little dicy. We might try to bring up those cpus
> and bring down the system.
>
Yes, I intend that. If there's no BIOS facility reporting BSP information
in the system, max_cpus=1 or nrcpus=1 should be specified just as so far.
> I am wondering if there is any attribute of cpu which we can pass to
> second kernel on command line. And tell second kernel not to bring up
> that specific cpu. (Say exclude_cpu=<cpu_attr>)? If this works, then
> if ACPI or other mechanism don't report BSP, we could possibly assume
> that cpu 0 is BSP and ask second kernel to not try to boot it.
>
I've come up with similar idea. If there's such kernel option, rest of
the processing can be implemented in user-land, i.e., get apicid of
BSP from /proc/cpuid and set it in kernel command line of 2nd kernel.
What kexec-tools should do on fedora/RHEL? Also, this idea covers SFI
and device tree.
The reason why I didn't choose such idea was first passing the value
via command-line seems rather ad-hoc. The second reason is that in any
case it's compromised design. Rigorously, we cannot get correct mapping
of apicid to {BSP, APIC} at the 1st kernel. That is, there's a class of
the bugs that affect BSP flag of each processor. For example, on
catastrophic state, all the cpus can have BSP flag on the 2nd kernel due
to wrmsr instructions generated by the bug causing crash. In this sense,
current implementation is less reliable than max_cpus=1 case.
If addressing this rigorously, for example, we need to check status of
BSP flag between 1st kernel and 2nd kernel to keep processor with BSP
flag unique, exclude cpus in catastrophic state that are not checked,
and to tell the 2nd kernel which cpu can be wake up.
This is not simple so I've avoided in current implementation.
--
Thanks.
HATAYAMA, Daisuke
On Wed, Oct 16, 2013 at 10:26:44AM +0900, HATAYAMA Daisuke wrote:
[..]
> >I am wondering if there is any attribute of cpu which we can pass to
> >second kernel on command line. And tell second kernel not to bring up
> >that specific cpu. (Say exclude_cpu=<cpu_attr>)? If this works, then
> >if ACPI or other mechanism don't report BSP, we could possibly assume
> >that cpu 0 is BSP and ask second kernel to not try to boot it.
> >
>
> I've come up with similar idea. If there's such kernel option, rest of
> the processing can be implemented in user-land, i.e., get apicid of
> BSP from /proc/cpuid and set it in kernel command line of 2nd kernel.
> What kexec-tools should do on fedora/RHEL? Also, this idea covers SFI
> and device tree.
>
> The reason why I didn't choose such idea was first passing the value
> via command-line seems rather ad-hoc.
We do so many things using command line. So telling kernel not to boot
certain cpus seems ok to me.
> The second reason is that in any
> case it's compromised design. Rigorously, we cannot get correct mapping
> of apicid to {BSP, APIC} at the 1st kernel. That is, there's a class of
> the bugs that affect BSP flag of each processor. For example, on
> catastrophic state, all the cpus can have BSP flag on the 2nd kernel due
> to wrmsr instructions generated by the bug causing crash. In this sense,
> current implementation is less reliable than max_cpus=1 case.
>
> If addressing this rigorously, for example, we need to check status of
> BSP flag between 1st kernel and 2nd kernel to keep processor with BSP
> flag unique, exclude cpus in catastrophic state that are not checked,
> and to tell the 2nd kernel which cpu can be wake up.
Ok, for the time being let us not do any comparision with maxcpus=1 or
nr_cpus=1 because we know that's the most robust thing to do.
For the case where we want to bring up more than one cpu in second kernel,
there seems to be two problems.
- ACPI tables or other tables might not report which is BSP. In that
case we might try to bring up BSP and crash the system.
- Due to malicious wrmsr, more than one cpu might claim being BSP. In that
case the cpu we are crashing on will think it is BSP and it can safely
bring up other cpus.
If we start sending a mask of cpus which should not be brought up in
second kernel, then it would not matter whether BSP flag in MSR is set
or not. Isn't it? And that will solve the second issue.
And if ACPI tables don't report which one is BSP, user space can first
try to look at BSP flags of processors (may be this can be reported
in /proc/cpuinfo?) and if no one has BSP flag set, then assume cpu 0
is BSP.
So to me it looks like passing which cpus to not bring up to second kernel
is more resilient approach. Isn't it?
Thanks
Vivek
(2013/10/19 2:36), Vivek Goyal wrote:
> On Wed, Oct 16, 2013 at 10:26:44AM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>>> I am wondering if there is any attribute of cpu which we can pass to
>>> second kernel on command line. And tell second kernel not to bring up
>>> that specific cpu. (Say exclude_cpu=<cpu_attr>)? If this works, then
>>> if ACPI or other mechanism don't report BSP, we could possibly assume
>>> that cpu 0 is BSP and ask second kernel to not try to boot it.
>>>
>>
>> I've come up with similar idea. If there's such kernel option, rest of
>> the processing can be implemented in user-land, i.e., get apicid of
>> BSP from /proc/cpuid and set it in kernel command line of 2nd kernel.
>> What kexec-tools should do on fedora/RHEL? Also, this idea covers SFI
>> and device tree.
>>
>> The reason why I didn't choose such idea was first passing the value
>> via command-line seems rather ad-hoc.
>
> We do so many things using command line. So telling kernel not to boot
> certain cpus seems ok to me.
>
>> The second reason is that in any
>> case it's compromised design. Rigorously, we cannot get correct mapping
>> of apicid to {BSP, APIC} at the 1st kernel. That is, there's a class of
>> the bugs that affect BSP flag of each processor. For example, on
>> catastrophic state, all the cpus can have BSP flag on the 2nd kernel due
>> to wrmsr instructions generated by the bug causing crash. In this sense,
>> current implementation is less reliable than max_cpus=1 case.
>>
>> If addressing this rigorously, for example, we need to check status of
>> BSP flag between 1st kernel and 2nd kernel to keep processor with BSP
>> flag unique, exclude cpus in catastrophic state that are not checked,
>> and to tell the 2nd kernel which cpu can be wake up.
>
> Ok, for the time being let us not do any comparision with maxcpus=1 or
> nr_cpus=1 because we know that's the most robust thing to do.
>
> For the case where we want to bring up more than one cpu in second kernel,
> there seems to be two problems.
>
> - ACPI tables or other tables might not report which is BSP. In that
> case we might try to bring up BSP and crash the system.
>
> - Due to malicious wrmsr, more than one cpu might claim being BSP. In that
> case the cpu we are crashing on will think it is BSP and it can safely
> bring up other cpus.
>
> If we start sending a mask of cpus which should not be brought up in
> second kernel, then it would not matter whether BSP flag in MSR is set
> or not. Isn't it? And that will solve the second issue.
>
No. As long as the mask is created in the 1st kernel, mapping between CPUs
and {BSP, AP} could get changed at crash. So, the ``mask'' idea never
improves reliability.
To obtain complete reliability without any hardware support to get mapping
between all the CPUS and {BSP, AP}, we must create such mask after crash,
i.e., between the 1st and 2nd kernel such as purgatory or other new phase.
The idea is, for example, that let crashing AP wait for other CPUs in purgatory
until specified number of CPUs reach there or until a certain limit time passes
in case no other CPUs reach there in catastrophic state, and let even the other
CPUs except for the crashing AP go into purgatory, not halt just as the current
implementation, to let them check the mask to represent they can be safely
woken up in the 2nd kernel and then let them halt in the purgatory until they
or part of them are woken up from the 2nd kernel.
> And if ACPI tables don't report which one is BSP, user space can first
> try to look at BSP flags of processors (may be this can be reported
> in /proc/cpuinfo?) and if no one has BSP flag set, then assume cpu 0
> is BSP.
>
> So to me it looks like passing which cpus to not bring up to second kernel
> is more resilient approach. Isn't it?
>
Yes. Though reliability is similar to the current approach, user-space approach
is better in that it doesn't depend on what kind of BIOS tables are present
in the system. Also, the idea is more general and could be applied to other
purposes, I don't know exactly what it is; disabling some part of CPUs might
be useful for the purpose of some kind of debugging?
I'll post new version later.
--
Thanks.
HATAYAMA, Daisuke