This is the patch series to address the issue that kdump 2nd kernel
now fails to wake up multiple CPUs.
This is based on 3.11-rc7.
I tested this patch series on x86. I used 2 cpus by specifying
nr_cpus=2 for the 2nd kernel. I checked both ACPI MADT and MP table
case; the former is default on my system and for the latter I did
acpi=off. I crashed the system both on BSP and AP by taskset -c {0,1}
sh -c "echo c > /proc/sysrq-trigger".
---
HATAYAMA Daisuke (2):
x86, apic: Add boot_cpu_is_bsp() to check if boot cpu is BSP
x86, apic: Disable BSP if boot cpu is AP
arch/x86/include/asm/mpspec.h | 5 ++++-
arch/x86/kernel/acpi/boot.c | 11 ++++++++++-
arch/x86/kernel/apic/apic.c | 32 +++++++++++++++++++++++++++++++-
arch/x86/kernel/devicetree.c | 1 +
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/kernel/setup.c | 2 ++
arch/x86/platform/sfi/sfi.c | 2 +-
7 files changed, 62 insertions(+), 6 deletions(-)
--
Thanks.
HATAYAMA, Daisuke
Currently, on x86 architecture, if crash happens on AP in the kdump
1st kernel, the 2nd kernel fails to wake up multiple CPUs. The typical
behaviour we actually see is immediate system reset or hang.
This comes from the hardware specification that the processor with BSP
flag is jumped at BIOS init code when receiving INIT; the behaviour we
then see depends on the init code.
This never happens if we use only one cpu in the 2nd kernel. So, we
have avoided the issue by the workaround that specifying maxcpus=1 or
nr_cpus=1 in kernel parameter of the 2nd kernel.
In order to address the issue, this patch disables BSP if boot cpu is
AP. Then, there's no longer the BSP. There's no longer possibility to
send INIT to the BSP.
Before this idea we discussed the following two ideas but we cannot
adopt them in each reasons:
1. Switch CPU from AP to BSP via IPI NMI at crash in the 1st kernel
This is done in the kdump crash path where logic is in
inconsistent state. Any part of memory can be corrupted, including
hardware-related table being accessed for example when paging is
performed or interruption is performed.
2. Unset BSP flag of the boot cpu in the 1st kernel
Unsetting BSP flag can affect some real world firmware badly. For
example, Ma verified that some HP systems fail to reboot under
this configuration. See:
http://lkml.indiana.edu/hypermail/linux/kernel/1308.1/03574.html
Due to the idea 1, we have to address the issue in the 2nd kernel on
AP. Then, it's impossible to know which CPU is BSP by rdmsr
instruction because the CPU is the one we are now trying to wake
up. From the same reason, it's also impossible to unset BSP flag of
the BSP by wrmsr instruction.
Next, due to the idea 2, BSP is halting in the 1st kernel while
keeping BSP flag set (or possibly could be running somewhere in
catastrophic state.) In generall, CPUs except for the boot cpu in the
2nd kernel -- the cpu under which crash happened --- can be thought of
as remaining in any inconsistent state in the 1st kernel. For APs,
it's possible to recover sane state by initiating INIT to them; see
3.7.3 Processor-specific INIT in MultiProcessor
specification. However, there's no way for BSP. Therefore, there's no
other way to disable BSP.
My motivation is to generate crash dump quickly on the system with
huge memory. We can assume such system also has a lot of N-cpus and
(N-1)-cpus are still available.
To identify which CPU is BSP, we lookup ACPI table or MP table. One
concern is that ACPI guidlines BIOS *should* list the BSP in the first
MADT LAPIC entry; not *must*. In this sense, this logic relis on BIOS
following ACPI's guideline. On the other hand, we don't need to worry
about this in MP table case because it has explit BSP flag.
To avoid any undesirable bahaviour caused by any broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpu=1 or nr_cpus=1, as is currently done in
default kdump configuration. (But of course, it's problematic in
maxcpu=1 case if trying to wake up other cpus later in user space.)
SFI and devicetree doesn't provide BSP information, so there's no
functionality change in their codes, only assigning false for all the
entries, keeping interface uniform.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
arch/x86/include/asm/mpspec.h | 2 +-
arch/x86/kernel/acpi/boot.c | 11 ++++++++++-
arch/x86/kernel/apic/apic.c | 18 +++++++++++++++++-
arch/x86/kernel/devicetree.c | 1 +
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/platform/sfi/sfi.c | 2 +-
6 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index a8a4338..d96f409 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -97,7 +97,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
#define default_get_smp_config x86_init_uint_noop
#endif
-void generic_processor_info(int apicid, int version);
+void generic_processor_info(int apicid, bool isbsp, int version);
#ifdef CONFIG_ACPI
extern void mp_register_ioapic(int id, u32 address, u32 gsi_base);
extern void mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2627a81..78d95ec 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -198,6 +198,7 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
static void acpi_register_lapic(int id, u8 enabled)
{
unsigned int ver = 0;
+ bool isbsp = false;
if (id >= (MAX_LOCAL_APIC-1)) {
printk(KERN_INFO PREFIX "skipped apicid that is too big\n");
@@ -212,7 +213,15 @@ static void acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
- generic_processor_info(id, ver);
+ /*
+ * ACPI specification describes that platform firmware should
+ * list the boot processor as the first LAPIC entry in the
+ * MADT.
+ */
+ if (!num_processors && !disabled_cpus)
+ isbsp = true;
+
+ generic_processor_info(id, isbsp, ver);
}
static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66cab35..fd969d1 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2113,13 +2113,29 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
-void generic_processor_info(int apicid, int version)
+void generic_processor_info(int apicid, bool isbsp, int version)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);
/*
+ * If boot cpu is AP, we now don't have any way to initialize
+ * BSP. To save memory consumed, we disable BSP this case and
+ * use (N-1)-cpus.
+ */
+ if (isbsp && !boot_cpu_is_bsp) {
+ int thiscpu = num_processors + disabled_cpus;
+
+ pr_warning("ACPI: The boot cpu is not BSP. "
+ "The BSP Processor %d/0x%x ignored.\n",
+ thiscpu, apicid);
+
+ disabled_cpus++;
+ return;
+ }
+
+ /*
* If boot cpu has not been detected yet, then only allow upto
* nr_cpu_ids - 1 processors and keep one slot free for boot cpu
*/
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index 69eb2fa..fe4e39a 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -183,6 +183,7 @@ static void __init dtb_lapic_setup(void)
pic_mode = 1;
register_lapic_address(r.start);
generic_processor_info(boot_cpu_physical_apicid,
+ false,
GET_APIC_VERSION(apic_read(APIC_LVR)));
#endif
}
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index d2b5648..003b07fe 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -54,6 +54,7 @@ static void __init MP_processor_info(struct mpc_cpu *m)
{
int apicid;
char *bootup_cpu = "";
+ bool isbsp = false;
if (!(m->cpuflag & CPU_ENABLED)) {
disabled_cpus++;
@@ -64,11 +65,21 @@ static void __init MP_processor_info(struct mpc_cpu *m)
if (m->cpuflag & CPU_BOOTPROCESSOR) {
bootup_cpu = " (Bootup-CPU)";
- boot_cpu_physical_apicid = m->apicid;
+ /*
+ * boot cpu cannot be BSP if any crash happens on AP
+ * and kexec enters the 2nd kernel.
+ *
+ * Also, boot_cpu_physical_apicid can be initialized
+ * before reaching here; for example, in
+ * register_lapic_address().
+ */
+ if (boot_cpu_is_bsp && boot_cpu_physical_apicid == -1U)
+ boot_cpu_physical_apicid = m->apicid;
+ isbsp = true;
}
printk(KERN_INFO "Processor #%d%s\n", m->apicid, bootup_cpu);
- generic_processor_info(apicid, m->apicver);
+ generic_processor_info(apicid, isbsp, m->apicver);
}
#ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/platform/sfi/sfi.c b/arch/x86/platform/sfi/sfi.c
index bcd1a70..4f111c7 100644
--- a/arch/x86/platform/sfi/sfi.c
+++ b/arch/x86/platform/sfi/sfi.c
@@ -45,7 +45,7 @@ static void __init mp_sfi_register_lapic(u8 id)
pr_info("registering lapic[%d]\n", id);
- generic_processor_info(id, GET_APIC_VERSION(apic_read(APIC_LVR)));
+ generic_processor_info(id, false, GET_APIC_VERSION(apic_read(APIC_LVR)));
}
static int __init sfi_parse_cpus(struct sfi_table_header *table)
Kexec can enter the kdump 2nd kernel on AP if crash happens on AP. To
check if boot cpu is BSP, introduce a helper function
boot_cpu_is_bsp().
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
arch/x86/include/asm/mpspec.h | 3 +++
arch/x86/kernel/apic/apic.c | 14 ++++++++++++++
arch/x86/kernel/setup.c | 2 ++
3 files changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 626cf70..a8a4338 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,10 +47,13 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
extern unsigned int boot_cpu_physical_apicid;
+extern bool boot_cpu_is_bsp;
extern unsigned int max_physical_apicid;
extern int mpc_default_type;
extern unsigned long mp_lapic_addr;
+extern void boot_cpu_is_bsp_init(void);
+
#ifdef CONFIG_X86_LOCAL_APIC
extern int smp_found_config;
#else
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index eca89c5..66cab35 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -64,6 +64,12 @@ unsigned disabled_cpus;
unsigned int boot_cpu_physical_apicid = -1U;
/*
+ * Indicates whether the processor that is doing the boot up, is BSP
+ * processor or not.
+ */
+bool boot_cpu_is_bsp;
+
+/*
* The highest APIC ID seen during enumeration.
*/
unsigned int max_physical_apicid;
@@ -2589,3 +2595,11 @@ static int __init lapic_insert_resource(void)
* that is using request_resource
*/
late_initcall(lapic_insert_resource);
+
+void boot_cpu_is_bsp_init(void)
+{
+ u32 l, h;
+
+ rdmsr(MSR_IA32_APICBASE, l, h);
+ boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f8ec578..f47f716 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1169,6 +1169,8 @@ void __init setup_arch(char **cmdline_p)
early_quirks();
+ boot_cpu_is_bsp_init();
+
/*
* Read APIC and some other early information from ACPI tables.
*/
On 08/29/2013 02:27 AM, HATAYAMA Daisuke wrote:
> This is the patch series to address the issue that kdump 2nd kernel
> now fails to wake up multiple CPUs.
Please explain the "now" in the above sentence. Is this a regression?
If so, what is its impact? Is this something that needs to go into 3.11
as a post-rc7 change, which means it better be hyper-critical?
-hpa
"H. Peter Anvin" <[email protected]> writes:
> On 08/29/2013 02:27 AM, HATAYAMA Daisuke wrote:
>> This is the patch series to address the issue that kdump 2nd kernel
>> now fails to wake up multiple CPUs.
>
> Please explain the "now" in the above sentence. Is this a regression?
> If so, what is its fimpact?
This is not a regression.
> Is this something that needs to go into 3.11
> as a post-rc7 change, which means it better be hyper-critical?
No. This is something that does not need to go into 3.11.
This situation is people who run machines of unreasonable size really
would like to use multiple cpus when generating crash dumps. A
practical problem with that desire is that sending a (startup? init? I
forget which) IPI to a processor with the BSP flag sent triggers that
processor to load code from 0xfffffff0 instead of the vector specified
in the IPI. At which point the processor which jumped to 0xfffffff0 and
is running BIOS code does not call into the kernel so never comes up
which is unfortunate, and worse almost always triggers a soft reset by
writing hardware registers.
Which means the practical way to avoid this sort of thing is not to send
init/startup IPIs to processors with the BSP bit set. Which MPtables
and ACPI tables should denote as the boot processor.
In a previous attack on this problem we explored if it was possible to
clear the BSP bit and be able to use all processors but that does not
work. I think it was actually your suggestion that we just skip the
BSP.
Anyway this long simmering issue has raised it's head again and we have
iron-clad evidence that the only thing people can do is avoid the
problem so this is a patchset to allow people who wind up in a
situtation where they are not booting on the bootstrap processor to to
avoid problems, and use as many of their other processors as possible.
Which should make people with boxes of unusual size happy.
Eric
(2013/08/29 22:54), H. Peter Anvin wrote:
> On 08/29/2013 02:27 AM, HATAYAMA Daisuke wrote:
>> This is the patch series to address the issue that kdump 2nd kernel
>> now fails to wake up multiple CPUs.
>
> Please explain the "now" in the above sentence. Is this a regression?
> If so, what is its impact? Is this something that needs to go into 3.11
> as a post-rc7 change, which means it better be hyper-critical?
>
> -hpa
>
>
This is not a regression just as Eric explains.
There is also my explanation in the description of the 2nd patch. I should have described so here explicitly.
--
Thanks.
HATAYAMA, Daisuke
On Thu, Aug 29, 2013 at 04:37:44PM -0700, Eric W. Biederman wrote:
[..]
> This situation is people who run machines of unreasonable size really
> would like to use multiple cpus when generating crash dumps.
Yes. Now kdump is in a phase where people are doing scalability work. And
one of the problems is that on mutli tera byte machines, filtering is
taking long time. With-in filtering it is especially the compression which
takes the longest (hatayama and cliff wickman have done some study here).
So Idea seems to be that bring up more cpus in the second kernel and try
to parallelize compression work and try to save dump time of multi tera
byte machines.
Thanks
Vivek
On 08/29/2013 04:51 PM, HATAYAMA Daisuke wrote:
>
> This is not a regression just as Eric explains.
>
> There is also my explanation in the description of the 2nd patch. I
> should have described so here explicitly.
>
OK. We're probably days away from a merge window, so I would like to
put this in the queue for 3.13.
-hpa
On Thu, Aug 29, 2013 at 06:28:04PM +0900, HATAYAMA Daisuke wrote:
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 66cab35..fd969d1 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -2113,13 +2113,29 @@ void disconnect_bsp_APIC(int virt_wire_setup)
> apic_write(APIC_LVT1, value);
> }
>
> -void generic_processor_info(int apicid, int version)
> +void generic_processor_info(int apicid, bool isbsp, int version)
> {
> int cpu, max = nr_cpu_ids;
> bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
> phys_cpu_present_map);
>
> /*
> + * If boot cpu is AP, we now don't have any way to initialize
> + * BSP. To save memory consumed, we disable BSP this case and
I don't think we disable the BSP just so that we save memory and rather
because we hang in the kdump kernel otherwise, right?
> + * use (N-1)-cpus.
> + */
> + if (isbsp && !boot_cpu_is_bsp) {
This variable naming looks confusing, IMHO. It would probably be more
understandable if 'isbsp' was called 'boot_cpu' to denote that this is
the CPU we're booting on currently. The comment above it then explains
that it is an AP and it might also refer to the issue why we're doing
that.
> + int thiscpu = num_processors + disabled_cpus;
> +
> + pr_warning("ACPI: The boot cpu is not BSP. "
> + "The BSP Processor %d/0x%x ignored.\n",
> + thiscpu, apicid);
Visible comment, so needs a bit of correcting:
"ACPI: We're not booting on the BSP; BSP %d/0x%x ignored."
> +
> + disabled_cpus++;
> + return;
> + }
> +
> + /*
> * If boot cpu has not been detected yet, then only allow upto
> * nr_cpu_ids - 1 processors and keep one slot free for boot cpu
> */
Thanks.
--
Regards/Gruss,
Boris.
(2013/08/31 14:22), Borislav Petkov wrote:
> On Thu, Aug 29, 2013 at 06:28:04PM +0900, HATAYAMA Daisuke wrote:
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index 66cab35..fd969d1 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -2113,13 +2113,29 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>> apic_write(APIC_LVT1, value);
>> }
>>
>> -void generic_processor_info(int apicid, int version)
>> +void generic_processor_info(int apicid, bool isbsp, int version)
>> {
>> int cpu, max = nr_cpu_ids;
>> bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
>> phys_cpu_present_map);
>>
>> /*
>> + * If boot cpu is AP, we now don't have any way to initialize
>> + * BSP. To save memory consumed, we disable BSP this case and
>
> I don't think we disable the BSP just so that we save memory and rather
> because we hang in the kdump kernel otherwise, right?
>
Thanks for your reviewing.
Yes, primary reason of disabling BSP is to avoid hang/reset in the kdump
2nd kernel. Saving memory is the secondary merit compared with
the user-space workaround by specifying nr_cpus=1 or maxcpus=1 and
waking up APs later except for the BSP. I will not write this in the
next version.
>> + * use (N-1)-cpus.
>> + */
>> + if (isbsp && !boot_cpu_is_bsp) {
>
> This variable naming looks confusing, IMHO. It would probably be more
> understandable if 'isbsp' was called 'boot_cpu' to denote that this is
> the CPU we're booting on currently. The comment above it then explains
> that it is an AP and it might also refer to the issue why we're doing
> that.
>
As you suggest, boot_cpu seems more understandable also to me. BTW, please
notice that it doesn't denote that the CPU we're booting on currently,
but that the CPU with BSP flag set.
In general, current code uses many terms to denote the cpu that is run
at kernel boot-up processing such as boot cpu, bsp, cpu0 and possibly others
since in usual situation, boot cpu is always BSP and assigned to cpu0.
But it is not the case in case of kexec. I'm using the word bsp purposely
in the isbsp to mean the CPU with BSP flag set.
So I think it's better to use bsp_cpu here to denote the CPU with BSP flag set.
For the comment, how about the following one?
/*
* In this case, boot cpu is AP. This can happen on
* kexec/kdump. Consider the case that crash happens on some
* AP and enters kdump 2nd kernel with the AP.
*
* Then, there's issue that if we send INIT to BSP, due to x86
* hardware specification, it is forced to jump at BIOS init
* code and system hangs or resets immediately.
*
* To avoid the issue, we disable BSP. Then, there's no longer
* possbility to send INIT to BSP.
*/
>> + int thiscpu = num_processors + disabled_cpus;
>> +
>> + pr_warning("ACPI: The boot cpu is not BSP. "
>> + "The BSP Processor %d/0x%x ignored.\n",
>> + thiscpu, apicid);
>
> Visible comment, so needs a bit of correcting:
>
> "ACPI: We're not booting on the BSP; BSP %d/0x%x ignored."
>
Yes, I'll use this message in the next patch.
>> +
>> + disabled_cpus++;
>> + return;
>> + }
>> +
>> + /*
>> * If boot cpu has not been detected yet, then only allow upto
>> * nr_cpu_ids - 1 processors and keep one slot free for boot cpu
>> */
>
> Thanks.
>
--
Thanks.
HATAYAMA, Daisuke
On Mon, Sep 02, 2013 at 11:32:59AM +0900, HATAYAMA Daisuke wrote:
> As you suggest, boot_cpu seems more understandable also to me. BTW,
> please notice that it doesn't denote that the CPU we're booting on
> currently, but that the CPU with BSP flag set.
Hmm, by "BSP flag set" you mean it is the first LAPIC entry in the MADT,
correct? At least this is the case when you set isbsp to true. Because,
there's also the BSC flag in APIC_BAR (MSR 0x1b) which denotes the
bootstrapping core on node 0.
> In general, current code uses many terms to denote the cpu that is run
> at kernel boot-up processing such as boot cpu, bsp, cpu0 and possibly
> others since in usual situation, boot cpu is always BSP and assigned
> to cpu0. But it is not the case in case of kexec. I'm using the word
> bsp purposely in the isbsp to mean the CPU with BSP flag set.
>
> So I think it's better to use bsp_cpu here to denote the CPU with BSP
> flag set.
Right.
> For the comment, how about the following one?
>
> /*
> * In this case, boot cpu is AP. This can happen on
> * kexec/kdump. Consider the case that crash happens on some
> * AP and enters kdump 2nd kernel with the AP.
> *
> * Then, there's issue that if we send INIT to BSP, due to x86
> * hardware specification, it is forced to jump at BIOS init
> * code and system hangs or resets immediately.
> *
> * To avoid the issue, we disable BSP. Then, there's no longer
> * possbility to send INIT to BSP.
> */
Yes, much better. Especially when looking down the road and people have
forgotten what the whole fuss was about, a nice detailed comment is
priceless.
Thanks.
--
Regards/Gruss,
Boris.
(2013/09/02 16:13), Borislav Petkov wrote:
> On Mon, Sep 02, 2013 at 11:32:59AM +0900, HATAYAMA Daisuke wrote:
>> As you suggest, boot_cpu seems more understandable also to me. BTW,
>> please notice that it doesn't denote that the CPU we're booting on
>> currently, but that the CPU with BSP flag set.
>
> Hmm, by "BSP flag set" you mean it is the first LAPIC entry in the MADT,
> correct? At least this is the case when you set isbsp to true. Because,
> there's also the BSC flag in APIC_BAR (MSR 0x1b) which denotes the
> bootstrapping core on node 0.
>
The reason why I don't lookup BSP flag in MSR is that it's impossible.
To read MSR of some CPU, we need to use rdmsr instruction on the CPU.
However, in case of this issue, the BSP is halting or running in
the kdump 1st kernel.
A whole explanation is written in the patch description.
--
Thanks.
HATAYAMA, Daisuke
On Mon, Sep 02, 2013 at 06:42:44PM +0900, HATAYAMA Daisuke wrote:
> The reason why I don't lookup BSP flag in MSR is that it's impossible.
> To read MSR of some CPU, we need to use rdmsr instruction on the CPU.
> However, in case of this issue, the BSP is halting or running in
> the kdump 1st kernel.
Yes, and on the AP, that flag would be cleared which makes it not a BSP.
> A whole explanation is written in the patch description.
Those tend to get lost in git history when a bunch of whitespace jerk
offs appear ontop. So a nicely written comment in the code could be very
helpful.
Thanks.
--
Regards/Gruss,
Boris.
(2013/09/04 15:12), Borislav Petkov wrote:
> On Mon, Sep 02, 2013 at 06:42:44PM +0900, HATAYAMA Daisuke wrote:
>> The reason why I don't lookup BSP flag in MSR is that it's impossible.
>> To read MSR of some CPU, we need to use rdmsr instruction on the CPU.
>> However, in case of this issue, the BSP is halting or running in
>> the kdump 1st kernel.
>
> Yes, and on the AP, that flag would be cleared which makes it not a BSP.
>
>> A whole explanation is written in the patch description.
>
> Those tend to get lost in git history when a bunch of whitespace jerk
> offs appear ontop. So a nicely written comment in the code could be very
> helpful.
>
> Thanks.
>
Yes, that's the point this patch series misses. I'll describe the explanation
in Documentation/kexec/kexec.txt and point it in the comment of boot-up code.
--
Thanks.
HATAYAMA, Daisuke
On Fri, Aug 30, 2013 at 08:43:56AM -0700, H. Peter Anvin wrote:
> On 08/29/2013 04:51 PM, HATAYAMA Daisuke wrote:
> >
> > This is not a regression just as Eric explains.
> >
> > There is also my explanation in the description of the 2nd patch. I
> > should have described so here explicitly.
> >
>
> OK. We're probably days away from a merge window, so I would like to
> put this in the queue for 3.13.
Hi hpa,
Have you queued up these patches for 3.13?
Thanks
Vivek
Commit-ID: 68c3467bb95a9e99f46d04cb96f60c04bb997ac8
Gitweb: http://git.kernel.org/tip/68c3467bb95a9e99f46d04cb96f60c04bb997ac8
Author: HATAYAMA Daisuke <[email protected]>
AuthorDate: Thu, 29 Aug 2013 18:27:58 +0900
Committer: H. Peter Anvin <[email protected]>
CommitDate: Wed, 9 Oct 2013 15:41:10 -0700
x86, apic: Add boot_cpu_is_bsp() to check if boot cpu is BSP
Kexec can enter the kdump 2nd kernel on AP if crash happens on AP. To
check if boot cpu is BSP, introduce a helper function
boot_cpu_is_bsp().
Signed-off-by: HATAYAMA Daisuke <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/mpspec.h | 3 +++
arch/x86/kernel/apic/apic.c | 14 ++++++++++++++
arch/x86/kernel/setup.c | 2 ++
3 files changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 626cf70..a8a4338 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,10 +47,13 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
extern unsigned int boot_cpu_physical_apicid;
+extern bool boot_cpu_is_bsp;
extern unsigned int max_physical_apicid;
extern int mpc_default_type;
extern unsigned long mp_lapic_addr;
+extern void boot_cpu_is_bsp_init(void);
+
#ifdef CONFIG_X86_LOCAL_APIC
extern int smp_found_config;
#else
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a7eb82d..488709d 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -64,6 +64,12 @@ unsigned disabled_cpus;
unsigned int boot_cpu_physical_apicid = -1U;
/*
+ * Indicates whether the processor that is doing the boot up, is BSP
+ * processor or not.
+ */
+bool boot_cpu_is_bsp;
+
+/*
* The highest APIC ID seen during enumeration.
*/
unsigned int max_physical_apicid;
@@ -2589,3 +2595,11 @@ static int __init lapic_insert_resource(void)
* that is using request_resource
*/
late_initcall(lapic_insert_resource);
+
+void boot_cpu_is_bsp_init(void)
+{
+ u32 l, h;
+
+ rdmsr(MSR_IA32_APICBASE, l, h);
+ boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..a30bc06 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1166,6 +1166,8 @@ void __init setup_arch(char **cmdline_p)
early_quirks();
+ boot_cpu_is_bsp_init();
+
/*
* Read APIC and some other early information from ACPI tables.
*/
Commit-ID: 1d79e607332d67d9132c176d99b5e7fabe1b6b7f
Gitweb: http://git.kernel.org/tip/1d79e607332d67d9132c176d99b5e7fabe1b6b7f
Author: HATAYAMA Daisuke <[email protected]>
AuthorDate: Thu, 29 Aug 2013 18:28:04 +0900
Committer: H. Peter Anvin <[email protected]>
CommitDate: Wed, 9 Oct 2013 15:41:11 -0700
x86, apic: Disable BSP if boot cpu is AP
Currently, on x86 architecture, if crash happens on AP in the kdump
1st kernel, the 2nd kernel fails to wake up multiple CPUs. The typical
behaviour we actually see is immediate system reset or hang.
This comes from the hardware specification that the processor with BSP
flag is jumped at BIOS init code when receiving INIT; the behaviour we
then see depends on the init code.
This never happens if we use only one cpu in the 2nd kernel. So, we
have avoided the issue by the workaround that specifying maxcpus=1 or
nr_cpus=1 in kernel parameter of the 2nd kernel.
In order to address the issue, this patch disables BSP if boot cpu is
an AP, and thus we don't try to wake up the BSP by sending INIT.
Before this idea we discussed the following two ideas but we cannot
adopt them in each reasons:
1. Switch CPU from AP to BSP via IPI NMI at crash in the 1st kernel
This is done in the kdump crash path where logic is in
inconsistent state. Any part of memory can be corrupted, including
hardware-related table being accessed for example when paging is
performed or interruption is performed.
2. Unset BSP flag of the boot cpu in the 1st kernel
Unsetting BSP flag can affect some real world firmware badly. For
example, Ma verified that some HP systems fail to reboot under
this configuration. See:
http://lkml.indiana.edu/hypermail/linux/kernel/1308.1/03574.html
Due to the idea 1, we have to address the issue in the 2nd kernel on
AP. Then, it's impossible to know which CPU is BSP by rdmsr
instruction because the CPU is the one we are now trying to wake
up. From the same reason, it's also impossible to unset BSP flag of
the BSP by wrmsr instruction.
Next, due to the idea 2, BSP is halting in the 1st kernel while
keeping BSP flag set (or possibly could be running somewhere in
catastrophic state.) In generall, CPUs except for the boot cpu in the
2nd kernel -- the cpu under which crash happened --- can be thought of
as remaining in any inconsistent state in the 1st kernel. For APs,
it's possible to recover sane state by initiating INIT to them; see
3.7.3 Processor-specific INIT in MultiProcessor
specification. However, there's no way for BSP. Therefore, there's no
other way to disable BSP.
My motivation is to generate crash dump quickly on the system with
huge memory. We can assume such system also has a lot of N-cpus and
(N-1)-cpus are still available.
To identify which CPU is BSP, we lookup ACPI table or MP table. One
concern is that ACPI guidlines BIOS *should* list the BSP in the first
MADT LAPIC entry; not *must*. In this sense, this logic relis on BIOS
following ACPI's guideline. On the other hand, we don't need to worry
about this in MP table case because it has explit BSP flag.
To avoid any undesirable bahaviour caused by any broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpu=1 or nr_cpus=1, as is currently done in
default kdump configuration. (But of course, it's problematic in
maxcpu=1 case if trying to wake up other cpus later in user space.)
SFI and devicetree doesn't provide BSP information, so there's no
functionality change in their codes, only assigning false for all the
entries, keeping interface uniform.
[ hpa: it might be better in the future for the primary kernel to
explicitly export the APIC ID for the BSP. The primary kernel will
know, both due to which CPU it was booted from and because it can
see the BSP flag. However, this seems like a net win for now. ]
Signed-off-by: HATAYAMA Daisuke <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/mpspec.h | 2 +-
arch/x86/kernel/acpi/boot.c | 11 ++++++++++-
arch/x86/kernel/apic/apic.c | 18 +++++++++++++++++-
arch/x86/kernel/devicetree.c | 1 +
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/platform/sfi/sfi.c | 2 +-
6 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index a8a4338..d96f409 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -97,7 +97,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
#define default_get_smp_config x86_init_uint_noop
#endif
-void generic_processor_info(int apicid, int version);
+void generic_processor_info(int apicid, bool isbsp, int version);
#ifdef CONFIG_ACPI
extern void mp_register_ioapic(int id, u32 address, u32 gsi_base);
extern void mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 40c7660..e0ea929 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -192,6 +192,7 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
static void acpi_register_lapic(int id, u8 enabled)
{
unsigned int ver = 0;
+ bool isbsp = false;
if (id >= MAX_LOCAL_APIC) {
printk(KERN_INFO PREFIX "skipped apicid that is too big\n");
@@ -206,7 +207,15 @@ static void acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
- generic_processor_info(id, ver);
+ /*
+ * ACPI specification describes that platform firmware should
+ * list the boot processor as the first LAPIC entry in the
+ * MADT.
+ */
+ if (!num_processors && !disabled_cpus)
+ isbsp = true;
+
+ generic_processor_info(id, isbsp, ver);
}
static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 488709d..a6046bb 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2113,13 +2113,29 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
-void generic_processor_info(int apicid, int version)
+void generic_processor_info(int apicid, bool isbsp, int version)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);
/*
+ * If boot cpu is AP, we now don't have any way to initialize
+ * BSP. To save memory consumed, we disable BSP this case and
+ * use (N-1)-cpus.
+ */
+ if (isbsp && !boot_cpu_is_bsp) {
+ int thiscpu = num_processors + disabled_cpus;
+
+ pr_warning("ACPI: The boot cpu is not BSP. "
+ "The BSP Processor %d/0x%x ignored.\n",
+ thiscpu, apicid);
+
+ disabled_cpus++;
+ return;
+ }
+
+ /*
* If boot cpu has not been detected yet, then only allow upto
* nr_cpu_ids - 1 processors and keep one slot free for boot cpu
*/
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index 376dc78..0ed2da8 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -182,6 +182,7 @@ static void __init dtb_lapic_setup(void)
pic_mode = 1;
register_lapic_address(r.start);
generic_processor_info(boot_cpu_physical_apicid,
+ false,
GET_APIC_VERSION(apic_read(APIC_LVR)));
#endif
}
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index d2b5648..003b07fe 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -54,6 +54,7 @@ static void __init MP_processor_info(struct mpc_cpu *m)
{
int apicid;
char *bootup_cpu = "";
+ bool isbsp = false;
if (!(m->cpuflag & CPU_ENABLED)) {
disabled_cpus++;
@@ -64,11 +65,21 @@ static void __init MP_processor_info(struct mpc_cpu *m)
if (m->cpuflag & CPU_BOOTPROCESSOR) {
bootup_cpu = " (Bootup-CPU)";
- boot_cpu_physical_apicid = m->apicid;
+ /*
+ * boot cpu cannot be BSP if any crash happens on AP
+ * and kexec enters the 2nd kernel.
+ *
+ * Also, boot_cpu_physical_apicid can be initialized
+ * before reaching here; for example, in
+ * register_lapic_address().
+ */
+ if (boot_cpu_is_bsp && boot_cpu_physical_apicid == -1U)
+ boot_cpu_physical_apicid = m->apicid;
+ isbsp = true;
}
printk(KERN_INFO "Processor #%d%s\n", m->apicid, bootup_cpu);
- generic_processor_info(apicid, m->apicver);
+ generic_processor_info(apicid, isbsp, m->apicver);
}
#ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/platform/sfi/sfi.c b/arch/x86/platform/sfi/sfi.c
index bcd1a70..4f111c7 100644
--- a/arch/x86/platform/sfi/sfi.c
+++ b/arch/x86/platform/sfi/sfi.c
@@ -45,7 +45,7 @@ static void __init mp_sfi_register_lapic(u8 id)
pr_info("registering lapic[%d]\n", id);
- generic_processor_info(id, GET_APIC_VERSION(apic_read(APIC_LVR)));
+ generic_processor_info(id, false, GET_APIC_VERSION(apic_read(APIC_LVR)));
}
static int __init sfi_parse_cpus(struct sfi_table_header *table)
On 10/09/2013 04:16 PM, tip-bot for HATAYAMA Daisuke wrote:
> Commit-ID: 1d79e607332d67d9132c176d99b5e7fabe1b6b7f
> Gitweb: http://git.kernel.org/tip/1d79e607332d67d9132c176d99b5e7fabe1b6b7f
> Author: HATAYAMA Daisuke <[email protected]>
> AuthorDate: Thu, 29 Aug 2013 18:28:04 +0900
> Committer: H. Peter Anvin <[email protected]>
> CommitDate: Wed, 9 Oct 2013 15:41:11 -0700
>
> x86, apic: Disable BSP if boot cpu is AP
This patch seems to trigger build failures in some configurations.
Specifically:
(.init.text+0x1307): undefined reference to `boot_cpu_is_bsp_init'
Unfortunately I don't have the specific configuration which triggers the
failure, as this was discovered by Fengguang's robot.
-hpa
* H. Peter Anvin <[email protected]> wrote:
> On 10/09/2013 04:16 PM, tip-bot for HATAYAMA Daisuke wrote:
> > Commit-ID: 1d79e607332d67d9132c176d99b5e7fabe1b6b7f
> > Gitweb: http://git.kernel.org/tip/1d79e607332d67d9132c176d99b5e7fabe1b6b7f
> > Author: HATAYAMA Daisuke <[email protected]>
> > AuthorDate: Thu, 29 Aug 2013 18:28:04 +0900
> > Committer: H. Peter Anvin <[email protected]>
> > CommitDate: Wed, 9 Oct 2013 15:41:11 -0700
> >
> > x86, apic: Disable BSP if boot cpu is AP
>
> This patch seems to trigger build failures in some configurations.
> Specifically:
>
> (.init.text+0x1307): undefined reference to `boot_cpu_is_bsp_init'
>
> Unfortunately I don't have the specific configuration which triggers the
> failure, as this was discovered by Fengguang's robot.
I have triggered that too and have such a config, it's attached.
Thanks,
Ingo
On Wed, 9 Oct 2013 16:20:12 -0400
Vivek Goyal <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 08:43:56AM -0700, H. Peter Anvin wrote:
> > On 08/29/2013 04:51 PM, HATAYAMA Daisuke wrote:
> > >
> > > This is not a regression just as Eric explains.
> > >
> > > There is also my explanation in the description of the 2nd patch. I
> > > should have described so here explicitly.
> > >
> >
> > OK. We're probably days away from a merge window, so I would like to
> > put this in the queue for 3.13.
>
> Hi hpa,
>
> Have you queued up these patches for 3.13?
Yeah, I'm also interested. But I think we were waiting for HATAYAMA
Daisuke to send a cleaned up version of the patch set.
@HATAYAMA: If you're too busy to clean up the series, I can do that for
you and re-send it here. But I don't want to "steal" the work from you.
Petr T