2012-10-16 04:35:22

by Hatayama, Daisuke

[permalink] [raw]
Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

Multiple CPUs are useful for CPU-bound processing like compression and
I do want to use compression to generate crash dump quickly. But now
we cannot wake up the 2nd and later cpus in the kdump 2nd kernel if
crash happens on AP. If crash happens on AP, kexec enters the 2nd
kernel with the AP, and the BSP is expected to be halting in the 1st
kernel or possibly in some fatal system error state.

To wake up AP, we use the method called INIT-INIT-SIPI. But this
cannot be used for BSP: INIT causes BSP to jump into BIOS init code. A
typical visible behaviour is hang or immediate reset, depending on the
BIOS init code.
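
For illustration only (not part of this patch set): a minimal sketch of
the ICR low-dword values behind the INIT-INIT-SIPI steps mentioned
above, assuming the field encodings from the local APIC chapter of the
Intel SDM. The macro names mirror the ones in the kernel's apicdef.h;
the helpers icr_init_assert(), icr_init_deassert() and icr_startup()
are hypothetical names made up for this sketch, and nothing here
actually writes to an APIC register.

```c
#include <stdint.h>

/* ICR low-dword field encodings (Intel SDM, local APIC chapter). */
#define APIC_DM_INIT        0x00500u  /* delivery mode: INIT */
#define APIC_DM_STARTUP     0x00600u  /* delivery mode: Start-up (SIPI) */
#define APIC_INT_ASSERT     0x04000u  /* level: assert */
#define APIC_INT_LEVELTRIG  0x08000u  /* trigger mode: level */

/* Hypothetical helper: ICR value for the first step, INIT assert. */
static uint32_t icr_init_assert(void)
{
	return APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT;
}

/* Hypothetical helper: ICR value for the second step, INIT de-assert. */
static uint32_t icr_init_deassert(void)
{
	return APIC_INT_LEVELTRIG | APIC_DM_INIT;
}

/*
 * Hypothetical helper: ICR value for the SIPI. The vector field is the
 * 4 KiB page number of the real-mode entry point, so start_eip must be
 * page aligned and below 1 MiB.
 */
static uint32_t icr_startup(uint32_t start_eip)
{
	return APIC_DM_STARTUP | ((start_eip >> 12) & 0xffu);
}
```

For example, an entry point at physical address 0x9a000 gives a SIPI
vector of 0x9a. The point of this cover letter is that the first two
values are safe to send to an AP only.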

AP can be initiated by INIT even in a fatal state: MP spec explains
that processor-specific INIT can be used to recover AP from a fatal
system error. On the other hand, there's no method for BSP to recover;
it might be possible to do so by NMI plus any hand-coded reset code
that is carefully designed, but at least I have no idea in this
direction now.

Therefore, what I do in this patch set is simply to disable BSP if
boot cpu is AP.

My motivation is to use multiple CPUs in order to quickly generate
crash dump on machines with a huge amount of memory. I assume such
machines also tend to have a lot of CPUs. So disabling one CPU would
be no problem.

On most BIOSs, BSP is always assigned to cpu#1; on other BIOSs, BSP
could probably be assigned to a fixed cpu number. Assuming this, it
might be possible to wake up all the cpus except cpu#1 and simply
never wake up cpu#1. But I don't choose this in this patch set
because:

- It's ugly design to keep a switch in sysfs that can unintentionally
cause the system to enter undefined behaviour.

- Memory space for BSP is never used if BSP is not running. Amount of
reserved memory for 2nd kernel is typically from 128MB to 512MB
only, severely limited. If BSP is unused, I want to use the space
for another AP instead.

Note: recent upstream kernel fails reserving memory for kdump 2nd
kernel. To run kdump, please apply the patch below on top of this
patch set:
https://lkml.org/lkml/2012/8/31/238

---

HATAYAMA Daisuke (2):
x86, apic: Disable BSP if boot cpu is AP
x86, apic: Introduce boot_cpu_is_bsp indicating whether boot cpu is BSP or not


arch/x86/include/asm/mpspec.h | 5 ++++-
arch/x86/kernel/acpi/boot.c | 10 +++++++++-
arch/x86/kernel/apic/apic.c | 34 +++++++++++++++++++++++++++++++++-
arch/x86/kernel/devicetree.c | 2 +-
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/kernel/setup.c | 2 ++
arch/x86/platform/sfi/sfi.c | 2 +-
7 files changed, 63 insertions(+), 7 deletions(-)

--
Thanks.
HATAYAMA, Daisuke


2012-10-16 04:35:28

by Hatayama, Daisuke

[permalink] [raw]
Subject: [PATCH v1 1/2] x86, apic: Introduce boot_cpu_is_bsp indicating whether boot cpu is BSP or not

Part of boot-up code assumes booting CPU is BSP, but kexec can enter
the 2nd kernel with AP. To be able to distinguish these throughout
kernel processing, introduce boot_cpu_is_bsp.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
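
For illustration only (not part of the patch): a user-space sketch of
the test that boot_cpu_is_bsp_init() performs, assuming the IA32_APIC_BASE
layout from the Intel SDM (MSR index 0x1b, BSP flag in bit 8). The
function apicbase_is_bsp() is a hypothetical name; it takes the MSR
value as an argument because rdmsr itself is only available in ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_APICBASE      0x0000001bu  /* MSR index, shown for reference */
#define MSR_IA32_APICBASE_BSP  (1u << 8)    /* set only on the BSP */

/* Hypothetical helper: true if the given IA32_APIC_BASE value has the BSP flag. */
static bool apicbase_is_bsp(uint64_t apicbase)
{
	return (apicbase & MSR_IA32_APICBASE_BSP) != 0;
}
```

With the default APIC base 0xfee00000 and the global enable bit (bit 11)
set, a BSP would read 0xfee00900 and an AP 0xfee00800.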

arch/x86/include/asm/mpspec.h | 3 +++
arch/x86/kernel/apic/apic.c | 13 +++++++++++++
arch/x86/kernel/setup.c | 2 ++
3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 3e2f42a..d56f253 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -47,10 +47,13 @@ extern int mp_bus_id_to_type[MAX_MP_BUSSES];
extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);

extern unsigned int boot_cpu_physical_apicid;
+extern bool boot_cpu_is_bsp;
extern unsigned int max_physical_apicid;
extern int mpc_default_type;
extern unsigned long mp_lapic_addr;

+extern void boot_cpu_is_bsp_init(void);
+
#ifdef CONFIG_X86_LOCAL_APIC
extern int smp_found_config;
#else
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b17416e..d8d69e4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -62,6 +62,10 @@ unsigned disabled_cpus __cpuinitdata;
/* Processor that is doing the boot up */
unsigned int boot_cpu_physical_apicid = -1U;

+/* Indicates whether the processor that is doing the boot up, is BSP
+ * processor or not */
+bool boot_cpu_is_bsp;
+
/*
* The highest APIC ID seen during enumeration.
*/
@@ -2515,3 +2519,12 @@ static int __init lapic_insert_resource(void)
* that is using request_resource
*/
late_initcall(lapic_insert_resource);
+
+void boot_cpu_is_bsp_init(void)
+{
+ u32 l, h;
+
+ rdmsr(MSR_IA32_APICBASE, l, h);
+
+ boot_cpu_is_bsp = (l & MSR_IA32_APICBASE_BSP) ? true : false;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a2bb18e..6ecb9bc 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -988,6 +988,8 @@ void __init setup_arch(char **cmdline_p)

early_quirks();

+ boot_cpu_is_bsp_init();
+
/*
* Read APIC and some other early information from ACPI tables.
*/

2012-10-16 04:35:34

by Hatayama, Daisuke

[permalink] [raw]
Subject: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

We disable BSP if boot cpu is AP.

INIT-INIT-SIPI sequence, a protocol to initiate AP, cannot be used for
BSP since it causes BSP to jump to BIOS init code; typical visible
behaviour is hang or immediate reset, depending on the BIOS init code.

INIT can be used to reset AP in a fatal system error state as
described in MP spec 3.7.3 Processor-specific INIT. In contrast, there
is no processor-specific INIT for BSP to initialize from a fatal system
error. It might be possible to do so by NMI plus any hand-crafted
reset code that is carefully designed, but at least I have no idea in
this direction now.

By the way, my motivation is to generate crash dump quickly on the
system with huge memory. I think we can assume such system also has a
lot of cpus. If so, it would be no problem if only one cpu gets
unavailable.

We lookup ACPI table or MP table to get BSP information because we
cannot run rdmsr instruction on the CPU we are about to wake up just
now.

One thing to be concerned about here is that the ACPI guideline says
BIOS *should* list the BSP in the first MADT LAPIC entry; not *must*.
In this sense, this logic relies on BIOS following ACPI's guideline.
On the other hand, we don't need to worry about this in MP table case
because it has an explicit BSP flag.

To avoid any undesirable behaviour caused by any broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpus=1 or nr_cpus=1, as is currently done in
default kdump configuration. (Of course, it's problematic in maxcpus=1
case if trying to wake up other cpus in user space later.)

Some firmware features such as hibernation and suspend need to switch
the CPU to BSP before transferring execution to firmware, so these
features are unavailable in the BSP-disabled setting. This is no
problem because we don't need hibernation and suspend in the kdump 2nd
kernel.

SFI and devicetree don't provide BSP information, so there's no
functional change in their code; we only pass false for all the
entries, keeping the interface uniform.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
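
For illustration only (not part of the patch): a simplified user-space
model of the decision described above. It assumes the firmware lists
the BSP in the first MADT LAPIC entry and ignores everything else that
generic_processor_info() does (nr_cpu_ids limits, APIC version checks,
and so on). struct cpu_counts, register_cpu() and usable_cpus() are
hypothetical names made up for this sketch.

```c
#include <stdbool.h>

/* Hypothetical model of the counters kept in arch/x86/kernel/apic/apic.c. */
struct cpu_counts {
	int num_processors;	/* CPUs registered so far */
	int disabled_cpus;	/* CPUs listed by firmware but not used */
};

/*
 * Model of the check added to generic_processor_info(): if the entry
 * describes the BSP but we booted on an AP, count it as disabled and
 * skip it. Returns true if the CPU was registered.
 */
static bool register_cpu(struct cpu_counts *c, bool isbsp, bool boot_cpu_is_bsp)
{
	if (isbsp && !boot_cpu_is_bsp) {
		c->disabled_cpus++;
		return false;
	}
	c->num_processors++;
	return true;
}

/*
 * Walk n_entries MADT LAPIC entries and return how many CPUs end up
 * usable. The first entry seen is taken to be the BSP, as in the
 * acpi_register_lapic() hunk.
 */
static int usable_cpus(int n_entries, bool boot_cpu_is_bsp)
{
	struct cpu_counts c = { 0, 0 };
	int i;

	for (i = 0; i < n_entries; i++) {
		bool isbsp = (c.num_processors == 0 && c.disabled_cpus == 0);

		register_cpu(&c, isbsp, boot_cpu_is_bsp);
	}
	return c.num_processors;
}
```

With four entries, a normal boot on the BSP keeps all four CPUs, while
a kdump boot on an AP keeps three.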

arch/x86/include/asm/mpspec.h | 2 +-
arch/x86/kernel/acpi/boot.c | 10 +++++++++-
arch/x86/kernel/apic/apic.c | 21 ++++++++++++++++++++-
arch/x86/kernel/devicetree.c | 2 +-
arch/x86/kernel/mpparse.c | 15 +++++++++++++--
arch/x86/platform/sfi/sfi.c | 2 +-
6 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index d56f253..b5d8e23 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -97,7 +97,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
#define default_get_smp_config x86_init_uint_noop
#endif

-void __cpuinit generic_processor_info(int apicid, int version);
+void __cpuinit generic_processor_info(int apicid, bool isbsp, int version);
#ifdef CONFIG_ACPI
extern void mp_register_ioapic(int id, u32 address, u32 gsi_base);
extern void mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e651f7a..e873c09 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -198,6 +198,7 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
static void __cpuinit acpi_register_lapic(int id, u8 enabled)
{
unsigned int ver = 0;
+ bool isbsp = false;

if (id >= (MAX_LOCAL_APIC-1)) {
printk(KERN_INFO PREFIX "skipped apicid that is too big\n");
@@ -212,7 +213,14 @@ static void __cpuinit acpi_register_lapic(int id, u8 enabled)
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];

- generic_processor_info(id, ver);
+ /*
+ * ACPI says BIOS should list BSP in the first MADT LAPIC
+ * entry.
+ */
+ if (!num_processors && !disabled_cpus)
+ isbsp = true;
+
+ generic_processor_info(id, isbsp, ver);
}

static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d8d69e4..4184853 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2034,13 +2034,32 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}

-void __cpuinit generic_processor_info(int apicid, int version)
+void __cpuinit generic_processor_info(int apicid, bool isbsp, int version)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
phys_cpu_present_map);

/*
+ * If boot cpu is AP, we now don't have any way to initialize
+ * BSP. To save memory consumed, we disable BSP this case.
+ *
+ * Then, we cannot use the features specific to BSP such as
+ * hibernation and suspend. This is no problem because AP
+ * becomes boot cpu only on kexec triggered by crash.
+ */
+ if (isbsp && !boot_cpu_is_bsp) {
+ int thiscpu = num_processors + disabled_cpus;
+
+ pr_warning("ACPI: The boot cpu is not BSP."
+ " The BSP Processor %d/0x%x ignored.\n", thiscpu,
+ apicid);
+
+ disabled_cpus++;
+ return;
+ }
+
+ /*
* If boot cpu has not been detected yet, then only allow upto
* nr_cpu_ids - 1 processors and keep one slot free for boot cpu
*/
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index b158152..efdacc9 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -182,7 +182,7 @@ static void __init dtb_lapic_setup(void)
smp_found_config = 1;
pic_mode = 1;
register_lapic_address(r.start);
- generic_processor_info(boot_cpu_physical_apicid,
+ generic_processor_info(boot_cpu_physical_apicid, false,
GET_APIC_VERSION(apic_read(APIC_LVR)));
#endif
}
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index d2b5648..33167e5 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -54,6 +54,7 @@ static void __init MP_processor_info(struct mpc_cpu *m)
{
int apicid;
char *bootup_cpu = "";
+ bool isbsp = false;

if (!(m->cpuflag & CPU_ENABLED)) {
disabled_cpus++;
@@ -64,11 +65,21 @@ static void __init MP_processor_info(struct mpc_cpu *m)

if (m->cpuflag & CPU_BOOTPROCESSOR) {
bootup_cpu = " (Bootup-CPU)";
- boot_cpu_physical_apicid = m->apicid;
+ /*
+ * boot cpu can not be BSP if any crash happens on AP
+ * and kexec enters the 2nd kernel.
+ *
+ * Also, boot_cpu_physical_apicid can be initialized
+ * before reaching here; for example, in
+ * register_lapic_address().
+ */
+ if (boot_cpu_is_bsp && boot_cpu_physical_apicid == -1U)
+ boot_cpu_physical_apicid = m->apicid;
+ isbsp = true;
}

printk(KERN_INFO "Processor #%d%s\n", m->apicid, bootup_cpu);
- generic_processor_info(apicid, m->apicver);
+ generic_processor_info(apicid, isbsp, m->apicver);
}

#ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/platform/sfi/sfi.c b/arch/x86/platform/sfi/sfi.c
index 7785b72..e646041 100644
--- a/arch/x86/platform/sfi/sfi.c
+++ b/arch/x86/platform/sfi/sfi.c
@@ -45,7 +45,7 @@ static void __cpuinit mp_sfi_register_lapic(u8 id)

pr_info("registering lapic[%d]\n", id);

- generic_processor_info(id, GET_APIC_VERSION(apic_read(APIC_LVR)));
+ generic_processor_info(id, false, GET_APIC_VERSION(apic_read(APIC_LVR)));
}

static int __init sfi_parse_cpus(struct sfi_table_header *table)

2012-10-16 04:51:55

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

> -----Original Message-----
> From: HATAYAMA Daisuke [mailto:[email protected]]
> Sent: Monday, October 15, 2012 9:35 PM
> To: [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; [email protected]; Brown, Len; Yu,
> Fenghua; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>
> Multiple CPUs are useful for CPU-bound processing like compression and
> I do want to use compression to generate crash dump quickly. But now
> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
> crash happens on AP. If crash happens on AP, kexec enters the 2nd
> kernel with the AP, and there BSP in the 1st kernel is expected to be
> haling in the 1st kernel or possibly in any fatal system error state.
>
> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
> BSP to jump into BIOS init code. A typical visible behaviour is hang
> or immediate reset, depending on the BIOS init code.
>
> AP can be initiated by INIT even in a fatal state: MP spec explains
> that processor-specific INIT can be used to recover AP from a fatal
> system error. On the other hand, there's no method for BSP to recover;
> it might be possible to do so by NMI plus any hand-coded reset code
> that is carefully designed, but at least I have no idea in this
> direction now.

In my BSP hotplug patchset, BSP is woken up by NMI. The patchset is
not in tip tree yet.

BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336

>
> Therefore, the idea I do in this patch set is simply to disable BSP if
> vboot cpu is AP.
>

The BSP hotplug patchset will be useful for your goal. With the BSP hotplug
patchset, you can wake up BSP and don't need to disable it.

> My motivation is to use multiple CPUs in order to quickly generate
> crash dump on the machine with huge amount of memory. I assume such
> machine tends to also have a lot of CPUs. So disabling one CPU would
> be no problem.

Luckily you don't need to disable any CPU to achieve your goal with
the BSP hotplug patchset :)

On a dual core/single thread machine, this means you get 100% performance
boost with BSP's help.

Plus crash dump kernel code is better structured by not treating BSP
specially.

Thanks.

-Fenghua



2012-10-16 05:04:45

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: "Yu, Fenghua" <[email protected]>
Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Tue, 16 Oct 2012 04:51:36 +0000

>> -----Original Message-----
>> From: HATAYAMA Daisuke [mailto:[email protected]]
>> Sent: Monday, October 15, 2012 9:35 PM
>> To: [email protected]; [email protected];
>> [email protected]
>> Cc: [email protected]; [email protected]; [email protected]; Brown, Len; Yu,
>> Fenghua; [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]
>> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>>
>> Multiple CPUs are useful for CPU-bound processing like compression and
>> I do want to use compression to generate crash dump quickly. But now
>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>> kernel with the AP, and there BSP in the 1st kernel is expected to be
>> haling in the 1st kernel or possibly in any fatal system error state.
>>
>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>> or immediate reset, depending on the BIOS init code.
>>
>> AP can be initiated by INIT even in a fatal state: MP spec explains
>> that processor-specific INIT can be used to recover AP from a fatal
>> system error. On the other hand, there's no method for BSP to recover;
>> it might be possible to do so by NMI plus any hand-coded reset code
>> that is carefully designed, but at least I have no idea in this
>> direction now.
>
> In my BSP hotplug patchset, BPS is waken up by NMI. The patchset is
> not in tip tree yet.
>
> BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336
>
>>
>> Therefore, the idea I do in this patch set is simply to disable BSP if
>> vboot cpu is AP.
>>
>
> The BSP hotplug patchset will be useful for you goal. With the BSP hotplug
> patcheset, you can wake up BSP and don't need to disable it.
>
>> My motivation is to use multiple CPUs in order to quickly generate
>> crash dump on the machine with huge amount of memory. I assume such
>> machine tends to also have a lot of CPUs. So disabling one CPU would
>> be no problem.
>
> Luckily you don't need to disable any CPU to archive your goal with
> the BSP hotplug pachest:)
>
> On a dual core/single thread machine, this means you get 100% performance
> boost with BSP's help.
>
> Plus crash dump kernel code is better structured by not treating BSP
> specially.
>

Hello Fenghua.

I've of course noticed your patch set and locally tested, but I saw
NMI to BSP failed in the 2nd kernel. I'll send a log to you later.

BTW, I tested with your previous v8 patch set. Did you change
something during v8 to v9 relevant to this issue?

Thanks.
HATAYAMA, Daisuke

2012-10-16 05:14:49

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

> >> My motivation is to use multiple CPUs in order to quickly generate
> >> crash dump on the machine with huge amount of memory. I assume such
> >> machine tends to also have a lot of CPUs. So disabling one CPU would
> >> be no problem.
> >
> > Luckily you don't need to disable any CPU to archive your goal with
> > the BSP hotplug pachest:)
> >
> > On a dual core/single thread machine, this means you get 100%
> performance
> > boost with BSP's help.
> >
> > Plus crash dump kernel code is better structured by not treating BSP
> > specially.
> >
>
> Hello Fenghua.
>
> I've of course noticed your patch set and locally tested, but I saw
> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
>
> BTW, I tested with your previous v8 patch set. Did you change
> something during v8 to v9 relevant to this issue?

In patch 0/12 of v9, I describe what changed in v9 on top of v8:

v9: Add Intel vendor check to support the feature on Intel platforms only.

Did you see the BSP wake up failure on the latest tip tree?

There is an RCU regression which prevents BSP from waking up in 3.6.0.
The issue was fixed on 10/12. The commit is a4fbe35a. Please make sure
your tip tree has this commit.

Thanks.

-Fenghua



2012-10-16 05:16:10

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: HATAYAMA Daisuke <[email protected]>
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Tue, 16 Oct 2012 14:03:13 +0900

> From: "Yu, Fenghua" <[email protected]>
> Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
> Date: Tue, 16 Oct 2012 04:51:36 +0000
>
>>> -----Original Message-----
>>> From: HATAYAMA Daisuke [mailto:[email protected]]
>>> Sent: Monday, October 15, 2012 9:35 PM
>>> To: [email protected]; [email protected];
>>> [email protected]
>>> Cc: [email protected]; [email protected]; [email protected]; Brown, Len; Yu,
>>> Fenghua; [email protected]; [email protected];
>>> [email protected]; [email protected];
>>> [email protected]
>>> Subject: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>>>
>>> Multiple CPUs are useful for CPU-bound processing like compression and
>>> I do want to use compression to generate crash dump quickly. But now
>>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
>>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>>> kernel with the AP, and there BSP in the 1st kernel is expected to be
>>> haling in the 1st kernel or possibly in any fatal system error state.
>>>
>>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>>> or immediate reset, depending on the BIOS init code.
>>>
>>> AP can be initiated by INIT even in a fatal state: MP spec explains
>>> that processor-specific INIT can be used to recover AP from a fatal
>>> system error. On the other hand, there's no method for BSP to recover;
>>> it might be possible to do so by NMI plus any hand-coded reset code
>>> that is carefully designed, but at least I have no idea in this
>>> direction now.
>>
>> In my BSP hotplug patchset, BPS is waken up by NMI. The patchset is
>> not in tip tree yet.
>>
>> BSP hotplug patchset can be found at https://lkml.org/lkml/2012/10/12/336
>>
>>>
>>> Therefore, the idea I do in this patch set is simply to disable BSP if
>>> vboot cpu is AP.
>>>
>>
>> The BSP hotplug patchset will be useful for you goal. With the BSP hotplug
>> patcheset, you can wake up BSP and don't need to disable it.
>>
>>> My motivation is to use multiple CPUs in order to quickly generate
>>> crash dump on the machine with huge amount of memory. I assume such
>>> machine tends to also have a lot of CPUs. So disabling one CPU would
>>> be no problem.
>>
>> Luckily you don't need to disable any CPU to archive your goal with
>> the BSP hotplug pachest:)
>>
>> On a dual core/single thread machine, this means you get 100% performance
>> boost with BSP's help.
>>
>> Plus crash dump kernel code is better structured by not treating BSP
>> specially.
>>
>
> Hello Fenghua.
>
> I've of course noticed your patch set and locally tested, but I saw
> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
>
> BTW, I tested with your previous v8 patch set. Did you change
> something during v8 to v9 relevant to this issue?
>

I forgot to mention one thing: your patch distinguishes BSP by
CPU#0. CPU#0 is assigned to the boot cpu, which can be AP in the kdump
2nd kernel. Distinguishing BSP by CPU#0 is not enough here.

I have my local patch set based on your v8 patch doing this, but NMI
to BSP failed. I guess this comes from the difference in BSP states:
halting in play dead in your NMI method, versus halting in the 1st
kernel on crash, or possibly in a fatal system error, in the actual
situation.

Thanks.
HATAYAMA, Daisuke

2012-10-16 06:38:39

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: "Yu, Fenghua" <[email protected]>
Subject: RE: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Tue, 16 Oct 2012 05:14:46 +0000

>> >> My motivation is to use multiple CPUs in order to quickly generate
>> >> crash dump on the machine with huge amount of memory. I assume such
>> >> machine tends to also have a lot of CPUs. So disabling one CPU would
>> >> be no problem.
>> >
>> > Luckily you don't need to disable any CPU to archive your goal with
>> > the BSP hotplug pachest:)
>> >
>> > On a dual core/single thread machine, this means you get 100%
>> performance
>> > boost with BSP's help.
>> >
>> > Plus crash dump kernel code is better structured by not treating BSP
>> > specially.
>> >
>>
>> Hello Fenghua.
>>
>> I've of course noticed your patch set and locally tested, but I saw
>> NMI to BSP failed in the 2nd kernel. I'll send a log to you later.
>>
>> BTW, I tested with your previous v8 patch set. Did you change
>> something during v8 to v9 relevant to this issue?
>
> In the patch 0/12 in v9, I describe what change is in v9 on the top of v8:
>
> v9: Add Intel vendor check to support the feature on Intel platforms only.
>
> Did you see the BSP wake up failure on the latest tip tree?
>
> There is a rcu regression issue which prevents BSP from waking up in 3.6.0.
> The issue has been fixed on 10/12. The commit is a4fbe35a. Please make sure
> your tip tree has this commit.
>

Thanks for pointing this out. And I've now recalled my investigation
in the past. So I want to stop retrying your patch v9 now. This NMI
method is definitely not applicable to 2nd kernel without any change.

Your NMI method assumes BSP thread is halting in play dead loop. But
on the 2nd kernel, BSP is halting in the 1st kernel (or possibly in a
fatal system error). Even if throwing NMI to BSP, it goes back to the
1st kernel soon again. I at least confirmed NMI handler is executed in
this case.

Also, throwing NMI changes stack in the 1st kernel, which is
impermissible from kdump's perspective. But x86_64 uses Interrupt
Stack Table (IST), in which stack switching is not performed. So 2nd
kernel's stack is used at least on x86_64.

To sum up, to apply NMI method in the 2nd kernel, I think it is
necessary to modify contexts pushed on the stack so execution goes to
the 2nd kernel's start_secondary() while initializing its state
appropriately.

Also I think it necessary to discuss whether this NMI method is
reliable enough for kdump use.

Thanks.
HATAYAMA, Daisuke

2012-10-17 14:12:58

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

On Tue, Oct 16, 2012 at 01:35:17PM +0900, HATAYAMA Daisuke wrote:
> Multiple CPUs are useful for CPU-bound processing like compression and
> I do want to use compression to generate crash dump quickly. But now
> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
> crash happens on AP. If crash happens on AP, kexec enters the 2nd
> kernel with the AP, and there BSP in the 1st kernel is expected to be
> haling in the 1st kernel or possibly in any fatal system error state.

Hatayama san,

Do you have any rough numbers on what kind of speed up we are looking
at? IOW, what % of time is spent compressing a filtered dump? On large
memory machines, saving huge dump files is anyway not an option due to
time it takes. So we need to filter it to bare minimum and after that
vmcore size should be reasonable and compression time might not be a
big factor. Hence I am curious what kind of gains we are looking at.

>
> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
> BSP to jump into BIOS init code. A typical visible behaviour is hang
> or immediate reset, depending on the BIOS init code.
>
> AP can be initiated by INIT even in a fatal state: MP spec explains
> that processor-specific INIT can be used to recover AP from a fatal
> system error. On the other hand, there's no method for BSP to recover;
> it might be possible to do so by NMI plus any hand-coded reset code
> that is carefully designed, but at least I have no idea in this
> direction now.
>
> Therefore, the idea I do in this patch set is simply to disable BSP if
> vboot cpu is AP.

So in regular boot BSP still works as we boot on BSP. So this will take
effect only in kdump kernel?

How well does it work with nr_cpus kernel parameter? Currently we boot
with nr_cpus=1 to save on the amount of memory to be reserved. I guess
you might want to boot with nr_cpus=2 or nr_cpus=4 in your case to
speed up compression?


[..]
> Note: recent upstream kernel fails reserving memory for kdump 2nd
> kernel. To run kdump, please apply the patch below on top of this
> patch set:
> https://lkml.org/lkml/2012/8/31/238

Above is a big issue. 3.6 kernel is broken and I can't take dump on F18
either. (works only on one machine). I have not looked into it enough
to figure out what the issue at hand is, but we really need
at least a stop gap fix (assuming others are working on longer term
fix).

Thanks
Vivek

2012-10-18 03:09:32

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: Vivek Goyal <[email protected]>
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Wed, 17 Oct 2012 10:12:34 -0400

> On Tue, Oct 16, 2012 at 01:35:17PM +0900, HATAYAMA Daisuke wrote:
>> Multiple CPUs are useful for CPU-bound processing like compression and
>> I do want to use compression to generate crash dump quickly. But now
>> we cannot wakeup the 2nd and later cpus in the kdump 2nd kernel if
>> crash happens on AP. If crash happens on AP, kexec enters the 2nd
>> kernel with the AP, and there BSP in the 1st kernel is expected to be
>> haling in the 1st kernel or possibly in any fatal system error state.
>
> Hatayama san,
>

Hello Vivek,

> Do you have any rough numbers on what kind of speed up we are looking
> at. IOW, what % of time is gone compressing a filetered dump. On large
> memory machines, saving huge dump files is anyway not an option due to
> time it takes. So we need to filter it to bare minimum and after that
> vmcore size should be reasonable and compression time might not be a
> big factor. Hence I am curious what kind of gains we are looking at.
>

I did two kinds of benchmark: 1) to evaluate how well compression and
writing dump into multiple disks perform on crash dump, and 2) to
compare three kinds of compression algorithm --- zlib, lzo and snappy
--- for use in crash dump.

From 1), 4 disks with 4 cpus perform 300 MB/s on compression with
snappy, i.e. about 1 hour for 1 TB. But in this benchmark, sample data
is intentionally randomized enough that data size is not reduced
during compression, so it must be quicker on most actual dumps. See
also bench_comp_multi_IO.tar.xz for image of graph.
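
As a quick check of the "1 hour for 1 TB" figure above, here is a small
sketch of the arithmetic. dump_seconds() is a hypothetical helper, and
whether the benchmark used decimal (10^12, 10^6) or binary (2^40, 2^20)
units is an assumption either way; both come out a little under one
hour.

```c
#include <stdint.h>

/*
 * Hypothetical helper: seconds needed to process `bytes` of dump data
 * at a sustained throughput of `bytes_per_sec`, rounded up.
 */
static uint64_t dump_seconds(uint64_t bytes, uint64_t bytes_per_sec)
{
	return (bytes + bytes_per_sec - 1) / bytes_per_sec;
}
```

With decimal units, dump_seconds(1000000000000, 300000000) is 3334
seconds, roughly 56 minutes; with binary units,
dump_seconds(1ULL << 40, 300ULL << 20) is 3496 seconds, roughly 58
minutes.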

In the future, I'm going to do this benchmark again using quicker SSD
disks if I get them.

From 2), zlib, used when doing makedumpfile -c, turns out to be too
slow to use for crash dump. lzo and snappy are quick and have
compression ratios relatively as good as zlib's. In particular, snappy
speed is stable on any ratio of randomized part. See also
bench_compare_zlib_lzo_snappy.tar.xz for image of graph.

BTW, makedumpfile has already supported lzo since v1.4.4 and is going
to support snappy on v1.5.1.

OTOH, we have some requirements where we cannot use filtering.
Examples are:

- high-availability cluster system where application triggers crash
dump to switch the active node to inactive node quickly. We retrieve
the application image as process core dump later and analyze it. We
cannot filter user-space memory.

- On qemu/kvm environment, we sometimes face a complicated bug caused
by interaction between guest and host.

For example, previously, there was a bug causing guest machine to
hang, where IO scheduler handled guest's request as wrongly lower
request than the actual one and guest was waiting for IO completion
endlessly, repeating VMenter-VMexit forever.

To address such kind of bug, we first reproduce the bug, get host's
crash dump to capture the situation, and then analyze the bug by
investigating the situation from both host's and guest's views. On
the bug above, we saw guest machine was waiting for IO, and we could
resolve the issue relatively quickly. For this kind of complicated
bug relevant to qemu/kvm, both host and guest views are very
helpful.

guest image is in user-space memory, qemu process, and again we
cannot filter user-space memory.

- Filesystem people say page cache is often necessary for analysis of
crash dump.

Of course, we actively use filtering on systems where no such
requirement is given.

>>
>> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
>> BSP to jump into BIOS init code. A typical visible behaviour is hang
>> or immediate reset, depending on the BIOS init code.
>>
>> AP can be initiated by INIT even in a fatal state: MP spec explains
>> that processor-specific INIT can be used to recover AP from a fatal
>> system error. On the other hand, there's no method for BSP to recover;
>> it might be possible to do so by NMI plus any hand-coded reset code
>> that is carefully designed, but at least I have no idea in this
>> direction now.
>>
>> Therefore, the idea I do in this patch set is simply to disable BSP if
>> vboot cpu is AP.
>
> So in regular boot BSP still works as we boot on BSP. So this will take
> effect only in kdump kernel?
>

Yes, this patch takes effect only when the boot cpu is not the BSP but
an AP, and this happens only in the kexec case.

> How well does it work with nr_cpus kernel parameter. Currently we boot
> with nr_cpus=1 to save upon amount of memory to be reserved. I guess
> you might want to boot with nr_cpus=2 or nr_cpus=4 in your case to
> speed up compression?

Exactly, it seems reasonable to specify at most nr_cpus=4 on usual
machines because reserved memory is severely limited, and it is
difficult to connect many disks only for crash dump use without a
special requirement.

But there might be systems where crash dump must definitely be done
quickly and, for that, more reserved memory and more disks are no
problem. On such systems, I think it's necessary to be able to set up
more reserved memory and more cpus.

>
>
> [..]
>> Note: recent upstream kernel fails reserving memory for kdump 2nd
>> kernel. To run kdump, please apply the patch below on top of this
>> patch set:
>> https://lkml.org/lkml/2012/8/31/238
>
> Above is a big issue. 3.6 kernel is broken and I can't take dump on F18
> either. (works only on one machine). I have not looked enough into it
> the issue to figure out what's the issue at hand, but we really need
> atleast a stop gap fix (assuming others are working on longer term
> fix).
>

To be honest, I haven't followed this for some months, but I expect the
topic below to fix this issue. What's the status of the bzImage
limitation?

http://lists.infradead.org/pipermail/kexec/2012-September/006855.html

Thanks.
HATAYAMA, Daisuke


Attachments:
bench_comp_multi_IO.tar.xz (60.66 kB)
bench_compare_zlib_lzo_snappy.tar.xz (143.67 kB)

2012-10-18 14:15:23

by Vivek Goyal

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

On Thu, Oct 18, 2012 at 12:08:05PM +0900, HATAYAMA Daisuke wrote:

[..]
> > Do you have any rough numbers on what kind of speed up we are looking
> > at. IOW, what % of time is gone compressing a filetered dump. On large
> > memory machines, saving huge dump files is anyway not an option due to
> > time it takes. So we need to filter it to bare minimum and after that
> > vmcore size should be reasonable and compression time might not be a
> > big factor. Hence I am curious what kind of gains we are looking at.
> >
>
> I did two kinds of benchmark 1) to evaluate how well compression and
> writing dump into multiple disks performs on crash dump and 2) to
> compare three kinds of compression algorhythm --- zlib, lzo and snappy
> --- for use of crash dump.
>
> >From 1), 4 disks with 4 cpus performs 300 MB/s on compression with
> snappy. 1 hour for 1 TB. But on this benchmark, sample data is
> intentionally randomized enough so data size is not reduced during
> compression, it must be quicker on most of actual dumps. See also
> bench_comp_multi_IO.tar.xz for image of graph.

Ok, I looked at the graphs. So somehow you seem to be dumping to multiple
disks. How do you do that? Are these disks in some stripe configuration
or they are JBOD and you have written special programs to dump a
particular section of memory to a specific disk to achieve parallelism?

Looking at your graphs, 1 cpu can keep up with 4 disks and achieve
300MB/s and after that it looks like cpu saturates. Adding more disks
with 1 cpu does not help. But increasing number of cpus can keep up
with increasing number of disks and you achieve 800MB/s. Sounds good.

>
> In the future, I'm going to do this benchmark again using quicker SSD
> disks if I get them.
>
> >From 2), zlib, used when doing makedumpfile -c, turns out to be too
> slow to use it for crash dump. lzo and snappy is quick and relatively
> as good compression ratio as zlib. In particular, snappy speed is
> stable on any ratio of randomized part. See also
> bench_compare_zlib_lzo_snappy.tar.xz for image of graph.
>
> BTW, makedumpfile has already supported lzo since v1.4.4 and is going
> to support snappy on v1.5.1.
>
> OTOH, we have some requirements where we cannot use filtering.
> Examples are:
>
> - high-availability cluster system where application triggers crash
> dump to switch the active node to inactive node quickly. We retrieve
> the application image as process core dump later and analize it. We
> cannot filter user-space memory.

Do you have to really crash the node to take it offline? There should
be other ways to do this? If you are analyzing application performance
issues, why should you crash kernel and capture the whole crash dump.
There should be other ways to debug applications?

>
> - On qemu/kvm environment, we sometimes face a complicated bug caused
> by interaction between guest and host.
>
> For example, previously, there was a bug causing guest machine to
> hang, where IO scheduler handled guest's request as wrongly lower
> request than the actual one and guest was waiting for IO completion
> endlessly, repeating VMenter-VMexit forever.
>
> To address such kind of bug, we first reproduce the bug, get host's
> crash dump to capture the situation, and then analyze the bug by
> investigating the situation from both host's and guest's views. On
> the bug above, we saw guest machine was waiting for IO, and we could
> resolve the issue relatively quickly. For this kind of complicated
> bug relevant to qemu/kvm, both host and guest views are very
> helpful.
>
> guest image is in user-space memory, qemu process, and again we
> cannot filter user-space memory.

Instead of capturing the dump of whole memory, isn't it more efficient
to capture the crash dump of VM in question and then if need be just
take filtered crash dump of host kernel?

I think that trying to take unfiltered crash dumps of terabytes of memory
is not practical or worth it for most of the use cases.

>
> - Filesystem people say page cache is often necessary for analysis of
> crash dump.
>

Do you have some examples of how it helps?

> Of course, we use filtering positively on the system where no such
> requreirement is given.
>
> >>
> >> To wake up AP, we use the method called INIT-INIT-SIPI. INIT causes
> >> BSP to jump into BIOS init code. A typical visible behaviour is hang
> >> or immediate reset, depending on the BIOS init code.
> >>
> >> AP can be initiated by INIT even in a fatal state: MP spec explains
> >> that processor-specific INIT can be used to recover AP from a fatal
> >> system error. On the other hand, there's no method for BSP to recover;
> >> it might be possible to do so by NMI plus any hand-coded reset code
> >> that is carefully designed, but at least I have no idea in this
> >> direction now.
> >>
> >> Therefore, the idea I do in this patch set is simply to disable BSP if
> >> vboot cpu is AP.
> >
> > So in regular boot BSP still works as we boot on BSP. So this will take
> > effect only in kdump kernel?
> >
>
> Yes, this patch is effective only for the case where boot cpu is not
> BSP, AP, and this happens in kexec case only.
>
> > How well does it work with nr_cpus kernel parameter. Currently we boot
> > with nr_cpus=1 to save upon amount of memory to be reserved. I guess
> > you might want to boot with nr_cpus=2 or nr_cpus=4 in your case to
> > speed up compression?
>
> Exactly, it seems reasonable to specify at most nr_cpus=4 on usual
> machines becaue reserved memory is severely limited, and many disks
> are difficult to connect only for crash dump use without special
> requrement.
>
> But there might be the system where crash dump is definitely done
> quickly and for it, more reserved memory and more disks are no
> problem. On such system, I think it's necessary to be able to set up
> more reserved memory and more cpus.

We have this limitation on x86 that we can't reserve more memory. I
think for x86_64, we could not load the kernel above 896MB, due to
various limitations. So you will have to cross those barriers too if
you want to reserve more memory to capture full dumps.

So I am fine with trying to bring up more cpus in second kernel in an
effort to improve scalability but I remain skeptical about the
practicality of dumping TBs of unfiltered data after crash. Filtering
capability was the primary reason that s390 also wants to support
kdump; otherwise their firmware dumping mechanism was working just
fine.

Thanks
Vivek

2012-10-19 03:21:46

by Hatayama, Daisuke

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: Vivek Goyal
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Thu, 18 Oct 2012 10:14:49 -0400

> On Thu, Oct 18, 2012 at 12:08:05PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> > Do you have any rough numbers on what kind of speed up we are looking
>> > at. IOW, what % of time is gone compressing a filetered dump. On large
>> > memory machines, saving huge dump files is anyway not an option due to
>> > time it takes. So we need to filter it to bare minimum and after that
>> > vmcore size should be reasonable and compression time might not be a
>> > big factor. Hence I am curious what kind of gains we are looking at.
>> >
>>
>> I did two kinds of benchmark 1) to evaluate how well compression and
>> writing dump into multiple disks performs on crash dump and 2) to
>> compare three kinds of compression algorhythm --- zlib, lzo and snappy
>> --- for use of crash dump.
>>
>> >From 1), 4 disks with 4 cpus performs 300 MB/s on compression with
>> snappy. 1 hour for 1 TB. But on this benchmark, sample data is
>> intentionally randomized enough so data size is not reduced during
>> compression, it must be quicker on most of actual dumps. See also
>> bench_comp_multi_IO.tar.xz for image of graph.
>
> Ok, I looked at the graphs. So somehow you seem to be dumping to multiple
> disks. How do you do that? Are these disks in some stripe configuration
> or they are JBOD and you have written special programs to dump a
> particular section of memory to a specific disk to achieve parallelism?
>

This was neither striping like RAID0 nor JBOD. The purpose was the
same as striping, but what I did is simpler than such disk management
features: I split the crash dump into a given number of sections and
wrote them to the same number of disks separately. makedumpfile already
supports this; see makedumpfile --split.

However, I didn't use makedumpfile --split. I wrote a benchmark script
in Python that does this and used it instead. The reason is that
makedumpfile accepts only crash dump formats as input, and if I had
used makedumpfile, I would have had to convert the sample data into
some crash dump format, which was not flexible.
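
To illustrate the idea only, here is a minimal sketch of such a
split-and-write scheme. It is not the benchmark script itself: the
function names, the 4-byte length prefix and the block size are
assumptions made for this example, and zlib from the standard library
stands in for lzo or snappy.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, num_sections):
    """Return (offset, length) pairs that together cover total_size bytes."""
    base, extra = divmod(total_size, num_sections)
    ranges, offset = [], 0
    for i in range(num_sections):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def dump_section(src_path, offset, length, dst_path, block_size=4096):
    """Compress one section block by block and write it to dst_path."""
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)
        remaining = length
        while remaining > 0:
            block = src.read(min(block_size, remaining))
            if not block:
                break
            remaining -= len(block)
            packed = zlib.compress(block, 1)
            # Hypothetical framing: 4-byte little-endian size, then the data.
            dst.write(len(packed).to_bytes(4, "little"))
            dst.write(packed)
            written += 4 + len(packed)
    return written


def parallel_dump(src_path, dst_paths):
    """Write one section per destination path, one worker per section."""
    ranges = split_ranges(os.path.getsize(src_path), len(dst_paths))
    with ThreadPoolExecutor(max_workers=len(dst_paths)) as pool:
        futures = [pool.submit(dump_section, src_path, off, length, dst)
                   for (off, length), dst in zip(ranges, dst_paths)]
        return [f.result() for f in futures]
```

Each entry of dst_paths would be a file on a different disk, so that
every worker compresses and writes its own section independently.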

> Looking at your graphs, 1 cpu can keep up with 4 disks and achieve
> 300MB/s and after that it looks like cpu saturates. Adding more disks
> with 1 cpu does not help. But increasing number of cpus can keep up
> with increasing number of disks and you achieve 800MB/s. Sounds good.
>

BTW, recent SSDs reach over 1000MB/s at maximum. lzo and snappy show
200MB/s in the worst case of my compression benchmark. This is too
slow to use. I guess the compression block size needs to be
increased. Then compression needs more cpu power, and the difference
between cpu=1 and cpu=4 in the 4-disk case becomes clearer.

>>
>> In the future, I'm going to do this benchmark again using quicker SSD
>> disks if I get them.
>>
>> >From 2), zlib, used when doing makedumpfile -c, turns out to be too
>> slow to use it for crash dump. lzo and snappy is quick and relatively
>> as good compression ratio as zlib. In particular, snappy speed is
>> stable on any ratio of randomized part. See also
>> bench_compare_zlib_lzo_snappy.tar.xz for image of graph.
>>
>> BTW, makedumpfile has already supported lzo since v1.4.4 and is going
>> to support snappy on v1.5.1.
>>
>> OTOH, we have some requirements where we cannot use filtering.
>> Examples are:
>>
>> - high-availability cluster system where application triggers crash
>> dump to switch the active node to inactive node quickly. We retrieve
>> the application image as process core dump later and analize it. We
>> cannot filter user-space memory.
>
> Do you have to really crash the node to take it offline? There should
> be other ways to do this? If you are analyzing application performance
> issues, why should you crash kernel and capture the whole crash dump.
> There should be other ways to debug applications?
>

Certainly, this argument might be weak in the current situation where
memory is huge. I only wanted to say here that we sometimes need
user-space memory too.

>>
>> - On qemu/kvm environment, we sometimes face a complicated bug caused
>> by interaction between guest and host.
>>
>> For example, previously, there was a bug causing guest machine to
>> hang, where IO scheduler handled guest's request as wrongly lower
>> request than the actual one and guest was waiting for IO completion
>> endlessly, repeating VMenter-VMexit forever.
>>
>> To address such kind of bug, we first reproduce the bug, get host's
>> crash dump to capture the situation, and then analyze the bug by
>> investigating the situation from both host's and guest's views. On
>> the bug above, we saw guest machine was waiting for IO, and we could
>> resolve the issue relatively quickly. For this kind of complicated
>> bug relevant to qemu/kvm, both host and guest views are very
>> helpful.
>>
>> guest image is in user-space memory, qemu process, and again we
>> cannot filter user-space memory.
>
> Instead of capturing the dump of whole memory, isn't it more efficient
> to capture the crash dump of VM in question and then if need be just
> take filtered crash dump of host kernel.
>
> I think that trying to take unfiltered crash dumps of tera bytes of memory
> is not practical or woth it for most of the use cases.
>

If there's a lag between the VM dump and the host dump, the situation
on the host can change, and taking the VM dump itself changes the
situation. At that point we cannot yet know what kind of bug we are
facing, so we want to do as few things as possible between detecting
that the bug has been reproduced and taking the host dump. That is
what I meant by ``capturing the situation''.

>>
>> - Filesystem people say page cache is often necessary for analysis of
>> crash dump.
>>
>
> Do you have some examples of how does it help?
>

Sorry, I have only heard this and don't know the use case in more
detail now.

<cut>
>> > How well does it work with nr_cpus kernel parameter. Currently we boot
>> > with nr_cpus=1 to save upon amount of memory to be reserved. I guess
>> > you might want to boot with nr_cpus=2 or nr_cpus=4 in your case to
>> > speed up compression?
>>
>> Exactly, it seems reasonable to specify at most nr_cpus=4 on usual
>> machines becaue reserved memory is severely limited, and many disks
>> are difficult to connect only for crash dump use without special
>> requrement.
>>
>> But there might be the system where crash dump is definitely done
>> quickly and for it, more reserved memory and more disks are no
>> problem. On such system, I think it's necessary to be able to set up
>> more reserved memory and more cpus.
>
> We have this limitation of on x86 that we can't reserve more memory. I
> think for x86_64, we could not load kernel above 896MB, due to
> various limitations. So you will have to cross those barriers too if
> you want to reserve more memory to capture full dumps.
>
> So I am fine with trying to bring up more cpus in second kernel in an
> effort to improve scalability but I remain skeptical about the
> practicality of dumping TBs of unfiltered data after crash. Filtering

I also don't want to capture unnecessary parts because that only makes
the dump slow. Taking the whole memory would be nonsense. So I think
another feature might also be needed here. For example, it might be
useful if we could specify user-space filtering per process.

But even so, I suspect that a crash dump containing only the
user-space memory we need can sometimes reach 500GB ~ 1TB. Then it
takes 2 ~ 4 hours to take the dump on one cpu and one disk.
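
As a rough check of the figures in this thread, here is a small sketch
in Python. It assumes decimal units (1 MB = 10^6 bytes), and the
function names are only illustrative.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def dump_hours(size_bytes, rate_bytes_per_sec):
    """Hours needed to write size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_sec / 3600.0


def implied_rate_mb_s(size_bytes, hours):
    """Sustained rate in MB/s implied by a dump size and its duration."""
    return size_bytes / (hours * 3600.0) / MB


if __name__ == "__main__":
    # 4 cpus and 4 disks at 300 MB/s, for a 1 TB dump.
    print("1 TB at 300 MB/s: %.2f hours" % dump_hours(TB, 300 * MB))
    # One cpu and one disk: 500 GB in 2 hours, or 1 TB in 4 hours.
    print("500 GB in 2 hours: %.1f MB/s" % implied_rate_mb_s(500 * GB, 2))
    print("1 TB in 4 hours: %.1f MB/s" % implied_rate_mb_s(TB, 4))
```

Under these assumptions, 1 TB at 300 MB/s takes a little under one
hour, and both 500GB in 2 hours and 1TB in 4 hours correspond to
roughly 70 MB/s.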

> capability was the primary reason that s390 also wants to support
> kdump otherwise there firmware dumping mechanism was working just
> fine.
>

I don't know the s390 firmware dumping mechanism at all, but is it
possible for s390 to filter the crash dump even with the firmware
dumping mechanism?

Thanks.
HATAYAMA, Daisuke

2012-10-19 15:18:26

by Vivek Goyal

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

On Fri, Oct 19, 2012 at 12:20:54PM +0900, HATAYAMA Daisuke wrote:

[..]
> > Instead of capturing the dump of whole memory, isn't it more efficient
> > to capture the crash dump of VM in question and then if need be just
> > take filtered crash dump of host kernel.
> >
> > I think that trying to take unfiltered crash dumps of tera bytes of memory
> > is not practical or woth it for most of the use cases.
> >
>
> If there's a lag between VM dump and host dump, situation on the host
> can change, and VM dump itself changes the situation. Then, we cannot
> know what kind of bug resides in now, so we want to do as few things
> as possible between detecting the bug reproduced and taking host
> dump. So I expressed ``capturing the situation''.

I would rather first detect the problem on guest and figure out what's
happening. Once it has been determined that something is wrong from
host side then debug what's wrong with host by using regular kernel
debugging techniques.

Even if you are interested in capturing crash dump, after you have
decided that it is a host problem, then you can write some scripts which
trigger host crash dump when relevant event happens.

Seriously, this argument could be extended to regular processes also.
Something is wrong with my application, so let's dump the whole system,
provide a facility to extract each process's core dump from that huge
dump and then examine if it was an application issue or kernel issue.

I am skeptical that this approach is going to fly in practice. Dumping
huge images, processing and transferring these is not very practical.
So I would rather narrow down the problem on a running system and take
filtered dump of appropriate component where I suspect the problem is.

[..]
> > capability was the primary reason that s390 also wants to support
> > kdump otherwise there firmware dumping mechanism was working just
> > fine.
> >
>
> I don't know s390 firmware dumping mechanism at all, but is it possble
> for s390 to filter crash dump even on firmware dumping mechanism?

AFAIK, s390 dump mechanism could not filter dump and that's the reason
they wanted to support kdump and /proc/vmcore so that makedumpfile
could filter it. I am CCing Michael Holzheu, who did the s390 kdump work.
He can tell it better.

Thanks
Vivek

2012-10-22 06:33:04

by Hatayama, Daisuke

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

From: Vivek Goyal
Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
Date: Fri, 19 Oct 2012 11:17:53 -0400

> On Fri, Oct 19, 2012 at 12:20:54PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> > Instead of capturing the dump of whole memory, isn't it more efficient
>> > to capture the crash dump of VM in question and then if need be just
>> > take filtered crash dump of host kernel.
>> >
>> > I think that trying to take unfiltered crash dumps of tera bytes of memory
>> > is not practical or woth it for most of the use cases.
>> >
>>
>> If there's a lag between VM dump and host dump, situation on the host
>> can change, and VM dump itself changes the situation. Then, we cannot
>> know what kind of bug resides in now, so we want to do as few things
>> as possible between detecting the bug reproduced and taking host
>> dump. So I expressed ``capturing the situation''.
>
> I would rather first detect the problem on guest and figure out what's
> happening. Once it has been determined that something is wrong from
> host side then debug what's wrong with host by using regular kernel
> debugging techiniques.
>
> Even if you are interested in capturing crash dump, after you have
> decided that it is a host problem, then you can write some scripts which
> trigger host crash dump when relevant event happens.
>
> Seriously, this argument could be extended to regular processes also.
> Something is wrong with my application, so lets dump the whole system,
> provide a facility to extract each process's code dump from that huge
> dump and then examine if it was an application issue or kernel issue.
>
> I am skeptical that this approach is going to fly in practice. Dumping
> huge images, processing and transferring these is not very practical.
> So I would rather narrow down the problem on a running system and take
> filtered dump of appropriate component where I suspect the problem is.
>
> [..]

Such bugs are often complicated and often take some time to
reproduce. Once a bug is reproduced, we cannot use the system where it
was reproduced for a long time, since the system is mainly managed by
other teams under some project and there are many tasks that must be
done on it. It would be best to proceed with debugging little by
little as you suggest, but that is difficult for practical reasons.

I can understand that crash dumps of terabytes cannot support our
method perfectly. We have worked this way without problems so far
because memory was small enough that the crash dump never took long to
complete. But filtering means that data necessary for debugging might
be excluded. This is the worst case, and we must avoid it. Considering
the trade-off between recent large memory sizes and filtering, I think
filtering with finer granularity is needed. I want to consider this as
future work.

BTW, I feel you basically show a positive attitude toward this patch
set itself. I believe that kdump has merits only. Could you tell me
what is needed for this patch set to be acked by you?

>> > capability was the primary reason that s390 also wants to support
>> > kdump otherwise there firmware dumping mechanism was working just
>> > fine.
>> >
>>
>> I don't know s390 firmware dumping mechanism at all, but is it possble
>> for s390 to filter crash dump even on firmware dumping mechanism?
>
> AFAIK, s390 dump mechanism could not filter dump and tha's the reason
> they wanted to support kdump and /proc/vmcore so that makedumpfile
> could filter it. I am CCing Michael Holzheu, who did the s390 kdump work.
> He can tell it better.
>

Hmm, we at Fujitsu also have a firmware dump mechanism, and it cannot
filter memory either. I think it would be a similar situation.

Thanks.
HATAYAMA, Daisuke

2012-10-22 16:03:17

by H. Peter Anvin

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

On 10/15/2012 11:38 PM, HATAYAMA Daisuke wrote:
>
> Thanks for pointing out this. And I've recalled my investigation in
> the past now. So I want to stop retrying your patch v9 now. This NMI
> method is definitely not applicable to 2nd kernel without any change.
>
> Your NMI method assumes BSP thread is halting in play dead loop. But
> on the 2nd kernel, BSP is halting in the 1st kernel (or possibly in a
> fatail system error). Even if throwing NMI to BSP, it goes back to the
> 1st kernel soon again. I at least confirmed NMI handler is executed in
> this case.
>
> Also, throwing NMI changes stack in the 1st kernel, which is
> unpermissible from kdump's perspective. But x86_64 uses Interrupt
> Stack Table (IST), in which stack switching is not performed. So 2nd
> kernel's stack is used at least on x86_64.
>
> To sum up, to apply NMI method in the 2nd kernel, I think it necessary
> to modify contexts pushed on the stack so execution goes to the 2nd
> kernel's start_secondary() while initializing its state
> appropreately.
>
> Also I think it necessary to discuss whether this NMI method is
> reliable enough for kdump use.
>

I think it's pretty clear it is *not*. Either NMI or a monitor would
have to rely on context set up by the first kernel, which simply isn't
safe. Out of those two options, a monitor would actually be safer,
since it can be self-contained in a completely different way.

However, it seems that running on N-1 CPUs in kdump is perfectly acceptable.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-10-22 17:10:24

by Michael Holzheu

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

On Fri, 19 Oct 2012 11:17:53 -0400
Vivek Goyal wrote:

[..]

> On Fri, Oct 19, 2012 at 12:20:54PM +0900, HATAYAMA Daisuke wrote:
> I am skeptical that this approach is going to fly in practice. Dumping
> huge images, processing and transferring these is not very practical.
> So I would rather narrow down the problem on a running system and take
> filtered dump of appropriate component where I suspect the problem is.
>
> [..]
> > > capability was the primary reason that s390 also wants to support
> > > kdump otherwise there firmware dumping mechanism was working just
> > > fine.
> > >
> >
> > I don't know s390 firmware dumping mechanism at all, but is it
> > possble for s390 to filter crash dump even on firmware dumping
> > mechanism?
>
> AFAIK, s390 dump mechanism could not filter dump and tha's the reason
> they wanted to support kdump and /proc/vmcore so that makedumpfile
> could filter it. I am CCing Michael Holzheu, who did the s390 kdump
> work. He can tell it better.

Correct. The other s390 dump mechanisms (stand-alone and hypervisor
dump) are not able to do filtering and therefore the handling of large
dumps has been a problem for us.

This was one of the main reasons to implement kdump on s390.

Michael

2012-10-22 18:38:17

by Vivek Goyal

Subject: Re: [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP

On Mon, Oct 22, 2012 at 03:32:19PM +0900, HATAYAMA Daisuke wrote:

[..]
> BTW, I feel you basically shows positive attitude to this patch set
> itself. I believe that kdump has merits only. Could you tell me what
> is needed for this patch set to be acked by you?

I don't have any objections to bringing up more than 1 cpu in second
kernel. Somebody who knows the code better in this area needs to provide
Reviewed-by/Acked-by though.

Thanks
Vivek

2012-10-22 20:04:36

by Eric W. Biederman

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

HATAYAMA Daisuke writes:

> We disable BSP if boot cpu is AP.
>
> INIT-INIT-SIPI sequence, a protocal to initiate AP, cannot be used for
> BSP since it causes BSP jump to BIOS init code; typical visible
> behaviour is hang or immediate reset, depending on the BIOS init code.
>
> INIT can be used to reset AP in a fatal system error state as
> described in MP spec 3.7.3 Processor-specific INIT. In contrast, there
> is no processor-specific INIT for BSP to initilize from a fatal system
> error. It might be possible to do so by NMI plus any hand-crafted
> reset code that is carefully designed, but at least I have no idea in
> this direction now.

Has anyone looked at clearing bit 8 of the IA32_APIC_BASE_MSR (0x1B) on
the bootstrap processor? Bit 8 being the bit that indicates we are a
bootstrap processor.

If we can clear that bit INIT will always place the processor in
wait-for-startup-ipi mode and we won't have this problem.

That would also solve hot-unplugging the bootstrap processor without
using an NMI.



If clearing bit 8 doesn't work and we have to go with a variant of
magically detecting the bootstrap processor, I would make the logic:

1. Test bit 8 to see if we are on the bootstrap processor.
2. If we are not on the bootstrap processor and we don't have a table
that will tell us, guess that the bootstrap processor has apic id 0.

It is so overwhelmingly common that the bootstrap processor has apic
id 0 that any other assumption is silly.

More important is testing to see if we are on the bootstrap processor
and, if we are, not disabling any cpus in that case, as that will
guarantee that introducing code to not start the bootstrap processor
won't cause a regression outside of the kexec on panic case.
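
The decision logic above could be sketched roughly as follows. This is
only an illustration in Python, not kernel code: the function names and
the table_bsp_apic_id parameter are hypothetical, and the MSR value is
passed in as a plain integer rather than read from hardware.

```python
IA32_APIC_BASE_MSR = 0x1B   # MSR address named in the message above
APIC_BASE_BSP = 1 << 8      # bit 8: set on the bootstrap processor


def is_bsp(apic_base_msr_value):
    """Step 1: test bit 8 of the IA32_APIC_BASE MSR value."""
    return bool(apic_base_msr_value & APIC_BASE_BSP)


def apic_ids_to_skip(boot_cpu_msr_value, table_bsp_apic_id=None):
    """Return the set of apic ids that should not be woken up.

    boot_cpu_msr_value is the IA32_APIC_BASE value read on the cpu we
    booted on. table_bsp_apic_id is a hypothetical stand-in for "a
    table that will tell us" which apic id belongs to the BSP.
    """
    if is_bsp(boot_cpu_msr_value):
        # We booted on the BSP: disable nothing, so a regular boot is
        # unaffected.
        return set()
    if table_bsp_apic_id is not None:
        return {table_bsp_apic_id}
    # Step 2: no table available, so guess that the BSP has apic id 0.
    return {0}
```

For example, a value such as 0xFEE00900 has bit 8 set and yields an
empty set, while 0xFEE00800 with no table yields {0}.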

Eric

2012-10-22 20:17:26

by H. Peter Anvin

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

On 10/22/2012 01:04 PM, Eric W. Biederman wrote:
> HATAYAMA Daisuke <[email protected]> writes:
>
>> We disable BSP if boot cpu is AP.
>>
>> INIT-INIT-SIPI sequence, a protocal to initiate AP, cannot be used for
>> BSP since it causes BSP jump to BIOS init code; typical visible
>> behaviour is hang or immediate reset, depending on the BIOS init code.
>>
>> INIT can be used to reset AP in a fatal system error state as
>> described in MP spec 3.7.3 Processor-specific INIT. In contrast, there
>> is no processor-specific INIT for BSP to initilize from a fatal system
>> error. It might be possible to do so by NMI plus any hand-crafted
>> reset code that is carefully designed, but at least I have no idea in
>> this direction now.
>
> Has anyone looked at clearing bit 8 of the IA32_APIC_BASE_MSR (0x1B) on
> the bootstrap processor? Bit 8 being the bit that indicates we are a
> bootstrap processor.
>
> If we can clear that bit INIT will always place the processor in
> wait-for-startup-ipi mode and we won't have this problem.
>
> That would also solve the hotunplug the bootstrap processor without
> using an NMI as well.
>

IIRC Fenghua experimented with that and it didn't work. Not all BIOSes
use that bit to determine BSP-ness.

-hpa

2012-10-22 20:31:37

by Eric W. Biederman

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

"H. Peter Anvin" <[email protected]> writes:

> On 10/22/2012 01:04 PM, Eric W. Biederman wrote:
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>>> We disable BSP if boot cpu is AP.
>>>
>>> INIT-INIT-SIPI sequence, a protocal to initiate AP, cannot be used for
>>> BSP since it causes BSP jump to BIOS init code; typical visible
>>> behaviour is hang or immediate reset, depending on the BIOS init code.
>>>
>>> INIT can be used to reset AP in a fatal system error state as
>>> described in MP spec 3.7.3 Processor-specific INIT. In contrast, there
>>> is no processor-specific INIT for BSP to initilize from a fatal system
>>> error. It might be possible to do so by NMI plus any hand-crafted
>>> reset code that is carefully designed, but at least I have no idea in
>>> this direction now.
>>
>> Has anyone looked at clearing bit 8 of the IA32_APIC_BASE_MSR (0x1B) on
>> the bootstrap processor? Bit 8 being the bit that indicates we are a
>> bootstrap processor.
>>
>> If we can clear that bit INIT will always place the processor in
>> wait-for-startup-ipi mode and we won't have this problem.
>>
>> That would also solve the hotunplug the bootstrap processor without
>> using an NMI as well.
>>
>
> IIRC Fenghua experimented with that and it didn't work. Not all BIOSes
> use that bit to determine BSP-ness.

What does a BIOS have to do with anything?

The practical issue here is does an INIT IPI cause the cpu to go into
startup-ipi-wait or to start booting at 4G-16 bytes.

For dealing with BIOSen we may still need to use the bootstrap processor
for firmware calls, cpu suspend, and other firmware weirdness, but that
should all be completely orthogonal to what happens when an INIT IPI
is sent to the cpu.

The only firmware problem I can imagine having is a cpu virtualization
bug.

Eric

2012-10-22 20:33:44

by H. Peter Anvin

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

On 10/22/2012 01:31 PM, Eric W. Biederman wrote:
>>
>> IIRC Fenghua experimented with that and it didn't work. Not all BIOSes
>> use that bit to determine BSP-ness.
>
> What does a BIOS have to do with anything?
>
> The practical issue here is does an INIT IPI cause the cpu to go into
> startup-ipi-wait or to start booting at 4G-16 bytes.
>
> For dealing with BIOSen we may still need to use the bootstrap processor
> for firmware calls, cpu suspend, and other firmware weirdness, but that
> should all be completely orthogonal to the behavior to what happens
> when an INIT IPI is sent to the cpu.
>
> The only firmware problem I can imagine having is cpu virtualization
> bug.
>

The whole problem is that some BIOSes go wonky after receiving an INIT
(as in INIT-SIPI-SIPI) to the BSP.

-hpa

2012-10-22 20:39:24

by H. Peter Anvin

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

On 10/22/2012 01:33 PM, H. Peter Anvin wrote:
> On 10/22/2012 01:31 PM, Eric W. Biederman wrote:
>>>
>>> IIRC Fenghua experimented with that and it didn't work. Not all BIOSes
>>> use that bit to determine BSP-ness.
>>
>> What does a BIOS have to do with anything?
>>
>> The practical issue here is does an INIT IPI cause the cpu to go into
>> startup-ipi-wait or to start booting at 4G-16 bytes.
>>
>> For dealing with BIOSen we may still need to use the bootstrap processor
>> for firmware calls, cpu suspend, and other firmware weirdness, but that
>> should all be completely orthogonal to the behavior to what happens
>> when an INIT IPI is sent to the cpu.
>>
>> The only firmware problem I can imagine having is cpu virtualization
>> bug.
>>
>
> The whole problem is that some BIOSes go wonky after receiving an INIT
> (as in INIT-SIPI-SIPI) to the BSP.
>

(I presume the failure is SMM-related, but Fenghua might have more
details...)

-hpa

2012-10-22 20:43:34

by Eric W. Biederman

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

"H. Peter Anvin" <[email protected]> writes:

> On 10/22/2012 01:31 PM, Eric W. Biederman wrote:
>>>
>>> IIRC Fenghua experimented with that and it didn't work. Not all BIOSes
>>> use that bit to determine BSP-ness.
>>
>> What does a BIOS have to do with anything?
>>
>> The practical issue here is does an INIT IPI cause the cpu to go into
>> startup-ipi-wait or to start booting at 4G-16 bytes.
>>
>> For dealing with BIOSen we may still need to use the bootstrap processor
>> for firmware calls, cpu suspend, and other firmware weirdness, but that
>> should all be completely orthogonal to the behavior to what happens
>> when an INIT IPI is sent to the cpu.
>>
>> The only firmware problem I can imagine having is cpu virtualization
>> bug.
>>
>
> The whole problem is that some BIOSes go wonky after receiving an INIT
> (as in INIT-SIPI-SIPI) to the BSP.

The reason the BIOSen go wonky is that the INIT causes the cpu to go to
the reset vector at 4G-16 bytes. So it is very much expected that the
BIOSen start acting like you just came out of reset.

If you can clear bit 8 of IA32_APIC_BASE_MSR and so tell the cpu not to
go to 4G-16 bytes but instead to enter its magic startup-ipi-wait mode,
then the BIOSen will not be involved on that path.

It is a simple question of whether the cpu supports clearing bit 8
meaningfully.

If the cpu allows bit 8 to be cleared and still goes to the reset
vector on receipt of the INIT IPI, I would call that a deviation from
the x86 cpu specification.

So clearing bit 8 is not a question about BIOSen; it is a question of
whether we can avoid the BIOSen by using an obscure, under-documented
cpu feature.

Eric
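
To make the suggestion concrete, here is a minimal sketch in C of the bit
manipulation involved. The MSR number (0x1B) and the BSP flag position
(bit 8) come from the message above; the helper names are hypothetical,
and the functions operate on a plain 64-bit value rather than touching
the MSR, so this is an illustration and not the patch under discussion.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_APIC_BASE MSR number and the two flag bits used in this sketch. */
#define MSR_IA32_APICBASE        0x1bu
#define MSR_IA32_APICBASE_BSP    (1ull << 8)   /* set on the bootstrap processor */
#define MSR_IA32_APICBASE_ENABLE (1ull << 11)  /* APIC global enable */

/* Hypothetical helper: true if the MSR value marks its CPU as the BSP. */
static bool apic_base_is_bsp(uint64_t val)
{
    return (val & MSR_IA32_APICBASE_BSP) != 0;
}

/*
 * Hypothetical helper: return val with only the BSP flag cleared.
 * The APIC base address and the enable bit are left untouched.
 */
static uint64_t apic_base_clear_bsp(uint64_t val)
{
    return val & ~MSR_IA32_APICBASE_BSP;
}
```

In kernel code the value would presumably be read and written back on
the BSP itself with the existing rdmsrl()/wrmsrl() accessors; whether
the cpu then honours the cleared flag on INIT is exactly the open
question in this thread.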

2012-10-22 20:47:52

by H. Peter Anvin

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

On 10/22/2012 01:43 PM, Eric W. Biederman wrote:
>
> The reason the BIOSen go wonky is the INIT cause the cpu to go to the
> reset vector at 4G-16 bytes. So it is very much expected that the
> BIOSen start acting like you just came out of reset.
>
> If you can clear bit 8 of IA32_APIC_BASE_MSR and inform the cpu to not
> send the cpu to 4G-16 bytes and instead send the cpu into it's magic
> startup-ipi-wait mode then the BIOSen will not be involved on that path.
>
> It is a simple question of does the cpu support clearing bit 8
> meaningfully.
>
> If the cpu allows bit 8 to be cleared and sends the cpu to the reset
> vector on receipt of the INIT IPI I would call that a deviation from the
> x86 cpu specification.
>
> So clearing bit 8 is not a question about BIOSen it is a question of can
> we avoid the BIOSen, by using an obscure under-documented cpu feature.
>

As I said, I thought Fenghua tried that but it didn't work, experimentally.

-hpa

2012-10-22 21:29:42

by Eric W. Biederman

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

"H. Peter Anvin" <[email protected]> writes:

> On 10/22/2012 01:43 PM, Eric W. Biederman wrote:
>>
>> The reason the BIOSen go wonky is the INIT cause the cpu to go to the
>> reset vector at 4G-16 bytes. So it is very much expected that the
>> BIOSen start acting like you just came out of reset.
>>
>> If you can clear bit 8 of IA32_APIC_BASE_MSR and inform the cpu to not
>> send the cpu to 4G-16 bytes and instead send the cpu into it's magic
>> startup-ipi-wait mode then the BIOSen will not be involved on that path.
>>
>> It is a simple question of does the cpu support clearing bit 8
>> meaningfully.
>>
>> If the cpu allows bit 8 to be cleared and sends the cpu to the reset
>> vector on receipt of the INIT IPI I would call that a deviation from the
>> x86 cpu specification.
>>
>> So clearing bit 8 is not a question about BIOSen it is a question of can
>> we avoid the BIOSen, by using an obscure under-documented cpu feature.
>>
>
> As I said, I thought Fenghua tried that but it didn't work, experimentally.

Fair enough. You described the problem with clearing bit 8 in a weird
way.

If the best we can muster are fuzzy memories it may be worth revisiting.
Perhaps it works on enough cpu models to be interesting.

Eric

2012-10-23 00:36:30

by H. Peter Anvin

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

On 10/22/2012 02:29 PM, Eric W. Biederman wrote:
>>
>> As I said, I thought Fenghua tried that but it didn't work, experimentally.
>
> Fair enough. You described the problem with clearing bit 8 in a weird
> way.
>
> If the best we can muster are fuzzy memories it may be worth revisiting.
> Perhaps it works on enough cpu models to be interesting.
>

It isn't fuzzy memories... this was done as late as 1-2 months ago. I
just don't know the details.

Fenghua, could you help fill us in?

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-10-26 03:24:53

by Hatayama, Daisuke

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

From: "H. Peter Anvin" <[email protected]>
Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
Date: Mon, 22 Oct 2012 17:35:47 -0700

> On 10/22/2012 02:29 PM, Eric W. Biederman wrote:
>>>
>>> As I said, I thought Fenghua tried that but it didn't work,
>>> experimentally.
>>
>> Fair enough. You described the problem with clearing bit 8 in a weird
>> way.
>>
>> If the best we can muster are fuzzy memories it may be worth
>> revisiting.
>> Perhaps it works on enough cpu models to be interesting.
>>
>
> It isn't fuzzy memories... this was done as late as 1-2 months ago. I
> just don't know the details.
>
> Fenghua, could you help fill us in?
>

I completely overlooked the fact that the BSP flag is writable.

I tried Eric's suggestion using the attached test programs and saw it
work fine on at least the three cpus I have at hand:

- Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz
- Intel(R) Xeon(R) CPU E7- 8870 @ 2.40GHz
- Intel(R) Xeon(TM) CPU 1.80GHz
  - 32-bit CPU

Next I found the description about this in 8.4.2, IASDM Vol.3:

The MP initialization protocol imposes the following requirements
and restrictions on the system:

* The MP protocol is executed only after a power-up or RESET. If the
MP protocol has completed and a BSP is chosen, subsequent INITs
(either to a specific processor or system wide) do not cause the
MP protocol to be repeated. Instead, each logical processor
examines its BSP flag (in the IA32_APIC_BASE MSR) to determine
whether it should execute the BIOS boot-strap code (if it is the
BSP) or enter a wait-for-SIPI state (if it is an AP).

So this is no longer undocumented behaviour for recent cpus, I think.

Considering these, I'll make a patch to clear the BSP flag at an
appropriate position in the kernel boot-up code. OTOH, according to the
discussion, it was reported that clearing the BSP flag affected some
BIOSes. To deal with this, I'll prepare a kernel option to decide
whether to clear the BSP flag or not.

Does anyone have any comments now? Or please comment after I submit a
new patch.

Thanks.
HATAYAMA, Daisuke


Attachments:
bsp_flag_modules.tar.bz2 (9.08 kB)
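
The SDM passage quoted in this message amounts to a two-way decision on
each INIT. A minimal model of that decision in C follows, assuming only
what the quoted text states; the enum, the function names and the
snapshot array are hypothetical and exist purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define APIC_BASE_BSP (1ull << 8) /* BSP flag in IA32_APIC_BASE */

/* What a logical processor does on INIT, per the quoted SDM text. */
enum init_action {
    INIT_RUN_BIOS_BOOTSTRAP, /* BSP flag set: execute BIOS boot-strap code */
    INIT_WAIT_FOR_SIPI       /* BSP flag clear: enter wait-for-SIPI state */
};

/* Hypothetical helper: classify one CPU from its IA32_APIC_BASE value. */
static enum init_action classify_init(uint64_t apic_base)
{
    return (apic_base & APIC_BASE_BSP) ? INIT_RUN_BIOS_BOOTSTRAP
                                       : INIT_WAIT_FOR_SIPI;
}

/* Count the CPUs in a snapshot that an INIT would send into the BIOS. */
static size_t count_bios_bound(const uint64_t *apic_base, size_t ncpus)
{
    size_t i, n = 0;

    for (i = 0; i < ncpus; i++)
        if (classify_init(apic_base[i]) == INIT_RUN_BIOS_BOOTSTRAP)
            n++;
    return n;
}
```

Under this model, INIT-INIT-SIPI is only safe to send to CPUs for which
the count above is zero, which is why clearing the flag on the original
BSP matters for the kdump case.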

2012-10-26 04:13:45

by Eric W. Biederman

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

HATAYAMA Daisuke <[email protected]> writes:

> From: "H. Peter Anvin" <[email protected]>
> Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
> Date: Mon, 22 Oct 2012 17:35:47 -0700
>
>> On 10/22/2012 02:29 PM, Eric W. Biederman wrote:
>>>>
>>>> As I said, I thought Fenghua tried that but it didn't work,
>>>> experimentally.
>>>
>>> Fair enough. You described the problem with clearing bit 8 in a weird
>>> way.
>>>
>>> If the best we can muster are fuzzy memories it may be worth
>>> revisiting.
>>> Perhaps it works on enough cpu models to be interesting.
>>>
>>
>> It isn't fuzzy memories... this was done as late as 1-2 months ago. I
>> just don't know the details.
>>
>> Fenghua, could you help fill us in?
>>
>
> I overlooked completely the fact that BSP flag is rewritable.
>
> I tried Eric's suggestion using attached test programs and saw it
> worked fine at least on the three cpus around me below:
>
> - Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz
> - Intel(R) Xeon(R) CPU E7- 8870 @ 2.40GHz
> - Intel(R) Xeon(TM) CPU 1.80GHz
> - 32 bits CPU
>
> Next I found the description about this in 8.4.2, IASDM Vol.3:
>
> The MP initialization protocol imposes the following requirements
> and restrictions on the system:
>
> * The MP protocol is executed only after a power-up or RESET. If the
> MP protocol has completed and a BSP is chosen, subsequent INITs
> (either to a specific processor or system wide) do not cause the
> MP protocol to be repeated. Instead, each logical processor
> examines its BSP flag (in the IA32_APIC_BASE MSR) to determine
> whether it should execute the BIOS boot-strap code (if it is the
> BSP) or enter a wait-for-SIPI state (if it is an AP).
>
> So this is no longer undocumented behaviour for recent cpus, I think.

The underdocumented bit is the ability to clear the flag.
And of course these are processor specific registers.

> Considering these, I'll make a patch to clear BSP flag at appropriate
> position in kernel boot-up code. OTOH, according to the discussion, it
> was reported that clearing BSP flag affected some BIOSes. To deal with
> this, I'll prepare a kernel option to decide whether to clear BSP flag
> or not.
>
> Does anyone have any comments now? Or please comment after I submit a
> new patch.

I think you are on the right track with preparing some patches, and this
certainly looks worth experimenting with.

At least for i386 the code needs to verify you have a cpu new enough to
have an APIC_BASE_MSR, but I don't think that is going to be hard.

Eric
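
A rough sketch of the kind of check Eric mentions, in C. It assumes that
"new enough" means CPUID reports MSR support (CPUID.01H:EDX bit 5) and a
family of 6 or later, which is where IA32_APIC_BASE is listed as being
introduced; both the threshold and the helper names are assumptions made
for this illustration, not something settled in the thread. The
functions take raw CPUID leaf 1 register values so they can be exercised
without running on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID1_EDX_MSR (1u << 5) /* RDMSR/WRMSR supported */

/* Decode the display family from CPUID leaf 1 EAX. */
static unsigned int cpuid1_family(uint32_t eax)
{
    unsigned int family = (eax >> 8) & 0xf;

    /* The extended family field only applies when base family is 0xf. */
    if (family == 0xf)
        family += (eax >> 20) & 0xff;
    return family;
}

/*
 * Hypothetical helper: true if the cpu is assumed to implement
 * IA32_APIC_BASE, i.e. it supports MSRs and is family 6 or later.
 */
static bool cpu_may_have_apic_base_msr(uint32_t eax, uint32_t edx)
{
    return (edx & CPUID1_EDX_MSR) && cpuid1_family(eax) >= 6;
}
```

In the kernel itself the same information is already available through
boot_cpu_data, so real code would more likely test that than issue
CPUID again.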

2013-03-11 01:07:31

by Hatayama, Daisuke

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
Date: Thu, 25 Oct 2012 21:13:25 -0700

> HATAYAMA Daisuke <[email protected]> writes:
>
>> From: "H. Peter Anvin" <[email protected]>
>> Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
>> Date: Mon, 22 Oct 2012 17:35:47 -0700
>>
>>> On 10/22/2012 02:29 PM, Eric W. Biederman wrote:
<cut>
>> Considering these, I'll make a patch to clear BSP flag at appropriate
>> position in kernel boot-up code. OTOH, according to the discussion, it
>> was reported that clearing BSP flag affected some BIOSes. To deal with
>> this, I'll prepare a kernel option to decide whether to clear BSP flag
>> or not.
>>
>> Does anyone have any comments now? Or please comment after I submit a
>> new patch.
>
> I think you are on right track with preparing some patches, and this
> certainly looks like worth experimenting with.
>
> At least for i386 the code need to verify you have a cpu new enough to
> have an APIC_BASE_MSR, but I don't think that is going to be hard.

Eric, you have probably forgotten this work, but I want to restart the
work to allow multiple CPUs on the 2nd kernel. During my investigation,
however, I came across a question about the inconsistent states that
the kdump framework assumes in the crash path on the 1st kernel.

I'm now re-investigating how to unset the BSP flag on the 1st kernel in
a safe manner. But then I must consider the possibility of the BSP flag
being set again after it has been unset. This includes firmware that
assumes the BSP flag is kept set throughout system execution, but I
noticed that, fundamentally, it can happen even with kernel code alone,
in the inconsistent state between the point where a bug happens and
entry into the 2nd kernel.

For example, a bug that causes a buffer overrun could overwrite kdump
code so that part of it becomes a wrmsr, while the rest remains intact
enough to boot the 2nd kernel successfully. The probability of this is
very low, but it can actually happen. Of course, we face the same
situation if we put the unsetting code in the machine_shutdown() path,
which is similarly not guaranteed to work well in an inconsistent state.

Unlike kernel state, and like any other device state, it seems to me
that the BSP flag cannot be unset safely given the inconsistent state
that the kdump framework assumes. So it seems to me that disabling BSP
on the 2nd kernel is the last resort.

Thanks.
HATAYAMA, Daisuke

2013-03-11 02:14:00

by Hatayama, Daisuke

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

From: HATAYAMA Daisuke <[email protected]>
Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
Date: Mon, 11 Mar 2013 10:07:21 +0900

> From: "Eric W. Biederman" <[email protected]>
> Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
> Date: Thu, 25 Oct 2012 21:13:25 -0700
>
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>>> From: "H. Peter Anvin" <[email protected]>
>>> Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP
>>> Date: Mon, 22 Oct 2012 17:35:47 -0700
>>>
>>>> On 10/22/2012 02:29 PM, Eric W. Biederman wrote:
> <cut>
>>> Considering these, I'll make a patch to clear BSP flag at appropriate
>>> position in kernel boot-up code. OTOH, according to the discussion, it
>>> was reported that clearing BSP flag affected some BIOSes. To deal with
>>> this, I'll prepare a kernel option to decide whether to clear BSP flag
>>> or not.
>>>
>>> Does anyone have any comments now? Or please comment after I submit a
>>> new patch.
>>
>> I think you are on right track with preparing some patches, and this
>> certainly looks like worth experimenting with.
>>
>> At least for i386 the code need to verify you have a cpu new enough to
>> have an APIC_BASE_MSR, but I don't think that is going to be hard.
>
> Eric, you have probably forgotten this work but I want to restart the
> work to allow multiple CPUs on the 2nd kernel. But on my
> investigation, I have a question about inconsistent states kdump
> framework assumes in the crash path on the 1st kernel.
>
> Now I'm re-investigating how to unset BSP flag on the 1st kernel in a
> safe manner. But then I must discuss possibility of BSP flag being set
> again after the unsetting of BSP. This includes firmware that assumes
> BSP flag is kept set throughout system execution, but I noticed,
> fundamentally, it can happen even only with kernel code in the
> inconsistent state from the point where any bug happens to before
> entering 2nd kernel.
>
> For example, some bug that causes buffer overrun can rewrite kdump
> code so some part of it be wrmsr but any other part is safe enough to
> boot 2nd kernel successfully... Although this is very low, but it must
> actually happen. Of course, we face the same situation if we put
> unsetting code in machine_shutdown() path, which is similarly not
> guaranteed to work well in inconsistent state.
>
> Different from kernel state and similar to any other device states, it
> seems to me that it's impossible to unset BSP flag in a safe manner
> together with inconsistent state kdump framework considers. Then, it
> seems to me that disabling BSP on 2nd kernel is a final resort.

I noticed this was not enough. In the inconsistent state, even an AP can
have the BSP flag set due to some bug. So, in conclusion, we cannot use
multiple cpus on the 2nd kernel under the kdump framework's policy
unless that policy can be changed.

It seems to me that at least the following design policy is needed for
multiple CPUs on the 2nd kernel:

- No firmware, kernel component or module depends on the BSP flag being
  kept set on the original BSP, and none sets the BSP flag of any
  existing CPU again at runtime.

- Bugs that set the BSP flag on any existing CPU, including an AP, are
  excluded from the class of bugs that the kdump framework is expected
  to handle.

If either assumption doesn't hold, we have to accept the risk of the
system entering unspecified behaviour.

Thanks.
HATAYAMA, Daisuke

2013-03-11 04:14:16

by H. Peter Anvin

Subject: Re: [PATCH v1 2/2] x86, apic: Disable BSP if boot cpu is AP

On 03/10/2013 07:13 PM, HATAYAMA Daisuke wrote:
>
> It seems to me that at least there needs to be the following design
> policy for multiple CPUs on the 2nd kernel:
>
> - There's no firmware, kernel components and modules that depend on
> BSP flag being kept set on the original BSP flag and never set BSP
> flag of any of the existing CPUs again at runtime.
>

And that is provably false in the case of firmware.

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.