The target audience for these patches is mostly virtualized environments,
where physical CPUs are shared between many guests; on an overcommitted
host this can uncover various race conditions during secondary CPU
bring-up.
Signed-off-by: Igor Mammedov <[email protected]>
When bringing up cpuX1, it can stall in start_secondary
before setting cpu_callin_mask for more than 5 sec. That forces
do_boot_cpu() to give up waiting and take the error return path,
printing messages:
pr_err("CPU%d: Stuck ??\n", cpuX1);
or
pr_err("CPU%d: Not responding.\n", cpuX1);
and native_cpu_up returns early with -EIO. However, the AP may continue
its boot process until it reaches check_tsc_sync_target(), where
it will wait for the boot CPU to run cpu_up...=>check_tsc_sync_source.
That will never happen, since cpu_up has already returned with an error.
Now note that cpuX1 is marked as active in smp_callin
before it gets stuck in check_tsc_sync_target. When another cpuX2
is being onlined, start_secondary on it will call
smp_callin
-> smp_store_cpu_info
-> identify_secondary_cpu
-> mtrr_ap_init
-> set_mtrr_from_inactive_cpu
-> stop_machine_from_inactive_cpu
where it schedules stop_machine work on all ACTIVE cpus
smdata.num_threads = num_active_cpus() + 1;
and waits until they all complete it before continuing. As noted
above, cpuX1 was marked as active but cannot execute any work, since
it has not completed initialization and is stuck in check_tsc_sync_target.
As a result, the system soft-lockups in stop_machine_cpu_stop.
backtrace from reproducer:
PID: 3324 TASK: ffff88007c00ae20 CPU: other cpus COMMAND: "migration/1"
[exception RIP: stop_machine_cpu_stop+131]
...
#0 [ffff88007b4d7de8] cpu_stopper_thread at ffffffff810c66bd
#1 [ffff88007b4d7ee8] kthread at ffffffff8107871e
#2 [ffff88007b4d7f48] kernel_thread_helper at ffffffff8154af24
PID: 0 TASK: ffff88007c029710 CPU: 2 COMMAND: "swapper/2"
[exception RIP: check_tsc_sync_target+33]
...
#0 [ffff88007c025f30] start_secondary at ffffffff81539876
PID: 0 TASK: ffff88007c041710 CPU: 3 COMMAND: "swapper/3"
[exception RIP: stop_machine_cpu_stop+131]
...
#0 [ffff88007c04be50] stop_machine_from_inactive_cpu at ffffffff810c6b2f
#1 [ffff88007c04bee0] mtrr_ap_init at ffffffff8102e963
#2 [ffff88007c04bf10] identify_secondary_cpu at ffffffff81536799
#3 [ffff88007c04bf20] smp_store_cpu_info at ffffffff815396d5
#4 [ffff88007c04bf30] start_secondary at ffffffff81539800
Fix this by not marking the CPU being onlined as active too early.
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/kernel/smpboot.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6e1e406..ae19d90 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -232,8 +232,6 @@ static void __cpuinit smp_callin(void)
set_cpu_sibling_map(raw_smp_processor_id());
wmb();
- notify_cpu_starting(cpuid);
-
/*
* Allow the master to continue.
*/
@@ -268,6 +266,8 @@ notrace static void __cpuinit start_secondary(void *unused)
*/
check_tsc_sync_target();
+ notify_cpu_starting(smp_processor_id());
+
/*
* We need to hold call_lock, so there is no inconsistency
* between the time smp_call_function() determines number of
--
1.7.1
Reproduce conditions:
a multi-CPU guest on an over-committed host running an aggressive
CPU online/offline script.
Real use case:
CPU hotplug in a virtualized environment.
The guest may hang with the following trace:
---- cut ----
[ 54.234569] CPU3: Stuck ??
./offV2.sh: line 11: echo: write error: Input/output error
[ 54.250408] CPU 1 is now offline
[ 54.260887] CPU 2 is now offline
[ 54.261350] SMP alternatives: switching to UP code
./offV2.sh: line 8: echo: write error: Device or resource busy
[ 54.286857] SMP alternatives: switching to SMP code
[ 54.299540] Booting Node 0 Processor 1 APIC 0x1
[ 54.250401] Disabled fast string operations
[ 54.591023] ------------[ cut here ]------------
[ 54.591023] WARNING: at kernel/watchdog.c:241 watchdog_overflow_callback+0x98/0xc0()
[ 54.591023] Hardware name: KVM
[ 54.591023] Watchdog detected hard LOCKUP on cpu 0
[ 54.591023] Modules linked in: sunrpc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc crc32c_intel ghash_clmulni_intel microcode serio_raw virtio_balloon e1000 i2c_piix4 i2c_core floppy [last unloaded: scsi_wait_scan]
[ 54.591023] Pid: 1193, comm: offV2.sh Not tainted 3.4.0-rc4+ #203
[ 54.591023] Call Trace:
[ 54.591023] <NMI> [<ffffffff810566df>] warn_slowpath_common+0x7f/0xc0
[ 54.591023] [<ffffffff810567d6>] warn_slowpath_fmt+0x46/0x50
[ 54.591023] [<ffffffff810dddf8>] watchdog_overflow_callback+0x98/0xc0
[ 54.591023] [<ffffffff8111685c>] __perf_event_overflow+0x9c/0x220
[ 54.591023] [<ffffffff81023c18>] ? x86_perf_event_set_period+0xd8/0x160
[ 54.591023] [<ffffffff81116e14>] perf_event_overflow+0x14/0x20
[ 54.591023] [<ffffffff81029405>] intel_pmu_handle_irq+0x1a5/0x310
[ 54.591023] [<ffffffff815436e1>] perf_event_nmi_handler+0x21/0x30
[ 54.591023] [<ffffffff81542f52>] do_nmi+0x132/0x3d0
[ 54.591023] [<ffffffff815424bc>] end_repeat_nmi+0x1a/0x1e
[ 54.591023] [<ffffffff81053adf>] ? __cleanup_sighand+0x2f/0x40
[ 54.591023] [<ffffffff81539b27>] ? check_tsc_sync_source+0x5c/0x14a
[ 54.591023] [<ffffffff81539b27>] ? check_tsc_sync_source+0x5c/0x14a
[ 54.591023] [<ffffffff81539b27>] ? check_tsc_sync_source+0x5c/0x14a
[ 54.591023] <<EOE>> [<ffffffff81539274>] native_cpu_up+0x156/0x181
[ 54.591023] [<ffffffff8153ad8c>] _cpu_up+0x9c/0x10e
[ 54.591023] [<ffffffff8153ae4a>] cpu_up+0x4c/0x5c
[ 54.591023] [<ffffffff8152cd1c>] store_online+0x9c/0xd0
[ 54.591023] [<ffffffff8136d210>] dev_attr_store+0x20/0x30
[ 54.591023] [<ffffffff811e819f>] sysfs_write_file+0xef/0x170
[ 54.591023] [<ffffffff81179618>] vfs_write+0xc8/0x190
[ 54.591023] [<ffffffff811797e1>] sys_write+0x51/0x90
[ 54.591023] [<ffffffff81549c29>] system_call_fastpath+0x16/0x1b
[ 54.591023] ---[ end trace d5170b988db7c86f ]---
---- cut ----
This happens when boot cpu0 gives up waiting for cpu3 because cpu3 did
not set cpu_callin_mask within the specified timeout, and cpu1 is then
brought online.
1. In this case cpu3 has already incremented start_count => 1 and is waiting at:
while (atomic_read(&start_count) != cpus) // local const int cpus = 2
2. Then cpu1 arrives, increments start_count => 2, and cpu[1,3] continue
to boot as if everything were ok.
3. Then cpu0 arrives at check_tsc_sync_source and locks up on
while (atomic_read(&start_count) != cpus-1) // start_count is 2
Fix the race by making the boot CPU lead the rendezvous process and
forcing a failed CPU to return early from check_tsc_sync_target without
doing the clock sync.
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/kernel/tsc_sync.c | 38 ++++++++++++++++----------------------
1 files changed, 16 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index fc25e60..5f06138 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -25,8 +25,8 @@
* Entry/exit counters that make sure that both CPUs
* run the measurement code at once:
*/
-static __cpuinitdata atomic_t start_count;
-static __cpuinitdata atomic_t stop_count;
+static __cpuinitdata atomic_t start_tsc_sync;
+static __cpuinitdata atomic_t stop_tsc_sync;
/*
* We use a raw spinlock in this exceptional case, because
@@ -123,8 +123,6 @@ static inline unsigned int loop_timeout(int cpu)
*/
void __cpuinit check_tsc_sync_source(int cpu)
{
- int cpus = 2;
-
/*
* No need to check if we already know that the TSC is not
* synchronized:
@@ -140,23 +138,19 @@ void __cpuinit check_tsc_sync_source(int cpu)
}
/*
- * Reset it - in case this is a second bootup:
+ * Signal AP(s) that the source CPU has arrived
*/
- atomic_set(&stop_count, 0);
+ atomic_set(&start_tsc_sync, 1);
/*
* Wait for the target to arrive:
*/
- while (atomic_read(&start_count) != cpus-1)
+ while (atomic_read(&start_tsc_sync))
cpu_relax();
- /*
- * Trigger the target to continue into the measurement too:
- */
- atomic_inc(&start_count);
check_tsc_warp(loop_timeout(cpu));
- while (atomic_read(&stop_count) != cpus-1)
+ while (!atomic_read(&stop_tsc_sync))
cpu_relax();
if (nr_warps) {
@@ -173,7 +167,6 @@ void __cpuinit check_tsc_sync_source(int cpu)
/*
* Reset it - just in case we boot another CPU later:
*/
- atomic_set(&start_count, 0);
nr_warps = 0;
max_warp = 0;
last_tsc = 0;
@@ -181,7 +174,7 @@ void __cpuinit check_tsc_sync_source(int cpu)
/*
* Let the target continue with the bootup:
*/
- atomic_inc(&stop_count);
+ atomic_set(&stop_tsc_sync, 0);
}
/*
@@ -189,29 +182,30 @@ void __cpuinit check_tsc_sync_source(int cpu)
*/
void __cpuinit check_tsc_sync_target(void)
{
- int cpus = 2;
-
if (unsynchronized_tsc() || tsc_clocksource_reliable)
return;
/*
- * Register this CPU's participation and wait for the
- * source CPU to start the measurement:
+ * Wait for the source CPU to start the measurement
*/
- atomic_inc(&start_count);
- while (atomic_read(&start_count) != cpus)
+ while (!atomic_read(&start_tsc_sync))
cpu_relax();
+ if (!cpumask_test_cpu(smp_processor_id(), cpu_initialized_mask))
+ return;
+
+ atomic_set(&start_tsc_sync, 0);
+
check_tsc_warp(loop_timeout(smp_processor_id()));
/*
* Ok, we are done:
*/
- atomic_inc(&stop_count);
+ atomic_set(&stop_tsc_sync, 1);
/*
* Wait for the source CPU to print stuff:
*/
- while (atomic_read(&stop_count) != cpus)
+ while (atomic_read(&stop_tsc_sync))
cpu_relax();
}
--
1.7.1
It will allow booting the CPU later if possible.
v2:
Introduce failed_cpu_boots_limit cmd-line parameter.
At startup, udev might try to online a CPU even if it failed to boot
the first time, and udev will then loop on a CPU that refuses to boot.
So disable the CPU once failed_cpu_boots_limit is reached, to prevent
udev from spinning on onlining a persistently faulty CPU.
A guest kernel on an overcommitted host can use this parameter to set a
limit on the acceptable number of CPU online failures.
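For illustration, the parameter would be passed on the kernel command line like any other boot option (the value 3 here is arbitrary, and the image path is hypothetical):

```
# bootloader entry, e.g. GRUB: allow up to 3 failed bring-up
# attempts per CPU before it is disabled and marked not present
linux /boot/vmlinuz root=/dev/vda1 failed_cpu_boots_limit=3
```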
Signed-off-by: Igor Mammedov <[email protected]>
---
Documentation/kernel-parameters.txt | 6 +++++
arch/x86/kernel/smpboot.c | 36 +++++++++++++++++++++++++++++++++-
2 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c1601e5..6b9bbbc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -825,6 +825,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Format: <interval>,<probability>,<space>,<times>
See also Documentation/fault-injection/.
+ failed_cpu_boots_limit=[SMP,X86]
+ Number of times the kernel may try to boot a not-responding /
+ stuck CPU. When the limit of failed attempts is reached, the
+ kernel will disable the failed CPU and mark it as not present.
+ Default: 0
+
floppy= [HW]
See Documentation/blockdev/floppy.txt.
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index af63cab..2d72a8a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -136,6 +136,28 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
atomic_t init_deasserted;
+static int failed_cpu_boots_limit = 0;
+static int cpu_boot_error_nr[NR_CPUS];
+
+static int parse_failed_cpu_boots(char *str)
+{
+ int err;
+
+ if (!str)
+ return -EINVAL;
+
+ err = kstrtoint(str, 0, &failed_cpu_boots_limit);
+ if (err)
+ return -EINVAL;
+
+ printk(KERN_NOTICE "Limit CPU failed boot attempts: %d\n",
+ failed_cpu_boots_limit);
+
+ return 0;
+}
+__setup("failed_cpu_boots_limit=", parse_failed_cpu_boots);
+
/*
* Report back to the Boot Processor.
* Running on AP.
@@ -810,8 +832,18 @@ do_rest:
/* was set by cpu_init() */
cpumask_clear_cpu(cpu, cpu_initialized_mask);
- set_cpu_present(cpu, false);
- per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
+ /* was set by smp_callin() */
+ cpumask_clear_cpu(cpu, cpu_callin_mask);
+
+ /* disable CPU if it's failed to boot N times in a row */
+ if (cpu_boot_error_nr[cpu]++ > failed_cpu_boots_limit) {
+ set_cpu_present(cpu, false);
+ per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
+ pr_err("CPU%d: repeatedly fails to boot, disabling.\n",
+ cpu);
+ }
+ } else {
+ cpu_boot_error_nr[cpu] = 0;
}
/* mark "stuck" area as not stuck */
--
1.7.1
Instead, go to idle without setting the CPU online, which results in calling play_dead().
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/smpboot.c | 12 +++++++++++-
arch/x86/kernel/tsc_sync.c | 10 ++++++----
3 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c91e8b9..b2491fc 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -58,7 +58,7 @@ extern int tsc_clocksource_reliable;
* all CPUs/cores:
*/
extern void check_tsc_sync_source(int cpu);
-extern void check_tsc_sync_target(void);
+extern bool check_tsc_sync_target(void);
extern int notsc_setup(char *);
extern void tsc_save_sched_clock_state(void);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ae19d90..af63cab 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -264,7 +264,16 @@ notrace static void __cpuinit start_secondary(void *unused)
/*
* Check TSC synchronization with the BP:
*/
- check_tsc_sync_target();
+ if (!check_tsc_sync_target()) {
+ clear_local_APIC();
+
+ /* was set by cpu_init() */
+ cpumask_clear_cpu(smp_processor_id(), cpu_initialized_mask);
+ /* was set by smp_callin() */
+ cpumask_clear_cpu(smp_processor_id(), cpu_callin_mask);
+
+ goto do_idle;
+ }
notify_cpu_starting(smp_processor_id());
@@ -296,6 +305,7 @@ notrace static void __cpuinit start_secondary(void *unused)
x86_cpuinit.setup_percpu_clockev();
+do_idle:
wmb();
cpu_idle();
}
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 1741385..45a593e 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -180,22 +180,22 @@ void __cpuinit check_tsc_sync_source(int cpu)
/*
* Freshly booted CPUs call into this:
*/
-void __cpuinit check_tsc_sync_target(void)
+bool __cpuinit check_tsc_sync_target(void)
{
if (unsynchronized_tsc() || tsc_clocksource_reliable)
- return;
+ return true;
/*
* Wait for the source CPU to start the measurement
*/
while (!atomic_read(&start_tsc_sync)) {
if (!cpumask_test_cpu(smp_processor_id(), cpu_callout_mask))
- return;
+ return false;
cpu_relax();
}
if (!cpumask_test_cpu(smp_processor_id(), cpu_callout_mask))
- return;
+ return false;
atomic_set(&start_tsc_sync, 0);
@@ -211,4 +211,6 @@ void __cpuinit check_tsc_sync_target(void)
*/
while (atomic_read(&stop_tsc_sync))
cpu_relax();
+
+ return true;
}
--
1.7.1
Use cpu_callout_mask to check whether the boot CPU is still waiting on us.
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/kernel/tsc_sync.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 5f06138..1741385 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -188,10 +188,13 @@ void __cpuinit check_tsc_sync_target(void)
/*
* Wait for the source CPU to start the measurement
*/
- while (!atomic_read(&start_tsc_sync))
+ while (!atomic_read(&start_tsc_sync)) {
+ if (!cpumask_test_cpu(smp_processor_id(), cpu_callout_mask))
+ return;
cpu_relax();
+ }
- if (!cpumask_test_cpu(smp_processor_id(), cpu_initialized_mask))
+ if (!cpumask_test_cpu(smp_processor_id(), cpu_callout_mask))
return;
atomic_set(&start_tsc_sync, 0);
--
1.7.1
On Wed, 2012-05-09 at 12:24 +0200, Igor Mammedov wrote:
> Target audience for this patches is mostly virt. environments, where
> physical CPUs are shared beetween many guests and on overcommited
> host it can uncover different race conditions during secondary CPU
> bring-up.
The good news is that you're working on this, the bad news is that all
this code is slated for the scrap heap :-)
Thomas is currently in the process of doing a massive overhaul of the
hotplug code, included in that would be the stuff you're touching.
Every architecture does this hand-shake differently and probably buggy,
all that needs to move into generic code. The only bits needed in the
arch code are the cpu wakeup and initial trampoline, the rest should be
generic.
I'm not quite sure how far along he is, but it would be awesome if you
could help him out somehow.
On 05/09/2012 11:19 AM, Peter Zijlstra wrote:
> On Wed, 2012-05-09 at 12:24 +0200, Igor Mammedov wrote:
>> Target audience for this patches is mostly virt. environments, where
>> physical CPUs are shared beetween many guests and on overcommited
>> host it can uncover different race conditions during secondary CPU
>> bring-up.
>
> The good news is that you're working on this, the bad news is that all
> this code is slated for the scrap heap :-)
>
> Thomas is currently in the process of doing a massive overhaul of the
> hotplug code, included in that would be the stuff you're touching.
If Thomas's rewrite has progressed well and can be completed for 3.5, then
there is no big harm in throwing these patches away. However, if it has not,
it might make sense to apply these patches in the 3.5 devel cycle.
Also, a massive rewrite would be an unlikely backport candidate for the
stable 3.x trees, while some of these patches might be considered as such.
> Every architecture does this hand-shake differently and probably buggy,
> all that needs to move into generic code. The only bits needed in the
> arch code are the cpu wakeup and initial trampoline, the rest should be
> generic.
>
> I'm not quite sure how far along he is, but it would be awesome if you
> could help him out somehow.
Sure, I could lend a hand, at minimum in testing and maybe some rewriting
too, if Thomas can hand off some part so that we do not conflict on this.
PS:
There are still a couple of hangs in 3.4-rc4+:
one looks KVM-host related; it hangs when writing to an APIC register:
#0 native_apic_mem_write (reg=768, v=<value optimized out>) at /builds/imammedo/linux-2.6/arch/x86/include/asm/apic.h:107
#1 0xffffffff81034749 in apic_write (low=50432, id=<value optimized out>) at /builds/imammedo/linux-2.6/arch/x86/include/asm/apic.h:426
#2 native_apic_icr_write (low=50432, id=<value optimized out>) at arch/x86/kernel/apic/apic.c:273
#3 0xffffffff815a78fb in apic_icr_write (apicid=2, cpu=2) at /builds/imammedo/linux-2.6/arch/x86/include/asm/apic.h:436
#4 wakeup_secondary_cpu_via_init (apicid=2, cpu=2) at arch/x86/kernel/smpboot.c:563
#5 do_boot_cpu (apicid=2, cpu=2) at arch/x86/kernel/smpboot.c:782
And another one cannot be helped: RHBZ 816899 comment 7
https://bugzilla.redhat.com/show_bug.cgi?id=816899#c7
--
-----
Igor
* Peter Zijlstra <[email protected]> wrote:
> I'm not quite sure how far along he is, but it would be
> awesome if you could help him out somehow.
bits of Thomas's rework are in latest -tip, the tip:smp/hotplug
branch, also merged into tip:master and soon to linux-next.
Thanks,
Ingo
On Wed, 2012-05-09 at 12:24 +0200, Igor Mammedov wrote:
> When bringing up cpuX1, it could stall in start_secondary
> before setting cpu_callin_mask for more than 5 sec. That forces
> do_boot_cpu() to give up on waiting and go to error return path
> printing messages:
> pr_err("CPU%d: Stuck ??\n", cpuX1);
I am seeing this with the linux-next May 7th build on my laptop, an HP
EliteBook 8440p, during boot. Could this problem not be specific to virt
environments? Is anybody else seeing it?
-- Shuah
On 05/09/2012 05:04 PM, Shuah Khan wrote:
> On Wed, 2012-05-09 at 12:24 +0200, Igor Mammedov wrote:
>> When bringing up cpuX1, it could stall in start_secondary
>> before setting cpu_callin_mask for more than 5 sec. That forces
>> do_boot_cpu() to give up on waiting and go to error return path
>> printing messages:
>> pr_err("CPU%d: Stuck ??\n", cpuX1);
>
> I am seeing this with the linux-next May 7th build on my laptop HP
> EliteBook 8440p during boot. Could this problem be not specific to virt
> envs.? Anybody else seeing it?
I can only guess that on bare metal an SMI or another firmware issue may interfere with AP boot.
Could you test whether this patch set fixes the issue for you?
And do you see the same problem during suspend/resume (assuming it boots without problems)?
--
-----
Igor
On Wed, 2012-05-09 at 17:22 +0200, Igor Mammedov wrote:
> On 05/09/2012 05:04 PM, Shuah Khan wrote:
> > On Wed, 2012-05-09 at 12:24 +0200, Igor Mammedov wrote:
> >> When bringing up cpuX1, it could stall in start_secondary
> >> before setting cpu_callin_mask for more than 5 sec. That forces
> >> do_boot_cpu() to give up on waiting and go to error return path
> >> printing messages:
> >> pr_err("CPU%d: Stuck ??\n", cpuX1);
> >
> > I am seeing this with the linux-next May 7th build on my laptop HP
> > EliteBook 8440p during boot. Could this problem be not specific to virt
> > envs.? Anybody else seeing it?
>
> I could only guess that on bare metal SMI or other firmware issue may interfere with AP boot.
> Could you test if this patch-set fixes issue for you?
> And do you see the same problem during suspend/resume (assuming that it's booted without problem)?
>
I had to abandon the boot and go back to the distro-installed kernel. Yes,
I will test with your patch set and see if the problem goes away. It might
not happen until close to the end of the day today, but I will report back.
-- Shuah
> > I could only guess that on bare metal SMI or other firmware issue may interfere with AP boot.
> > Could you test if this patch-set fixes issue for you?
> > And do you see the same problem during suspend/resume (assuming that it's booted without problem)?
> >
>
> I had to abandon the boot and go back to distro installed kernel - yes I
> will test with your patch set and see if the problem goes away. Might
> not happen until close to end of day today, but will report back.
These patches failed to apply on linux-next May 9th. Couldn't get the
testing done as planned.
-- Shuah
just cloned git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
last commit 0bed306
All 5 patches applied cleanly; maybe we are talking about different linux-next trees?
----- Original Message -----
> From: "Shuah Khan" <[email protected]>
> To: "Igor Mammedov" <[email protected]>
> Cc: [email protected], [email protected], [email protected], [email protected], [email protected],
> [email protected], [email protected], [email protected], "suresh b siddha" <[email protected]>, [email protected], "a p
> zijlstra" <[email protected]>, [email protected], [email protected], [email protected]
> Sent: Thursday, May 10, 2012 5:26:24 PM
> Subject: Re: [PATCH 1/5] Fix soft-lookup in stop machine on secondary cpu bring up
>
>
> > > I could only guess that on bare metal SMI or other firmware issue
> > > may interfere with AP boot.
> > > Could you test if this patch-set fixes issue for you?
> > > And do you see the same problem during suspend/resume (assuming
> > > that it's booted without problem)?
> > >
> >
> > I had to abandon the boot and go back to distro installed kernel -
> > yes I
> > will test with your patch set and see if the problem goes away.
> > Might
> > not happen until close to end of day today, but will report back.
>
> These patches failed to apply on linux-next May 9th. Couldn't get the
> testing done as planned.
>
> -- Shuah
>
>
On Thu, 2012-05-10 at 12:29 -0400, Igor Mammedov wrote:
> just cloned git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> last commit 0bed306
>
> All 5 patches applied cleanly, maybe we are talking about different linux-next trees?
The same git from May 8th. The commit id I have is
407655a15be465ca284a68843e66c6fe7decf4bc. Let me try again.
-- Shuah
On 05/09/2012 08:12 AM, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> I'm not quite sure how far along he is, but it would be
>> awesome if you could help him out somehow.
>
> bits of Thomas's rework are in latest -tip, the tip:smp/hotplug
> branch, also merged into tip:master and soon to linux-next.
Um, I missed a curve here: I know what linux-next is, and I know what
staging is (although I'm a bit confused about how those two relate), but
what's tip? (Do you mean Linus's tree?)
I checked applying-patches.txt and HOWTO in case I missed something, and
both are hideously stale. (For one thing, both still say 2.6
everywhere. I guess this is my problem now; I'll try a fixup pass this
weekend.)
In the meantime: what do you mean by tip?
> Thanks,
>
> Ingo
>
Rob
--
GNU/Linux isn't: Linux=GPLv2, GNU=GPLv3+, they can't share code.
Either it's "mere aggregation", or a license violation. Pick one.
On Thu, 2012-05-10 at 12:31 -0500, Rob Landley wrote:
> In the meantime: what do you mean by tip?
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
This tree contains x86,sched,lockdep,tracing,irq,timers,timekeeping and
other assorted stuff.
On Wed, 9 May 2012, Igor Mammedov wrote:
> When bringing up cpuX1, it could stall in start_secondary
> before setting cpu_callin_mask for more than 5 sec. That forces
> do_boot_cpu() to give up on waiting and go to error return path
> printing messages:
> pr_err("CPU%d: Stuck ??\n", cpuX1);
> or
> pr_err("CPU%d: Not responding.\n", cpuX1);
> and native_cpu_up returns early with -EIO. However AP may continue
> its boot process till it reaches check_tsc_sync_target(), where
> it will wait for boot cpu to run cpu_up...=>check_tsc_sync_source.
> That will never happen since cpu_up have returned with error before.
>
> Now we need to note that cpuX1 is marked as active in smp_callin
> before it stuck in check_tsc_sync_target. And when another cpuX2
> is being onlined, start_secondary on it will call
> smp_callin
> -> smp_store_cpu_info
> -> identify_secondary_cpu
> -> mtrr_ap_init
> -> set_mtrr_from_inactive_cpu
> -> stop_machine_from_inactive_cpu
> where it's going to schedule stop_machine work on all ACTIVE cpus
> smdata.num_threads = num_active_cpus() + 1;
> and wait till they all complete it before continuing. As was noted
> before cpuX1 was marked as active but can't execute any work since
> it's not completed initialization and stuck in check_tsc_sync_target.
> As result system soft lockups in stop_machine_cpu_stop.
>
> backtrace from reproducer:
>
> PID: 3324 TASK: ffff88007c00ae20 CPU: other cpus COMMAND: "migration/1"
> [exception RIP: stop_machine_cpu_stop+131]
> ...
> #0 [ffff88007b4d7de8] cpu_stopper_thread at ffffffff810c66bd
> #1 [ffff88007b4d7ee8] kthread at ffffffff8107871e
> #2 [ffff88007b4d7f48] kernel_thread_helper at ffffffff8154af24
>
> PID: 0 TASK: ffff88007c029710 CPU: 2 COMMAND: "swapper/2"
> [exception RIP: check_tsc_sync_target+33]
> ...
> #0 [ffff88007c025f30] start_secondary at ffffffff81539876
>
> PID: 0 TASK: ffff88007c041710 CPU: 3 COMMAND: "swapper/3"
> [exception RIP: stop_machine_cpu_stop+131]
> ...
> #0 [ffff88007c04be50] stop_machine_from_inactive_cpu at ffffffff810c6b2f
> #1 [ffff88007c04bee0] mtrr_ap_init at ffffffff8102e963
> #2 [ffff88007c04bf10] identify_secondary_cpu at ffffffff81536799
> #3 [ffff88007c04bf20] smp_store_cpu_info at ffffffff815396d5
> #4 [ffff88007c04bf30] start_secondary at ffffffff81539800
>
> Could be fixed by not marking being onlined cpu as active too early.
This explanation is completely useless. What's fixed by what. And is
it fixed or could it be fixed?
This also wants an explanation of why moving the cpu_starting notifier
does not hurt any assumptions of the code which has notifiers
registered for CPU_STARTING. In fact your change can result in
CPU_ONLINE notifier being called _BEFORE_ CPU_STARTING. Do you really
think that's correct?
Aside of that your whole patch series tackles the wrong aspect.
Why the heck do you need extra magic in check_tsc_sync_target() ?
If the booting CPU fails to set the callin map within 5 seconds then
it should not even reach check_tsc_sync_target() at all.
And just for the record, the new CPU can run into the very same
timeout problem, when the boot CPU fails to set the callout mask.
This whole stuff is a complete trainwreck already and I don't want to
see anything like your "fixing the symptoms" hackery near it, really.
This whole stuff needs a proper rewrite and not some more braindamaged
bandaids. And if we apply bandaids for the time being, then certainly
not bandaids like the mess you created.
Thanks,
tglx
On 05/11/2012 01:45 PM, Thomas Gleixner wrote:
> On Wed, 9 May 2012, Igor Mammedov wrote:
>
>> When bringing up cpuX1, it could stall in start_secondary
>> before setting cpu_callin_mask for more than 5 sec. That forces
>> do_boot_cpu() to give up on waiting and go to error return path
>> printing messages:
>> pr_err("CPU%d: Stuck ??\n", cpuX1);
>> or
>> pr_err("CPU%d: Not responding.\n", cpuX1);
>> and native_cpu_up returns early with -EIO. However AP may continue
>> its boot process till it reaches check_tsc_sync_target(), where
>> it will wait for boot cpu to run cpu_up...=>check_tsc_sync_source.
>> That will never happen since cpu_up have returned with error before.
>>
>> Now we need to note that cpuX1 is marked as active in smp_callin
>> before it stuck in check_tsc_sync_target. And when another cpuX2
>> is being onlined, start_secondary on it will call
>> smp_callin
>> -> smp_store_cpu_info
>> -> identify_secondary_cpu
>> -> mtrr_ap_init
>> -> set_mtrr_from_inactive_cpu
>> -> stop_machine_from_inactive_cpu
>> where it's going to schedule stop_machine work on all ACTIVE cpus
>> smdata.num_threads = num_active_cpus() + 1;
>> and wait till they all complete it before continuing. As was noted
>> before cpuX1 was marked as active but can't execute any work since
>> it's not completed initialization and stuck in check_tsc_sync_target.
>> As result system soft lockups in stop_machine_cpu_stop.
>>
>> backtrace from reproducer:
>>
>> PID: 3324 TASK: ffff88007c00ae20 CPU: other cpus COMMAND: "migration/1"
>> [exception RIP: stop_machine_cpu_stop+131]
>> ...
>> #0 [ffff88007b4d7de8] cpu_stopper_thread at ffffffff810c66bd
>> #1 [ffff88007b4d7ee8] kthread at ffffffff8107871e
>> #2 [ffff88007b4d7f48] kernel_thread_helper at ffffffff8154af24
>>
>> PID: 0 TASK: ffff88007c029710 CPU: 2 COMMAND: "swapper/2"
>> [exception RIP: check_tsc_sync_target+33]
>> ...
>> #0 [ffff88007c025f30] start_secondary at ffffffff81539876
>>
>> PID: 0 TASK: ffff88007c041710 CPU: 3 COMMAND: "swapper/3"
>> [exception RIP: stop_machine_cpu_stop+131]
>> ...
>> #0 [ffff88007c04be50] stop_machine_from_inactive_cpu at ffffffff810c6b2f
>> #1 [ffff88007c04bee0] mtrr_ap_init at ffffffff8102e963
>> #2 [ffff88007c04bf10] identify_secondary_cpu at ffffffff81536799
>> #3 [ffff88007c04bf20] smp_store_cpu_info at ffffffff815396d5
>> #4 [ffff88007c04bf30] start_secondary at ffffffff81539800
>>
>> Could be fixed by not marking being onlined cpu as active too early.
>
> This explanation is completely useless. What's fixed by what. And is
> it fixed or could it be fixed?
What's fixed:
the above-mentioned hang in stop_machine_from_inactive_cpu(): even if
a CPU failed to set cpu_callin_mask in time and the boot CPU marked it as
not present and removed it from some maps, with this move the failed CPU
won't set cpu_active_mask before it completes check_tsc_sync_target().
And with patches [2,3,4]/5 it will not set cpu_active_mask at all,
making itself unavailable to the rest of the kernel.
> This also want's an explanation why moving the cpu_starting notifier
> does not hurt any assumptions of the code which has notifiers
> registered for CPU_STARTING.
I've checked the in-kernel users [sched, kvm, pmu] before moving it here.
It looked safe; however, I might have missed something.
>In fact your change can result in
> CPU_ONLINE notifier being called _BEFORE_ CPU_STARTING. Do you really
> think that's correct?
That certainly is not correct; it asks for a barrier after
cpu_starting and before setting cpu_online_mask.
> Aside of that your whole patch series tackles the wrong aspect.
The patch series tries to prevent a CPU that failed to boot from wreaking
havoc on the running kernel. How wrong is that?
What should be fixed instead?
> Why the heck do you need extra magic in check_tsc_sync_target() ?
Because it's plainly racy. Patch 2/5 describes and fixes a race condition
in check_tsc_sync_target().
> If the booting CPU fails to set the callin map within 5 seconds then
> it should not even reach check_tsc_sync_target() at all.
Why shouldn't it reach check_tsc_sync_target() at all? There is nothing
that prevents it or guarantees such behavior.
For example, it happens when the kernel is running inside a guest on an
overloaded host. And it seems to happen on bare metal as well:
https://lkml.org/lkml/2012/5/9/336
> And just for the record, the new CPU can run into the very same
> timeout problem, when the boot CPU fails to set the callout mask.
Yes, it can.
I've tried to fix only what was reproducible on my test system, so I
haven't touched this.
That might result in panic in smp_callin():
panic("%s: CPU%d started up but did not get a callout!\n"
> This whole stuff is a complete trainwreck already and I don't want to
> see anything like your "fixing the symptoms" hackery near it, really.
Fixing a slow-to-respond cpu might not be an option, so we need to
gracefully abort a failed cpu_online operation instead of hanging in
stop_machine or crashing in the scheduler [https://lkml.org/lkml/2012/5/9/137].
>
> This whole stuff needs a proper rewrite and not some more braindamaged
> bandaids. And if we apply bandaids for the time being, then certainly
> not bandaids like the mess you created.
A rewrite will need to deal with a cpu that failed to boot in time as well.
So if a rewrite is not near completion, then maybe bandaids would be
needed for the time being.
Any ideas/suggestions for "right bandaids" instead of braindamaged ones?
> Thanks,
>
> tglx
--
-----
Thanks,
Igor
On Fri, 11 May 2012, Igor Mammedov wrote:
> On 05/11/2012 01:45 PM, Thomas Gleixner wrote:
> > In fact your change can result in
> > CPU_ONLINE notifier being called _BEFORE_ CPU_STARTING. Do you really
> > think that's correct?
>
> That certainly is not correct; it asks for a barrier after
> cpu_starting and before setting cpu_online_mask.
Ah, a barrier will solve that. Interesting approach.
> > Aside of that your whole patch series tackles the wrong aspect.
>
> The patch series tries to prevent a cpu that failed to boot from
> wreaking havoc on the running kernel. How wrong is that?
Emphasis on "tries". And as I explained before, you are fixing stuff in
the wrong place.
> What should be fixed instead?
If that timeout happens, then prevent that the following code is
reached, perhaps?
> > Why the heck do you need extra magic in check_tsc_sync_target() ?
> Because it's plainly racy. Patch 2/5 describes and fixes the race
> condition in check_tsc_sync_target().
Crap. You do not understand at all. If the code which is before
check_tsc_sync_target() is failing then check_tsc_sync_target() should
not be called at all. It's that simple. Putting weird ass checks into
that code is simply the wrong solution.
> > If the booting CPU fails to set the callin map within 5 seconds then
> > it should not even reach check_tsc_sync_target() at all.
>
> Why shouldn't it reach check_tsc_sync_target() at all? There is nothing
> that prevents it or guarantees such behavior.
That's the whole fcking point. The code is missing which prevents that
and instead of hacking that crap into check_tsc_sync_target() we need
to add that what's missing.
> > And just for the record, the new CPU can run into the very same
> > timeout problem, when the boot CPU fails to set the callout mask.
>
> Yes, it can.
> I've tried to fix only what was reproducible on my test system, so I
> haven't touched this.
> That might result in panic in smp_callin():
> panic("%s: CPU%d started up but did not get a callout!\n"
I know that. And it's fucking wrong and I don't care whether you are
only fixing what's reproducible on your test system. If we touch that
code for that purpose then we better touch it so it's correct in all
aspects and not in some "this fixes my esoteric problem" approach.
> > This whole stuff is a complete trainwreck already and I don't want to
> > see anything like your "fixing the symptoms" hackery near it, really.
>
> Fixing a slow-to-respond cpu might not be an option, so we need to
> gracefully abort a failed cpu_online operation instead of hanging in
> stop_machine or crashing in the scheduler [https://lkml.org/lkml/2012/5/9/137].
I'm tired of your symptom links. You are simply not understanding the
scope of the problem and you just try to fix it so your testing
failures go away.
> > This whole stuff needs a proper rewrite and not some more braindamaged
> > bandaids. And if we apply bandaids for the time being, then certainly
> > not bandaids like the mess you created.
>
> A rewrite will need to deal with a cpu that failed to boot in time as well.
Really? Thanks for the hint, didn't know that.
> So if a rewrite is not near completion, then maybe bandaids would be
> needed for the time being.
As I said, I don't object against proper bandaids, but I object
against the hackery you provided.
> Any ideas/suggestions for "right bandaids" instead of braindamaged ones?
Maybe start to think about my answers instead of blindly repeating
your observations of symptoms and praising your symptom cures.
Even in bandaid mode we can fix that behaviour by putting proper
synchronization into the right points, instead of hacking weird
failure handling into code which should never be affected by that.
Thanks,
tglx
Thomas,
Is this what you meant?
Master cpu may time out before cpu_callin_mask is set and decide to
abort cpu boot, but the cpu being onlined will continue to boot, set
cpu_active_mask and wait in check_tsc_sync_target() for the master cpu
to arrive, which will never happen because the master cpu aborted the
boot process. A following attempt to online the next cpu will hang in
stop_machine because it will wait on completion of stop_work on all
cpus from cpu_active_mask, and that will never happen because the first
failed cpu spins in check_tsc_sync_target().
Introduce cpu_may_complete_boot_mask, which will be set by the master
cpu if it goes via the normal boot path and decides to continue cpu
bring-up. The cpu being onlined will continue to boot only if the
master cpu confirms, via cpu_may_complete_boot_mask, its intention not
to abort the bring-up. Otherwise the cpu being onlined will gracefully die.
In addition, if the cpu being onlined times out waiting on
cpu_callout_mask, it should not panic but rather die.
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/include/asm/cpumask.h | 1 +
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/smpboot.c | 34 +++++++++++++++++++++++++++++++---
3 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/cpumask.h b/arch/x86/include/asm/cpumask.h
index 61c852f..eacd269 100644
--- a/arch/x86/include/asm/cpumask.h
+++ b/arch/x86/include/asm/cpumask.h
@@ -7,6 +7,7 @@ extern cpumask_var_t cpu_callin_mask;
extern cpumask_var_t cpu_callout_mask;
extern cpumask_var_t cpu_initialized_mask;
extern cpumask_var_t cpu_sibling_setup_mask;
+extern cpumask_var_t cpu_may_complete_boot_mask;
extern void setup_cpu_local_masks(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cf79302..50e91cb 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -48,6 +48,7 @@
cpumask_var_t cpu_initialized_mask;
cpumask_var_t cpu_callout_mask;
cpumask_var_t cpu_callin_mask;
+cpumask_var_t cpu_may_complete_boot_mask;
/* representing cpus for which sibling maps can be computed */
cpumask_var_t cpu_sibling_setup_mask;
@@ -59,6 +60,7 @@ void __init setup_cpu_local_masks(void)
alloc_bootmem_cpumask_var(&cpu_callin_mask);
alloc_bootmem_cpumask_var(&cpu_callout_mask);
alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
+ alloc_bootmem_cpumask_var(&cpu_may_complete_boot_mask);
}
static void __cpuinit default_init(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6e1e406..b33149f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -187,8 +187,9 @@ static void __cpuinit smp_callin(void)
}
if (!time_before(jiffies, timeout)) {
- panic("%s: CPU%d started up but did not get a callout!\n",
+ pr_debug("%s: CPU%d started up but did not get a callout!\n",
__func__, cpuid);
+ goto die;
}
/*
@@ -232,12 +233,36 @@ static void __cpuinit smp_callin(void)
set_cpu_sibling_map(raw_smp_processor_id());
wmb();
- notify_cpu_starting(cpuid);
-
/*
* Allow the master to continue.
*/
cpumask_set_cpu(cpuid, cpu_callin_mask);
+
+ /*
+ * Wait for master to continue.
+ */
+ for (timeout = 0; timeout < 50000; timeout++) {
+ if (cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
+ break;
+
+ if (!cpumask_test_cpu(cpuid, cpu_callout_mask))
+ break;
+
+ udelay(100);
+ }
+
+ if (!cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
+ goto die;
+
+ notify_cpu_starting(cpuid);
+ return;
+
+die:
+ /* was set by cpu_init() */
+ cpumask_clear_cpu(smp_processor_id(), cpu_initialized_mask);
+ cpumask_clear_cpu(smp_processor_id(), cpu_callin_mask);
+ clear_local_APIC();
+ play_dead();
}
/*
@@ -774,6 +799,8 @@ do_rest:
}
if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
+ /* Signal AP that it may continue to boot */
+ cpumask_set_cpu(cpu, cpu_may_complete_boot_mask);
print_cpu_msr(&cpu_data(cpu));
pr_debug("CPU%d: has booted.\n", cpu);
} else {
@@ -1250,6 +1277,7 @@ static void __ref remove_cpu_from_maps(int cpu)
cpumask_clear_cpu(cpu, cpu_callin_mask);
/* was set by cpu_init() */
cpumask_clear_cpu(cpu, cpu_initialized_mask);
+ cpumask_clear_cpu(cpu, cpu_may_complete_boot_mask);
numa_remove_cpu(cpu);
}
--
1.7.1
On Sat, 2012-05-12 at 21:32 +0200, Igor Mammedov wrote:
> @@ -232,12 +233,36 @@ static void __cpuinit smp_callin(void)
> set_cpu_sibling_map(raw_smp_processor_id());
> wmb();
>
> - notify_cpu_starting(cpuid);
> -
> /*
> * Allow the master to continue.
> */
> cpumask_set_cpu(cpuid, cpu_callin_mask);
> +
> + /*
> + * Wait for master to continue.
> + */
> + for (timeout = 0; timeout < 50000; timeout++) {
> + if (cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
> + break;
> +
> + if (!cpumask_test_cpu(cpuid, cpu_callout_mask))
> + break;
> +
> + udelay(100);
> + }
> +
> + if (!cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
> + goto die;
> +
> + notify_cpu_starting(cpuid);
Its absolutely broken to call CPU_STARTING after the master cpu is told
to continue. Once it returns from cpu_up() it assumes the secondary is
completely initialized and ready to run.
> + return;
> +
> +die:
You've forgotten to clean up the bits set by set_cpu_sibling_map().
> + /* was set by cpu_init() */
> + cpumask_clear_cpu(smp_processor_id(), cpu_initialized_mask);
> + cpumask_clear_cpu(smp_processor_id(), cpu_callin_mask);
> + clear_local_APIC();
> + play_dead();
> }
>
> /*
> @@ -774,6 +799,8 @@ do_rest:
> }
>
> if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
> + /* Signal AP that it may continue to boot */
> + cpumask_set_cpu(cpu, cpu_may_complete_boot_mask);
> print_cpu_msr(&cpu_data(cpu));
> pr_debug("CPU%d: has booted.\n", cpu);
> } else {
> @@ -1250,6 +1277,7 @@ static void __ref remove_cpu_from_maps(int cpu)
> cpumask_clear_cpu(cpu, cpu_callin_mask);
> /* was set by cpu_init() */
> cpumask_clear_cpu(cpu, cpu_initialized_mask);
> + cpumask_clear_cpu(cpu, cpu_may_complete_boot_mask);
> numa_remove_cpu(cpu);
> }
>
----- Original Message -----
> From: "Peter Zijlstra" <[email protected]>
> To: "Igor Mammedov" <[email protected]>
> On Sat, 2012-05-12 at 21:32 +0200, Igor Mammedov wrote:
>
> > @@ -232,12 +233,36 @@ static void __cpuinit smp_callin(void)
> > set_cpu_sibling_map(raw_smp_processor_id());
> > wmb();
> >
> > - notify_cpu_starting(cpuid);
> > -
> > /*
> > * Allow the master to continue.
> > */
> > cpumask_set_cpu(cpuid, cpu_callin_mask);
> > +
> > + /*
> > + * Wait for master to continue.
> > + */
> > + for (timeout = 0; timeout < 50000; timeout++) {
> > + if (cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
> > + break;
> > +
> > + if (!cpumask_test_cpu(cpuid, cpu_callout_mask))
> > + break;
> > +
> > + udelay(100);
> > + }
> > +
> > + if (!cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
> > + goto die;
> > +
> > + notify_cpu_starting(cpuid);
>
> Its absolutely broken to call CPU_STARTING after the master cpu is told
> to continue. Once it returns from cpu_up() it assumes the secondary is
> completely initialized and ready to run.
Wouldn't the master cpu stop in native_cpu_up() and wait till the AP sets cpu_online_mask?
local_irq_save(flags);
check_tsc_sync_source(cpu);
local_irq_restore(flags);
while (!cpu_online(cpu)) {
cpu_relax();
touch_nmi_watchdog();
}
So it shouldn't do anything till the AP is online.
>
> > + return;
> > +
> > +die:
>
> You've forgotten to clean up the bits set by set_cpu_sibling_map().
Thanks, I'll fix it.
>
> > + /* was set by cpu_init() */
> > + cpumask_clear_cpu(smp_processor_id(), cpu_initialized_mask);
> > + cpumask_clear_cpu(smp_processor_id(), cpu_callin_mask);
> > + clear_local_APIC();
> > + play_dead();
> > }
> >
> > /*
> > @@ -774,6 +799,8 @@ do_rest:
> > }
> >
> > if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
> > + /* Signal AP that it may continue to boot */
> > + cpumask_set_cpu(cpu, cpu_may_complete_boot_mask);
> > print_cpu_msr(&cpu_data(cpu));
> > pr_debug("CPU%d: has booted.\n", cpu);
> > } else {
> > @@ -1250,6 +1277,7 @@ static void __ref remove_cpu_from_maps(int cpu)
> > cpumask_clear_cpu(cpu, cpu_callin_mask);
> > /* was set by cpu_init() */
> > cpumask_clear_cpu(cpu, cpu_initialized_mask);
> > + cpumask_clear_cpu(cpu, cpu_may_complete_boot_mask);
> > numa_remove_cpu(cpu);
> > }
> >
>
>
Forgot to cc v1 to tglx, so reposting fixed v2.
Thomas,
is the patch below what you meant?
Master cpu may time out before cpu_callin_mask is set and decide to
abort cpu boot, but the cpu being onlined will continue to boot, set
cpu_active_mask and wait in check_tsc_sync_target() for the master cpu
to arrive, which will never happen because the master cpu aborted the
boot process. A following attempt to online the next cpu will hang in
stop_machine because it will wait on completion of stop_work on all
cpus from cpu_active_mask, and that will never happen because the first
failed cpu spins in check_tsc_sync_target().
Introduce cpu_may_complete_boot_mask, which will be set by the master
cpu if it goes via the normal boot path and decides to continue cpu
bring-up. The cpu being onlined will continue to boot only if the
master cpu confirms, via cpu_may_complete_boot_mask, its intention not
to abort the bring-up. Otherwise the cpu being onlined will gracefully die.
The reason for setting cpu_callin_mask earlier is that one of the
CPU_STARTING notifiers sets cpu_active_mask before it is known for sure
that the cpu may continue to boot. And that may lead to soft lockups in
stop_machine when the next cpu is booted but the failed one hasn't
completed its cpu_*_mask cleanups yet.
It's ok for the cpu being onlined to set cpu_callin_mask before calling
the CPU_STARTING notifiers because the master cpu may first wait on the
booting cpu in check_tsc_sync_source(), and after that it waits in
native_cpu_up() till the booting cpu comes online; only when the booting
cpu sets cpu_online_mask will the master exit native_cpu_up() and then
call the CPU_ONLINE notifiers. So the call sequence looks like this:
CPU1 CPU2
wait till CPU2 sets
cpu_callin_mask
-------------------------------------------------
set cpu_callin_mask
wait till CPU1 sets
cpu_may_complete_boot_mask
-------------------------------------------------
got ack from CPU2 via
cpu_callin_mask
set cpu_may_complete_boot_mask
exit do_boot_cpu and return into
native_cpu_up()
in native_cpu_up() CPU1 may spin first in
check_tsc_sync_source()
and then wait till CPU2 will set
cpu_online_mask:
while (!cpu_online(cpu)) {
cpu_relax();
touch_nmi_watchdog();
}
-------------------------------------------------
got ack from CPU1 via
cpu_may_complete_boot_mask
call CPU_STARTING notifiers
...
call check_tsc_sync_target()
set cpu_online_mask
-------------------------------------------------
got ack from CPU2 that it
is online, return from native_cpu_up()
call CPU_ONLINE notifiers
----
In addition, if the cpu being onlined times out waiting on
cpu_callout_mask, it should not panic but rather die.
v2:
- added missing remove_siblinginfo() on 'die' error path.
- added explanation why it's ok to set cpu_callin_mask before calling
CPU_STARTING notifiers
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/include/asm/cpumask.h | 1 +
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/smpboot.c | 39 ++++++++++++++++++++++++++++++++++++---
3 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/cpumask.h b/arch/x86/include/asm/cpumask.h
index 61c852f..eacd269 100644
--- a/arch/x86/include/asm/cpumask.h
+++ b/arch/x86/include/asm/cpumask.h
@@ -7,6 +7,7 @@ extern cpumask_var_t cpu_callin_mask;
extern cpumask_var_t cpu_callout_mask;
extern cpumask_var_t cpu_initialized_mask;
extern cpumask_var_t cpu_sibling_setup_mask;
+extern cpumask_var_t cpu_may_complete_boot_mask;
extern void setup_cpu_local_masks(void);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cf79302..50e91cb 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -48,6 +48,7 @@
cpumask_var_t cpu_initialized_mask;
cpumask_var_t cpu_callout_mask;
cpumask_var_t cpu_callin_mask;
+cpumask_var_t cpu_may_complete_boot_mask;
/* representing cpus for which sibling maps can be computed */
cpumask_var_t cpu_sibling_setup_mask;
@@ -59,6 +60,7 @@ void __init setup_cpu_local_masks(void)
alloc_bootmem_cpumask_var(&cpu_callin_mask);
alloc_bootmem_cpumask_var(&cpu_callout_mask);
alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
+ alloc_bootmem_cpumask_var(&cpu_may_complete_boot_mask);
}
static void __cpuinit default_init(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6e1e406..b419e2e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -136,6 +136,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
atomic_t init_deasserted;
+#ifdef CONFIG_HOTPLUG_CPU
+static void remove_siblinginfo(int cpu);
+#endif
+
/*
* Report back to the Boot Processor.
* Running on AP.
@@ -187,8 +191,9 @@ static void __cpuinit smp_callin(void)
}
if (!time_before(jiffies, timeout)) {
- panic("%s: CPU%d started up but did not get a callout!\n",
+ pr_debug("%s: CPU%d started up but did not get a callout!\n",
__func__, cpuid);
+ goto die;
}
/*
@@ -232,12 +237,37 @@ static void __cpuinit smp_callin(void)
set_cpu_sibling_map(raw_smp_processor_id());
wmb();
- notify_cpu_starting(cpuid);
-
/*
* Allow the master to continue.
*/
cpumask_set_cpu(cpuid, cpu_callin_mask);
+
+ /*
+ * Wait for master to continue.
+ */
+ for (timeout = 0; timeout < 50000; timeout++) {
+ if (cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
+ break;
+
+ if (!cpumask_test_cpu(cpuid, cpu_callout_mask))
+ break;
+
+ udelay(100);
+ }
+
+ if (!cpumask_test_cpu(cpuid, cpu_may_complete_boot_mask))
+ goto die;
+
+ notify_cpu_starting(cpuid);
+ return;
+
+die:
+ /* was set by cpu_init() */
+ cpumask_clear_cpu(smp_processor_id(), cpu_initialized_mask);
+ remove_siblinginfo(cpuid);
+ cpumask_clear_cpu(smp_processor_id(), cpu_callin_mask);
+ clear_local_APIC();
+ play_dead();
}
/*
@@ -774,6 +804,8 @@ do_rest:
}
if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
+ /* Signal AP that it may continue to boot */
+ cpumask_set_cpu(cpu, cpu_may_complete_boot_mask);
print_cpu_msr(&cpu_data(cpu));
pr_debug("CPU%d: has booted.\n", cpu);
} else {
@@ -1250,6 +1282,7 @@ static void __ref remove_cpu_from_maps(int cpu)
cpumask_clear_cpu(cpu, cpu_callin_mask);
/* was set by cpu_init() */
cpumask_clear_cpu(cpu, cpu_initialized_mask);
+ cpumask_clear_cpu(cpu, cpu_may_complete_boot_mask);
numa_remove_cpu(cpu);
}
--
1.7.1
ping for reviewers.
Please review patch.
On 05/14/2012 01:09 PM, Igor Mammedov wrote:
> Forgot to cc v1 to tglx, so reposting fixed v2.
>
> Thomas,
> is patch below what you've meant?
>
> [full v2 patch quoted, snipped]
--
-----
Igor
On 05/24/2012 10:41 AM, Igor Mammedov wrote:
> ping for reviewers.
>
> Please review patch.
I can't hugely comment on the guts of what the patch is doing, but:
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -136,6 +136,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
>>
>> atomic_t init_deasserted;
>>
>> +#ifdef CONFIG_HOTPLUG_CPU
>> +static void remove_siblinginfo(int cpu);
>> +#endif
>> +
#ifdefs should almost never be in C code, they should be in header
files. You can stub out functions with empty inline versions. For a
random example, see kernel/smpboot.h
Rob
--
GNU/Linux isn't: Linux=GPLv2, GNU=GPLv3+, they can't share code.
Either it's "mere aggregation", or a license violation. Pick one.
On 05/25/2012 08:11 PM, Rob Landley wrote:
> On 05/24/2012 10:41 AM, Igor Mammedov wrote:
>> ping for reviewers.
>>
>> Please review patch.
>
> I can't hugely comment on the guts of what the patch is doing, but:
>
>>> --- a/arch/x86/kernel/smpboot.c
>>> +++ b/arch/x86/kernel/smpboot.c
>>> @@ -136,6 +136,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
>>>
>>> atomic_t init_deasserted;
>>>
>>> +#ifdef CONFIG_HOTPLUG_CPU
>>> +static void remove_siblinginfo(int cpu);
>>> +#endif
>>> +
>
> #ifdefs should almost never be in C code, they should be in header
> files. You can stub out functions with empty inline versions. For a
> random example, see kernel/smpboot.h
If it were not just a local function, then I'd do so. But since this
function is used only in this file and defined under the same #ifdef
after smp_callin(), it could stay there and not pollute headers
with a non-public function declaration.
I made remove_siblinginfo() a forward declaration before smp_callin()
because I didn't have much justification to move a ~20-line function
to be available a bit earlier than it is now.
However, thanks for mentioning stubs. I should add a stub for building
the kernel without CONFIG_HOTPLUG_CPU, to prevent build breakage.
I'll fix and repost the patch.
--
-----
Igor