http://bugzilla.kernel.org/show_bug.cgi?id=5203
There is a small race during SMP shutdown between the processor issuing
the shutdown and the other processors clearing themselves off the
cpu_online_map as they do this without using the normal cpu offline
synchronisation. To avoid this we should wait for all the other processors
to clear their corresponding bits and then proceed. This way we can safely
make the cpu_online test in smp_send_reschedule, it's safe during normal
runtime as smp_send_reschedule is called with a lock held / preemption
disabled.
Signed-off-by: Zwane Mwaikambo <[email protected]>
arch/i386/kernel/smp.c | 12 +++++++++---
arch/x86_64/kernel/smp.c | 12 +++++++++++-
2 files changed, 20 insertions(+), 4 deletions(-)
diff -r 2c0d5afbfb94 arch/i386/kernel/smp.c
--- a/arch/i386/kernel/smp.c Fri Nov 25 17:42:19 2005
+++ b/arch/i386/kernel/smp.c Sun Nov 27 02:00:59 2005
@@ -164,7 +164,6 @@
unsigned long flags;
local_irq_save(flags);
- WARN_ON(mask & ~cpus_addr(cpu_online_map)[0]);
/*
* Wait for idle.
*/
@@ -476,8 +475,8 @@
*/
void smp_send_reschedule(int cpu)
{
- WARN_ON(cpu_is_offline(cpu));
- send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
+ if (likely(cpu_online(cpu)))
+ send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
}
/*
@@ -585,7 +584,14 @@
void smp_send_stop(void)
{
+ cpumask_t mask;
+
+ mask = cpumask_of_cpu(smp_processor_id());
smp_call_function(stop_this_cpu, NULL, 1, 0);
+
+ /* Wait for other cpus to halt */
+ while (!cpus_equal(mask, cpu_online_map))
+ cpu_relax();
local_irq_disable();
disable_local_APIC();
diff -r 2c0d5afbfb94 arch/x86_64/kernel/smp.c
--- a/arch/x86_64/kernel/smp.c Fri Nov 25 17:42:19 2005
+++ b/arch/x86_64/kernel/smp.c Sun Nov 27 02:00:59 2005
@@ -293,7 +293,8 @@
void smp_send_reschedule(int cpu)
{
- send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
+ if (likely(cpu_online(cpu)))
+ send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
}
/*
@@ -470,6 +471,8 @@
void smp_send_stop(void)
{
int nolock = 0;
+ cpumask_t mask;
+
if (reboot_force)
return;
/* Don't deadlock on the call lock in panic */
@@ -477,7 +480,14 @@
/* ignore locking because we have paniced anyways */
nolock = 1;
}
+
+ mask = cpumask_of_cpu(smp_processor_id());
__smp_call_function(smp_really_stop_cpu, NULL, 0, 0);
+
+ /* wait for the other cpus to halt */
+ while (!cpus_equal(mask, cpu_online_map))
+ cpu_relax();
+
if (!nolock)
spin_unlock(&call_lock);
On Sun, Nov 27, 2005 at 02:05:45AM -0800, Zwane Mwaikambo wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=5203
>
> There is a small race during SMP shutdown between the processor issuing
> the shutdown and the other processors clearing themselves off the
> cpu_online_map as they do this without using the normal cpu offline
> synchronisation. To avoid this we should wait for all the other processors
> to clear their corresponding bits and then proceed. This way we can safely
> make the cpu_online test in smp_send_reschedule, it's safe during normal
> runtime as smp_send_reschedule is called with a lock held / preemption
> disabled.
Looking at the backtrace in the bug - how can sys_reboot call do_exit???
I would say the problem is in whatever causes that. It shouldn't
do that. sys_reboot shouldn't schedule, it's that simple.
Your patch is just papering over that real bug.
Badness in send_IPI_mask_bitmask at arch/i386/kernel/smp.c:168
[<c0103cd7>] dump_stack+0x17/0x20
[<c0112286>] send_IPI_mask_bitmask+0x86/0x90
[<c0112689>] smp_send_reschedule+0x19/0x20
[<c0117e9b>] resched_task+0x6b/0x90
[<c01186cb>] try_to_wake_up+0x2db/0x310
[<c011872a>] wake_up_state+0xa/0x10
[<c012832b>] signal_wake_up+0x2b/0x40
[<c0128be0>] __group_complete_signal+0x210/0x250
[<c0128cb3>] __group_send_sig_info+0x93/0xd0
[<c0129561>] do_notify_parent+0xe1/0x1c0
[<c01206b6>] exit_notify+0x366/0x830
[<c0120e14>] do_exit+0x294/0x3a0
[<c012b5ef>] sys_reboot+0xcf/0x160
[<c0102ded>] syscall_call+0x7/0xb
-Andi
Andi Kleen <[email protected]> writes:
> On Sun, Nov 27, 2005 at 02:05:45AM -0800, Zwane Mwaikambo wrote:
>> http://bugzilla.kernel.org/show_bug.cgi?id=5203
>>
>> There is a small race during SMP shutdown between the processor issuing
>> the shutdown and the other processors clearing themselves off the
>> cpu_online_map as they do this without using the normal cpu offline
>> synchronisation. To avoid this we should wait for all the other processors
>> to clear their corresponding bits and then proceed. This way we can safely
>> make the cpu_online test in smp_send_reschedule, it's safe during normal
>> runtime as smp_send_reschedule is called with a lock held / preemption
>> disabled.
>
> Looking at the backtrace in the bug - how can sys_reboot call do_exit???
> I would say the problem is in whatever causes that. It shouldn't
> do that. sys_reboot shouldn't schedule, it's that simple.
> Your patch is just papering over that real bug.
sys_reboot in the case of halt (after everything else is done)
has directly called do_exit for years.
There are some very subtle interactions there. The one
I always remember (having found it the hard way) is that
interrupts must be left on so ctrl-alt-del still works.
This do_exit may be something similar, although I can't
think of how it could be useful.
Eric
On Sun, 27 Nov 2005, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Sun, Nov 27, 2005 at 02:05:45AM -0800, Zwane Mwaikambo wrote:
> >> http://bugzilla.kernel.org/show_bug.cgi?id=5203
> >>
> >> There is a small race during SMP shutdown between the processor issuing
> >> the shutdown and the other processors clearing themselves off the
> >> cpu_online_map as they do this without using the normal cpu offline
> >> synchronisation. To avoid this we should wait for all the other processors
> >> to clear their corresponding bits and then proceed. This way we can safely
> >> make the cpu_online test in smp_send_reschedule, it's safe during normal
> >> runtime as smp_send_reschedule is called with a lock held / preemption
> >> disabled.
> >
> > Looking at the backtrace in the bug - how can sys_reboot call do_exit???
> > I would say the problem is in whatever causes that. It shouldn't
> > do that. sys_reboot shouldn't schedule, it's that simple.
> > Your patch is just papering over that real bug.
>
> sys_reboot in the case of halt (after everything else is done)
> has directly called do_exit for years.
>
> There are some very subtle interactions there. The one
> I always remember (having found it the hard way) is that
> interrupts must be left on so ctrl-alt-del still works.
>
> This do_exit may be something similar, although I can't
> think of how it could be useful.
I wondered the same thing, perhaps i am papering over the bug, what
exactly is do_exit doing there?
Zwane Mwaikambo <[email protected]> writes:
> On Sun, 27 Nov 2005, Eric W. Biederman wrote:
>
>> Andi Kleen <[email protected]> writes:
>>
>> > On Sun, Nov 27, 2005 at 02:05:45AM -0800, Zwane Mwaikambo wrote:
>> >> http://bugzilla.kernel.org/show_bug.cgi?id=5203
>> >>
>> >> There is a small race during SMP shutdown between the processor issuing
>> >> the shutdown and the other processors clearing themselves off the
>> >> cpu_online_map as they do this without using the normal cpu offline
>> >> synchronisation. To avoid this we should wait for all the other processors
>> >> to clear their corresponding bits and then proceed. This way we can safely
>> >> make the cpu_online test in smp_send_reschedule, it's safe during normal
>> >> runtime as smp_send_reschedule is called with a lock held / preemption
>> >> disabled.
>> >
>> > Looking at the backtrace in the bug - how can sys_reboot call do_exit???
>> > I would say the problem is in whatever causes that. It shouldn't
>> > do that. sys_reboot shouldn't schedule, it's that simple.
>> > Your patch is just papering over that real bug.
>>
>> sys_reboot in the case of halt (after everything else is done)
>> has directly called do_exit for years.
>>
>> There are some very subtle interactions there. The one
>> I always remember (having found it the hard way) is that
>> interrupts must be left on so ctrl-alt-del still works.
>>
>> This do_exit may be something similar, although I can't
>> think of how it could be useful.
>
> I wondered the same thing, perhaps i am papering over the bug, what
> exactly is do_exit doing there?
So I just looked at a couple of kernels I have lying around.
the do_exit has been in halt since 1.2.? In 1.0 the halt
case was not even implemented.
So I think not coping with the do_exit is a problem, as every
other architecture and every other kernel has been able to.
To handle ctrl-alt-del after a halt (or any other action) being able
to schedule sounds important.
As Andi Kleen indicated the problem with do_exit is that it schedules.
I strongly disagree that simply scheduling after halting is a bug.
So why are we calling smp_send_stop from machine_halt?
We don't.
Looking more closely at the bug report the problem here
is that halt -p is called which triggers not a halt but
an attempt to power off.
machine_power_off calls machine_shutdown which calls smp_send_stop.
If pm_power_off is set we should never make it out machine_power_off
to the call of do_exit. So pm_power_off must not be set in this case.
When pm_power_off is not set we expect machine_power_off to devolve
into machine_halt.
So how do we fix this?
Playing too much with smp_send_stop is dangerous because it
must also be safe to be called from panic.
It looks like the obviously correct fix is to only call
machine_shutdown when pm_power_off is defined. Doing
that will make Andi's assumption about not scheduling
true and generally simplify what must be supported.
This turns machine_power_off into a noop like machine_halt
when pm_power_off is not defined.
If the expected behavior is that sys_reboot(LINUX_REBOOT_CMD_POWER_OFF)
becomes sys_reboot(LINUX_REBOOT_CMD_HALT) if pm_power_off is NULL
this is not quite a comprehensive fix as we pass a different parameter
to the reboot notifier and we set system_state to a different value
before calling device_shutdown().
Unfortunately any fix more comprehensive I can think of is not
obviously correct. The core problem is that there is no architecture
independent way to detect if machine_power will become a noop, without
calling it.
Eric
diff --git a/arch/i386/kernel/reboot.c b/arch/i386/kernel/reboot.c
index 350ea66..2a8cb44 100644
--- a/arch/i386/kernel/reboot.c
+++ b/arch/i386/kernel/reboot.c
@@ -12,6 +12,7 @@
#include <linux/efi.h>
#include <linux/dmi.h>
#include <linux/ctype.h>
+#include <linux/pm.h>
#include <asm/uaccess.h>
#include <asm/apic.h>
#include <asm/desc.h>
@@ -347,10 +348,10 @@ void machine_halt(void)
void machine_power_off(void)
{
- machine_shutdown();
-
- if (pm_power_off)
+ if (pm_power_off) {
+ machine_shutdown();
pm_power_off();
+ }
}
diff --git a/arch/x86_64/kernel/reboot.c b/arch/x86_64/kernel/reboot.c
index 75235ed..57117b8 100644
--- a/arch/x86_64/kernel/reboot.c
+++ b/arch/x86_64/kernel/reboot.c
@@ -6,6 +6,7 @@
#include <linux/kernel.h>
#include <linux/ctype.h>
#include <linux/string.h>
+#include <linux/pm.h>
#include <asm/io.h>
#include <asm/kdebug.h>
#include <asm/delay.h>
@@ -154,10 +155,11 @@ void machine_halt(void)
void machine_power_off(void)
{
- if (!reboot_force) {
- machine_shutdown();
- }
- if (pm_power_off)
+ if (pm_power_off) {
+ if (!reboot_force) {
+ machine_shutdown();
+ }
pm_power_off();
+ }
}
> It looks like the obviously correct fix is to only call
> machine_shutdown when pm_power_off is defined. Doing
> that will make Andi's assumption about not scheduling
> true and generally simplify what must be supported.
Looks like a reasonable fix. Thanks.
-Andi
On Sun, 27 Nov 2005, Eric W. Biederman wrote:
> diff --git a/arch/x86_64/kernel/reboot.c b/arch/x86_64/kernel/reboot.c
> index 75235ed..57117b8 100644
> --- a/arch/x86_64/kernel/reboot.c
> +++ b/arch/x86_64/kernel/reboot.c
> @@ -6,6 +6,7 @@
> #include <linux/kernel.h>
> #include <linux/ctype.h>
> #include <linux/string.h>
> +#include <linux/pm.h>
> #include <asm/io.h>
> #include <asm/kdebug.h>
> #include <asm/delay.h>
> @@ -154,10 +155,11 @@ void machine_halt(void)
>
> void machine_power_off(void)
> {
> - if (!reboot_force) {
> - machine_shutdown();
> - }
> - if (pm_power_off)
> + if (pm_power_off) {
> + if (!reboot_force) {
> + machine_shutdown();
> + }
> pm_power_off();
> + }
> }
That makes sense, thanks Eric.
Zwane