2005-09-28 10:46:27

by Clemens Koller

Subject: 2.6.13.2 crash on shutdown on SMP machine

Hi!

Last night, right before thinking about going to bed, my newly
installed old SMP machine crashed after a #shutdown -h now
as shown below:

linux-2.6.13.2
old Tyan Tomcat Board, Dual Processor, 2xPentium MMX 200MHz
SMP enabled, preemption enabled.

[...]
Shutdown: hda
Power down.
Badness in send_IPI_mask_bitmask at arch/i386/kernel/smp.c:168
c010fdd5 send_IPI_mask_bitmask+0x65/0x70
c0110236 smp_send_reschedule+0x16/0x20
c01188d6 __migrate_task+0xb6/0xc0
c01189ad migration_thread+0xcd/0x120
c01188e0 migration_thread+0x0/0x120
c012ef43 kthread+0x93/0xc0
c012eeb0 kthread+0x0/0x120
c010104d kernel_thread_helper+0x5/0x18

The board supports neither ACPI nor any automatic power-down.
It should simply stop after "Power down."

I can try the latest git tree or 2.6.14-rc2 tonight and send you
more info (.config) when I am back home.
BTW, what is an IPI? Any ideas? What do you need to track down
this issue?

Thanks,
--
Clemens Koller
_______________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Str. 45/1
81379 Muenchen
Germany

http://www.anagramm.de
Phone: +49-89-741518-50
Fax: +49-89-741518-19


2005-09-28 15:20:55

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Wed, 28 Sep 2005, Clemens Koller wrote:

> Last night, right before thinking about going to bed, my newly
> installed old SMP machine crashed after a #shutdown -h now
> as shown below:
>
> linux-2.6.13.2
> old Tyan Tomcat Board, Dual Processor, 2xPentium MMX 200MHz
> SMP enabled, preemption enabled..
>
> [...]
> Shutdown: hda
> Power down.
> Badness in send_IPI_mask_bitmask at arch/i386/kernel/smp.c:168
> c010fdd5 send_IPI_mask_bitmask+0x65/0x70
> c0110236 smp_send_reschedule+0x16/0x20

We've seen this one before. How reproducible is it for you? Could you also
please test a 2.6.14-rc -mm kernel?

Thanks,
Zwane

2005-09-28 15:23:28

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Wed, 28 Sep 2005, Clemens Koller wrote:

> I can try the latest git or 2.6.14-rc2 tonight and get you
> some more info (.config) when I am back home...
> BTW what is IPI? Any ideas? What do you need to track down

Forgot to answer your other question: an IPI is an Inter-Processor Interrupt,
the method a CPU uses to trigger an interrupt on other processors.
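A minimal userspace model of why the "Badness" line can appear at shutdown. It assumes that the check at arch/i386/kernel/smp.c:168 is a WARN_ON that fires when the IPI's destination mask names a CPU that is no longer in cpu_online_map. During halt/power-off the other CPUs are stopped and dropped from that map, so a late reschedule IPI (smp_send_reschedule() from the migration thread, as in the trace above) aimed at such a CPU would trip it. This is not kernel code: cpu_mask_t, online_map, model_send_ipi() and model_stop_cpu() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model: one bit per CPU, like a small i386 cpumask. */
typedef unsigned long cpu_mask_t;

/* Hypothetical stand-in for cpu_online_map: CPU0 and CPU1 online. */
static cpu_mask_t online_map = 0x3UL;

/* Models the assumed sanity check in send_IPI_mask_bitmask():
 * warn if the destination mask contains an offline CPU.
 * Returns 1 if the "Badness" warning would fire, 0 otherwise. */
static int model_send_ipi(cpu_mask_t dest)
{
    cpu_mask_t offline_targets = dest & ~online_map;

    if (offline_targets) {
        fprintf(stderr, "Badness: IPI to offline CPU mask 0x%lx\n",
                offline_targets);
        return 1;
    }
    return 0; /* a real kernel would now program the local APIC */
}

/* Models a CPU being stopped during halt/power-off: clear its bit. */
static void model_stop_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < (int)(8 * sizeof(cpu_mask_t)));
    online_map &= ~(1UL << cpu);
}
```

With both CPUs online, an IPI to CPU1 passes silently; after model_stop_cpu(1) the same IPI triggers the warning, while IPIs to CPU0 remain fine.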

2005-09-29 12:20:40

by Clemens Koller

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

Hello, Zwane!


Zwane Mwaikambo wrote:
> On Wed, 28 Sep 2005, Clemens Koller wrote:
>
>>Last night, right before thinking about going to bed, my newly
>>installed old SMP machine crashed after a #shutdown -h now
>>as shown below:
>>
>>linux-2.6.13.2
>>old Tyan Tomcat Board, Dual Processor, 2xPentium MMX 200MHz
>>SMP enabled, preemption enabled..
>>
>>[...]
>>Shutdown: hda
>>Power down.
>>Badness in send_IPI_mask_bitmask at arch/i386/kernel/smp.c:168
>>c010fdd5 send_IPI_mask_bitmask+0x65/0x70
>>c0110236 smp_send_reschedule+0x16/0x20
>
> We've seen this one before, how reproducible is it for you? Could you also
> please test a 2.6.14-rc -mm kernel?

It's reproducible. I got the same thing with a slightly differently configured
2.6.13.2-npe (no preemption, no ACPI, no APM), but besides that, I also got other
very strange crashes (page table something thingys?) while using the CRUX pkgmk
tool to build e.g. samba. So I haven't been able to get the system stable enough
for more serious testing yet.
I am about to grab the latest Linus' git tree and try that...

This system ran Linux for a long time without any problems in the past.
But I had to replace the HDD (the old one was broken) and installed
a new (CRUX) system from scratch. I migrated to 2.6.13.2 and switched over
to udev. I ran memtest86 for about half a day, and it didn't show any
problems. Are there good torture tests to check whether a system's hardware is stable?

Thanks,
--
Clemens Koller

2005-10-01 15:56:41

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Thu, 29 Sep 2005, Clemens Koller wrote:

> Zwane Mwaikambo wrote:
>
> It's reproducable... I got the same thing with a slightly different configured
> 2.6.13.2-npe (no preemtion, no acpi, no apm) but beside that, I got other
> very strange crashes (page table something thingys?) as well during a CRUX pkgmk
> tool to build i.e. samba. So I wasn't able to get the system stable enough for
> more serious testing yet.
> I am about to grab the latest linus' git tree and try that...
>
> This system was running for a long time with linux without any problems
> in the past. But I had to change the hdd (old one was broken) and installed
> a new (CRUX) system from scratch... I migrated to 2.6.13.2 and switched over
> to udev... I was running memtest86 for about half a day. It didn't show any
> problems. Are there good torture tests to check if a system's hw is stable?

memtest and repeated multi-job kernel/gcc builds seem to do a very good
job. Let me know how the new kernel goes; I'm going to try to see if any
of my systems can trigger it.

Thanks,
Zwane

2005-10-04 09:26:49

by Clemens Koller

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

Hello, Zwane!

You wrote:
> memtest and repeated multijob kernel/gcc builds seems to do a very good
> job. Let me know how the new kernel goes, i'm going to try and see if any
> of my systems can trigger it.

Okay, I've tried the latest Linus' git tree from last Friday
(2.6.14-rc3-something), but the

c010fdd5 send_IPI_mask_bitmask+0x65/0x70
c0110236 smp_send_reschedule+0x16/0x20

still occurs when I do a "shutdown -h now".
A reboot or "shutdown -r now" seems to work without any errors (or at least
I cannot see any before the graphics card gets re-initialized)!

All that is on a Tyan Tomcat IIID (S1563D) machine with the latest BIOS (4.02).
APM and ACPI are currently disabled...

Unfortunately, I had a severe crash of two hard disks at once on another
machine, which kept me quite busy for the last few days. :-(
Hopefully, by the end of this week, I will be able to dig into this more.
Can you give me a pointer to the code where the kernel's paths for
rebooting and halting the system split?

Thanks,
--
Clemens Koller

2005-10-05 20:32:36

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Tue, 4 Oct 2005, Clemens Koller wrote:

> All that is on a Tyan Tomcat IIID (S1563D) machine, w/ the latest BIOS (4.02)
> APM and ACPI are currently disabled...
>
> Unfortunately, I have had a severe two-at-a-time hard-disk crash on another
> machine, which kept me quite busy the last days. :-(
> Hopefully, by the end of this week, I will be able to debug into that more...
> Can you give me a pointer to some code, where the kernel actually splits
> up in between rebooting and halting the system?

Hello Clemens,
You can start by having a look at sys_reboot in kernel/sys.c
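A rough sketch of the split inside sys_reboot(), assuming the 2.6.13-era structure: the command argument selects restart, halt or power-off, and each eventually ends in an architecture hook (machine_restart(), machine_halt(), machine_power_off()). The "Power down." line in the original report suggests the power-off branch was taken, while "shutdown -r now" goes through the restart hook instead. The three command constants are the real values from <linux/reboot.h>; model_reboot_hook() and ends_in() are hypothetical helpers, and the real syscall's magic-number and capability checks, reboot notifiers and device shutdown are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>   /* NULL */
#include <string.h>   /* strcmp */

/* Command values from <linux/reboot.h>, copied so this compiles stand-alone. */
#define LINUX_REBOOT_CMD_RESTART   0x01234567U
#define LINUX_REBOOT_CMD_HALT      0xCDEF0123U
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDCU

/* Simplified model of the switch in sys_reboot() (kernel/sys.c): returns the
 * name of the arch hook each command ends in, or NULL for every other
 * command (the real syscall handles a few more, e.g. Ctrl-Alt-Del toggling,
 * and rejects unknown ones with -EINVAL). */
static const char *model_reboot_hook(unsigned int cmd)
{
    switch (cmd) {
    case LINUX_REBOOT_CMD_RESTART:   return "machine_restart";
    case LINUX_REBOOT_CMD_HALT:      return "machine_halt";
    case LINUX_REBOOT_CMD_POWER_OFF: return "machine_power_off";
    default:                         return NULL;
    }
}

/* Hypothetical convenience check: does cmd end in the named hook? */
static int ends_in(unsigned int cmd, const char *hook)
{
    const char *h = model_reboot_hook(cmd);
    return h != NULL && strcmp(h, hook) == 0;
}
```

The point of the sketch is only that restart and power-off diverge into different arch functions, which is consistent with one path crashing while the other works.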

Cheers,
Zwane

2005-12-05 11:35:47

by Clemens Koller

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

Hello, Guys, hello, Jeff!

This issue seems to affect more than one machine:

Jeff Collins wrote:
> I experience a panic whenever I shut down a 4 cpu Intel PII Xeon SMP
> system.

What panic do you get?

> Linux sitka 2.6.14.3 #2 SMP Fri Dec 2 09:01:46 PST 2005 i686 unknown
> unknown GNU/Linux
> Base OS: Slackware 10.2
> Kernel: 2.6.14.3 from kernel.org
>
> "shutdown -h now" causes the panic
>
> "shutdown -r now" reboots correctly.

I guess it panics, too, but the reboot still works, so you just don't
see the panic. (?)

> I got the same panic when I substituted the 2.6.13 kernel.

Still the same thing over here. Unfortunately, I am pretty busy with other
work, and the affected system isn't really needed. It's an old
Tyan Tomcat IIID mainboard with two Pentium I MMX 200MHz CPUs.
In theory I could test the latest git snapshots, but right now
it's just not possible. :-(
Let me know if you cannot solve this issue; maybe I can spend some
time by the end of this week gathering more information for debugging.

Good Luck,

Best greets,

--
Clemens Koller

2005-12-05 20:41:46

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Mon, 5 Dec 2005, Clemens Koller wrote:

> Hello, Guys, hello, Jeff!
>
> This issue seems to happen more than once:
>
> Jeff Collins wrote:
> > I experience a panic whenever I shut down a 4 cpu Intel PII Xeon SMP system.
>
> What panic do you get?
>
> > Linux sitka 2.6.14.3 #2 SMP Fri Dec 2 09:01:46 PST 2005 i686 unknown unknown
> > GNU/Linux
> > Base OS: Slackware 10.2
> > Kernel: 2.6.14.3 from kernel.org
> >
> > "shutdown -h now" causes the panic
> >
> > "shutdown -r now" reboots correctly.
>
> I guess it panics, too, but the reboot still works, so you just don't
> see the panic. (?)
>
> > I got the same panic when I substituted the 2.6.13 kernel.
>
> Still the same thing over here. Unfortunately, I am pretty busy with other
> work, and the affected system isn't really needed. It's an old
> Tyan Tomcat IIID Mainboard with two Pentium I MMX 200MHz CPU's.
> Theoretically I would be able to test the latest git snapshots, but currently
> it's just not possible. :-(
> Let me know if you cannot solve this issue - maybe I can spend some
> time to give some more information for debugging by the end of this week.

From what I hear, it's this issue:

http://bugzilla.kernel.org/show_bug.cgi?id=5203

It is being looked at; feel free to chip in, though.

2005-12-06 11:52:59

by Clemens Koller

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

Hello, Zwane!

> From what i hear it's this issue;
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5203

Yes, it seems to be the same issue.
But who is Eric, mentioned in the bugzilla entry? :-]
If it makes sense, I can test his patch before he pushes
it upstream.

Thanks!
--
Clemens Koller

2005-12-06 16:56:10

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Tue, 6 Dec 2005, Clemens Koller wrote:

> Hello, Zwane!
>
> > > From what i hear it's this issue;
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5203
>
> Yes it seems to be the same issue.
> But who is Eric, mentioned in bugzilla? :-]
> If it makes sense I can test his patch while/before he is pushing
> it upstream.

Eric is Eric Biederman. Jeff tested his patch, but there appears to be a
failure case when no power management callback is installed. Could
you please test the following patch?

diff -r 3815424104b0 arch/i386/kernel/reboot.c
--- a/arch/i386/kernel/reboot.c Sat Dec 3 07:09:38 2005
+++ b/arch/i386/kernel/reboot.c Mon Dec 5 00:44:37 2005
@@ -359,6 +359,10 @@
 
 	if (pm_power_off)
 		pm_power_off();
-}
-
-
+
+	local_irq_disable();
+	if (cpu_data[0].hlt_works_ok)
+		while (1) halt();
+	while (1);
+}
+
diff -r 3815424104b0 arch/x86_64/kernel/reboot.c
--- a/arch/x86_64/kernel/reboot.c Sat Dec 3 07:09:38 2005
+++ b/arch/x86_64/kernel/reboot.c Mon Dec 5 00:44:37 2005
@@ -159,5 +159,9 @@
 	}
 	if (pm_power_off)
 		pm_power_off();
+
+	local_irq_disable();
+	while (1)
+		halt();
 }
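What the patch changes, modelled in userspace. Judging by the context lines, the patched function (presumably machine_power_off()) used to simply return when pm_power_off was NULL (no APM/ACPI power-off support) or when the callback returned; the patch instead disables interrupts and halts forever, or busy-waits on i386 CPUs whose cpu_data[0].hlt_works_ok says hlt is unreliable (the x86_64 half always halts). Returning presumably let the scheduler keep running after the other CPUs were stopped. Only pm_power_off and hlt_works_ok come from the patch; model_power_off(), the enum and the two callbacks are hypothetical, and a real successful callback never returns rather than returning 1.

```c
#include <assert.h>
#include <stddef.h>   /* NULL */

/* Where the modelled power-off path ends up. */
enum end_state {
    POWERED_OFF,        /* platform callback cut the power */
    RETURNS_TO_CALLER,  /* pre-patch: the function simply returns */
    HALT_LOOP,          /* patched: IRQs off, hlt forever */
    SPIN_LOOP           /* patched: IRQs off, busy-wait (hlt unreliable) */
};

/* Hypothetical model of the tail of the i386 power-off function.
 * pm_power_off mirrors the optional kernel callback (NULL without APM/ACPI);
 * here it returns nonzero if it "cut the power". */
static enum end_state model_power_off(int (*pm_power_off)(void),
                                      int patched, int hlt_works_ok)
{
    if (pm_power_off != NULL && pm_power_off())
        return POWERED_OFF;
    if (!patched)
        return RETURNS_TO_CALLER;
    /* Patch: local_irq_disable(), then halt, or spin if hlt is unsafe. */
    return hlt_works_ok ? HALT_LOOP : SPIN_LOOP;
}

/* Hypothetical stand-ins for a working and a failing platform callback. */
static int callback_works(void) { return 1; }
static int callback_fails(void) { return 0; }
```

On a board like the Tomcat with APM and ACPI disabled, pm_power_off is NULL, so the unpatched code takes the RETURNS_TO_CALLER path and the patched code ends in HALT_LOOP.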

2005-12-07 19:20:04

by Jeff Collins

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

With this patch, my 4-CPU PII Xeon system reaches the
"Powered Down" state and stops (running Slackware Linux 10.2
with 2.6.14.3 from kernel.org).

At this point, I can power off or hit the reset button
to restart.


Thank you for the patch.


Jeff

2005-12-08 05:31:22

by Zwane Mwaikambo

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

On Wed, 7 Dec 2005, Jeff Collins wrote:

> With this patch, my PII Xeon 4 cpu system reaches the state of "Powered Down"
> and stops. (Running Slackware Linux 10.2
> with 2.6.14.3 from kernel.org)
>
> At this point, I can power off or hit the reset button
> to restart.

Thanks for confirming Jeff.

2005-12-08 14:09:45

by Clemens Koller

Subject: Re: 2.6.13.2 crash on shutdown on SMP machine

Hello, Zwane!

Zwane Mwaikambo wrote:
>>> From what i hear it's this issue;
>>>
>>>http://bugzilla.kernel.org/show_bug.cgi?id=5203
>>
>>Yes it seems to be the same issue.
>>But who is Eric, mentioned in bugzilla? :-]
>>If it makes sense I can test his patch while/before he is pushing
>>it upstream.
>
>
> Eric is 'Eric Biederman', Jeff tested his patch but there appears to be a
> failure case when there is no power management callback installed. Could
> you please test the following patch?

Thanks again for the support. Last night I tested your patch against the
latest git snapshot (2.6.15-rc5-ge4f5c82a), but it doesn't fix the bug(s) for me.

Unpatched, 2.6.15-rc5-ge4f5c82a shows:
[...top cut off, scrolled off the screen...]
try_to_wake_up
__wake_up_common
__wake_up
__queue_work
call_usermodehelper_keys
__call_usermodehelper
kobject_hotplug
class_device_del
...

2.6.15-rc5-ge4f5c82a with your patch crashes differently:
[...top cut off, scrolled off the screen...]
send_IPI_mask_bitmask
smp_send_reschedule
try_to_wake_up
__wake_up_common
__wake_up_sync
do_notify_parent
....

Or, with exactly the same patched kernel on a later attempt, a different crash:
Badness in send_IPI_mask_bitmask at arch/i386/kernel/smp.c:167
send_IPI_mask_bitmask
smp_send_reschedule
__migrate_task
migration_thread
migration_thread
kthread
kthread
kernel_thread_helper

The attachments are three .gif screenshots of the above error messages,
showing some more detail.

Any ideas?

Greets,
--
Clemens Koller


Attachments:
2615rc5-patched1.gif (10.35 kB)
2615rc5-patched2.gif (9.44 kB)
2615rc5-plain.gif (13.58 kB)