2006-11-05 13:07:35

by S.Çağlar Onur

Subject: [Opps] Invalid opcode

Hi;

2.6.18, 2.6.18.1 and 2.6.18.2 still panic randomly in VMware and Microsoft
Virtual PC (this does not seem to be related to the SMP replacement bug
fixed in .2). To confirm that this bug is not specific to our distro, I
also downloaded and tried the latest openSUSE. [1] and [2] are screens
captured by VMware, but the exact same panic occurs in Virtual PC, as
reported to us in [3]. I also CC'ed the recipients of the previous threads.

[1] http://cekirdek.pardus.org.tr/~caglar/2.6.18/panic.png
[2] http://cekirdek.pardus.org.tr/~caglar/2.6.18/panic.png
[3] http://bugs.pardus.org.tr/show_bug.cgi?id=3804

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2006-11-05 16:43:42

by Andi Kleen

Subject: Re: [Opps] Invalid opcode

On Sunday 05 November 2006 14:07, S.Çağlar Onur wrote:
> Hi;
>
> 2.6.18, 2.6.18.1 and 2.6.18.2 still panics randomly (it seems this is not
> related to smpreplacament bug solved in .2)

How do you know this?

And does it still happen in 2.6.19-rc4?

> in VmWare and Microsoft Virtual
> PC and in order to confirm this bug is not our distro specific i downloaded
> and tried latest OpenSuse also [1] and [2] are screens captured by vmware
> but exact same panic occurs in Virtual PC as reported to us in [3].

Always the same BUG()?

> I CC'ed
> previous threads receivers also.
>
> [1] http://cekirdek.pardus.org.tr/~caglar/2.6.18/panic.png
> [2] http://cekirdek.pardus.org.tr/~caglar/2.6.18/panic.png

There is just some rolling Turkish text there.

-Andi

2006-11-05 17:17:55

by S.Çağlar Onur

Subject: Re: [Opps] Invalid opcode

On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
> How do you know this?

Just guessing; if I'm not wrong, the panics occur after the SMP alternatives
switching code has done its job.

> And does it still happen in 2.6.19-rc4?

Will try

> > in VmWare and Microsoft Virtual
> > PC and in order to confirm this bug is not our distro specific i
> > downloaded and tried latest OpenSuse also [1] and [2] are screens
> > captured by vmware but exact same panic occurs in Virtual PC as reported
> > to us in [3].
>
> Always the same BUG()?

Yes, same bug

> There is just some rolling Turkish text there.

Ah, I'm sorry, here are the correct links :(

[1] http://cekirdek.pardus.org.tr/~caglar/2.6.18/panic_on_opensuse.png
[2] http://cekirdek.pardus.org.tr/~caglar/2.6.18/panic_on_pardus.png

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2006-11-05 17:29:22

by Jan Engelhardt

Subject: Re: [Opps] Invalid opcode

>On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
>> How do you know this?
>
>Just guessing, if im not wrong panics occur after SMP alternative switching
>code done its job.

Possibly compiled a kernel with instructions your processor does not
support? Come to think of cmov...



-`J'
--

2006-11-05 17:38:35

by S.Çağlar Onur

Subject: Re: [Opps] Invalid opcode

On Sun, 05 Nov 2006 at 19:25, Jan Engelhardt wrote:
> >On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
> >> How do you know this?
> >
> >Just guessing, if im not wrong panics occur after SMP alternative
> > switching code done its job.
>
> Possibly compiled a kernel with instructions your processor does not
> support? Come to think of cmov...

That machine is an Intel(R) Pentium(R) 4 CPU 3.00GHz, so if I'm not wrong,
cmov is supported on that processor.
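
One way to check this on a running system is to look for "cmov" in the
"flags" line of /proc/cpuinfo. The following is only a minimal sketch: the
helper name has_cpu_flag is hypothetical, and it assumes the caller has
already read the flags line into a string. It matches whole words only, so
a longer flag that merely contains "cmov" does not count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): return 1 if `flag` occurs as a
 * whole, whitespace-delimited word in `flags`, e.g. the "flags" line taken
 * from /proc/cpuinfo; return 0 otherwise. */
int has_cpu_flag(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	if (len == 0)
		return 0;

	while ((p = strstr(p, flag)) != NULL) {
		/* The match must start at the beginning or after whitespace... */
		int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		/* ...and must end at the end of the string or before whitespace. */
		int ends = p[len] == '\0' || p[len] == ' ' ||
			   p[len] == '\t' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p++;	/* keep searching after this partial match */
	}
	return 0;
}
```

Calling has_cpu_flag(line, "cmov") on the flags line would then tell whether
the (possibly virtualized) CPU advertises the instruction.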

--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2006-11-05 18:57:50

by Andi Kleen

Subject: Re: [Opps] Invalid opcode

On Sunday 05 November 2006 18:17, S.Çağlar Onur wrote:
> On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
> > How do you know this?
>
> Just guessing, if im not wrong panics occur after SMP alternative switching
> code done its job.

Can you test with "noreplacement" to make sure?

Anyways I suspect we're just getting back some variant of the old CPU setup race.

Normally CPU booting in Linux follows a special "cpu hotplug" state machine,
but for historical reasons i386 only implements one state of this. At one
point we had a similar bug (but not in the callback on CPU #0, but in
the timer on the newly booted CPU). I don't currently see how it can happen
(but I haven't thought very deeply about it yet).

Probably your timing is just unlucky on those simulators.

Previously we avoided converting i386 cpu bootup fully to the new state
machine because it is very fragile, but it's possible that there
is no other choice than to do it properly. Or maybe another kludge
is possible.

-Andi

2006-11-05 19:51:13

by S.Çağlar Onur

Subject: Re: [Opps] Invalid opcode

Hi;

On Sun, 05 Nov 2006 at 20:57, Andi Kleen wrote:
> Can you test with "noreplacement" to make sure?

Sorry for not mentioning that: I tried noreplacement before reporting, and
it also ends up with the same panic.

> Anyways I suspect we're just getting back some variant of the old CPU setup
> race.
>
> Normally CPU booting in Linux follows a special "cpu hotplug" state
> machine, but for historical reasons i386 only implements one state of
> this. At one point we had a similar bug (but not in the callback on CPU #0,
> but in the timer on newly booted CPU). I don't see currently how it can
> happen (but i haven't thought very deeply about it yet)
>
> Probably your timing is just unlucky on those simulators.

Hmm, Novell Bugzilla seems to have similar issues;
https://bugzilla.novell.com/show_bug.cgi?id=204647 and its duplicates
give the same or similar panic output.

> Previously we avoided converting i386 cpu bootup fully to the new state
> machine because it is very fragile, but it's possible that there
> is no other choice than to do it properly. Or maybe another kludge
> is possible.

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2006-11-05 23:13:13

by Zachary Amsden

Subject: Re: [Opps] Invalid opcode

S.Çağlar Onur wrote:
> Hmm, Novell bugzilla seems has similiar issues,
> https://bugzilla.novell.com/show_bug.cgi?id=204647 and its duplicated ones
> gaves same or similiar panic outputs.
>
>
>> Previously we avoided converting i386 cpu bootup fully to the new state
>> machine because it is very fragile, but it's possible that there
>> is no other choice than to do it properly. Or maybe another kludge
>> is possible.
>>

Yes, this is some kind of softirq race during init.

2006-11-05 23:33:21

by Andi Kleen

Subject: Re: [Opps] Invalid opcode

On Monday 06 November 2006 00:13, Zachary Amsden wrote:
> S.Çağlar Onur wrote:
> > Hmm, Novell bugzilla seems has similiar issues,
> > https://bugzilla.novell.com/show_bug.cgi?id=204647 and its duplicated ones
> > gaves same or similiar panic outputs.
> >
> >
> >> Previously we avoided converting i386 cpu bootup fully to the new state
> >> machine because it is very fragile, but it's possible that there
> >> is no other choice than to do it properly. Or maybe another kludge
> >> is possible.
> >>
>
> Yes, this is some kind of softirq race during init.

Yes, the callbacks run at the wrong time. Unlike modern architectures,
i386 doesn't do "callback, boot CPU, callback, repeat", but "boot all
CPUs, then callback".

But the strange thing is that the BP hits it. Normally the new CPU
hits it because it tries to run a timer interrupt before the callback
has run and initialized all the per-CPU state (this happened often
when dual-core CPUs were first introduced, for some reason).

In this case it looks like the AP managed to queue a tasklet before
the cpu up callback runs.

I suppose we'll either need to convert i386 really over to standard
cpu_up() or add some additional spinlocks to stop the APs with interrupts
off before the callbacks start to run.

-Andi
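
The difference between the two boot orderings described above can be
illustrated with a toy user-space model. This is only a sketch under stated
assumptions, not kernel code: every toy_* name is hypothetical, an
"interrupt" simply queues one unit of work on a CPU that is already booted,
and the "prepare callback" reports a violation if it finds work queued
before it ran (the situation the BUG_ON() checks catch).

```c
#include <stdbool.h>

#define TOY_NCPUS 4	/* arbitrary CPU count for the model */

/* Hypothetical per-CPU state for the model. */
struct toy_cpu_state {
	bool booted;	/* CPU is running and can take interrupts */
	int pending;	/* work queued on this CPU so far */
};

/* An interrupt on a booted CPU queues one unit of work. */
static void toy_interrupt(struct toy_cpu_state *c)
{
	if (c->booted)
		c->pending++;
}

/* The prepare callback expects to run before any work was queued.
 * Returns true if that expectation is violated. */
static bool toy_prepare_callback(const struct toy_cpu_state *c)
{
	return c->pending != 0;
}

/* Ordering 1: callback, boot CPU, repeat for the next CPU. */
int toy_boot_interleaved(void)
{
	struct toy_cpu_state cpus[TOY_NCPUS] = {{0}};
	int violations = 0;

	for (int i = 0; i < TOY_NCPUS; i++) {
		violations += toy_prepare_callback(&cpus[i]);
		cpus[i].booted = true;
		toy_interrupt(&cpus[i]);	/* arrives after the callback */
	}
	return violations;
}

/* Ordering 2: boot all CPUs first, then run the callbacks. */
int toy_boot_all_then_callback(void)
{
	struct toy_cpu_state cpus[TOY_NCPUS] = {{0}};
	int violations = 0;

	for (int i = 0; i < TOY_NCPUS; i++) {
		cpus[i].booted = true;
		toy_interrupt(&cpus[i]);	/* arrives before the callback */
	}
	for (int i = 0; i < TOY_NCPUS; i++)
		violations += toy_prepare_callback(&cpus[i]);
	return violations;
}
```

In this model the interleaved ordering never sees queued work in the
callback, while the boot-all-first ordering sees it on every CPU that took
an interrupt. Real timing is of course far less deterministic, which is
why the panic shows up only occasionally.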

2006-11-12 02:39:57

by S.Çağlar Onur

Subject: Re: [Opps] Invalid opcode

On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
> And does it still happen in 2.6.19-rc4?

Sorry for the delayed test result; I cannot reproduce this panic with 2.6.19-rc5.

--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2006-11-12 03:33:00

by Andi Kleen

Subject: Re: [Opps] Invalid opcode

On Sunday 12 November 2006 03:39, S.Çağlar Onur wrote:
> On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
> > And does it still happen in 2.6.19-rc4?
>
> Sorry for delayed test result, i cannot reproduce this panic with 2.6.19-rc5

It's probably still there, just hopefully it won't be release critical
for .19 then.

At some point it has to be fixed properly by converting i386 to the
new hotplug architecture, I suppose.

-Andi

2006-11-13 04:20:24

by Zachary Amsden

Subject: Re: [Opps] Invalid opcode

S.Çağlar Onur wrote:
> On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
>
>> And does it still happen in 2.6.19-rc4?
>>
>
> Sorry for delayed test result, i cannot reproduce this panic with 2.6.19-rc5
>

I would like to find the exact cause of the problem; I suspect, as does
Andi, that it could just be dormant. You had problems still with
2.6.18.latest, correct? If I can find the cause, I would like to get a
fix into 2.6.18-stable if possible. I think you already sent me the
reproducing kernel config, but I seem to have misplaced it. Could you
resend? I should have some time to look at this early this week.

Thanks,

Zach

2006-11-13 05:31:59

by Andi Kleen

Subject: Re: [Opps] Invalid opcode

On Monday 13 November 2006 05:20, Zachary Amsden wrote:
> S.Çağlar Onur wrote:
> > On Sun, 05 Nov 2006 at 18:40, Andi Kleen wrote:
> >
> >> And does it still happen in 2.6.19-rc4?
> >>
> >
> > Sorry for delayed test result, i cannot reproduce this panic with 2.6.19-rc5
> >
>
> I would like to find the exact cause of the problem;

It's all related to i386's abuse of the cpu hotplug state machine.
Eventually that needs to be fixed properly like it was on x86-64
(I didn't dare touch i386 back then because this code is so fragile on old
hardware)

-Andi

2006-11-15 16:08:38

by S.Çağlar Onur

Subject: Re: [Opps] Invalid opcode

On Mon, 13 Nov 2006 at 06:20, Zachary Amsden wrote:
> I would like to find the exact cause of the problem; I suspect, as does
> Andi, that it could just be dormant. You had problems still with
> 2.6.18.latest, correct? If I can find the cause, I would like to get a
> fix into 2.6.18-stable if possible. I think you already sent me the
> reproducing kernel config, but I seem to have misplaced it. Could you
> resend? I should have some time to look at this early this week.

Sorry for the late reply. [1] is the kernel config I used, and if you want I
can provide a ~5 MB ISO containing that kernel and its 2.6.19-rc5 version to
test.

[1] http://cekirdek.pardus.org.tr/~caglar/2.6.18/config.2.6.18
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


2006-11-30 06:44:52

by Zachary Amsden

Subject: Fix for OpenSUSE kernel bug (was Re: [Opps] Invalid opcode)

It is possible to have tasklets get scheduled before ksoftirqd has had
a chance to spawn on all CPUs. This is totally harmless; after success
during action CPU_UP_PREPARE, action CPU_ONLINE will be called, which
immediately wakes ksoftirqd on the appropriate CPU to process the already
pending tasklets. So there is no danger of having a missed wakeup for
any tasklets that were already pending.

In particular, i386 is affected by this during startup, and the problem is
visible when using a very large initrd; during the time it takes for the
initrd to be decompressed, a timer IRQ can come in and schedule RCU
callbacks. It is also possible that resending of a hardware IRQ via a
softirq triggers the same bug.

Because of different timing conditions, this shows up in all emulators
and virtual machines tested, including Xen, VMware, Virtual PC, and Qemu.
It is also possible to trigger on native hardware with a large enough initrd,
although I don't have a reliable case demonstrating that.

Signed-off-by: Zachary Amsden <[email protected]>

Index: linux-2.6.18/kernel/softirq.c
===================================================================
--- linux-2.6.18.orig/kernel/softirq.c 2006-11-10 14:44:39.000000000 -0800
+++ linux-2.6.18/kernel/softirq.c 2006-11-29 22:19:36.000000000 -0800
@@ -574,8 +574,6 @@ static int __cpuinit cpu_callback(struct

 	switch (action) {
 	case CPU_UP_PREPARE:
-		BUG_ON(per_cpu(tasklet_vec, hotcpu).list);
-		BUG_ON(per_cpu(tasklet_hi_vec, hotcpu).list);
 		p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu);
 		if (IS_ERR(p)) {
 			printk("ksoftirqd for %i failed\n", hotcpu);


Attachments:
fix-softirq-race (1.58 kB)
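
The reasoning in the patch description can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
kernel code: the toy_* names and structures are hypothetical stand-ins for
the per-CPU tasklet list and the per-CPU softirq thread. It shows tasklets
that were queued before the CPU_UP_PREPARE step still being processed once
the CPU_ONLINE step wakes the thread, so nothing is lost when the
"list must be empty" check is dropped.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tasklet: a list node plus a "ran" flag. */
struct toy_tasklet {
	struct toy_tasklet *next;
	int ran;
};

/* Hypothetical stand-in for the per-CPU state touched by the callback. */
struct toy_cpu {
	struct toy_tasklet *list;	/* pending tasklets */
	int thread_created;		/* set by the "prepare" step */
};

/* Queue a tasklet; allowed even before the thread exists. */
static void toy_schedule(struct toy_cpu *cpu, struct toy_tasklet *t)
{
	t->next = cpu->list;
	cpu->list = t;
}

/* "CPU_UP_PREPARE" in the model: create the thread. As with the patch
 * applied, there is no assertion that the pending list is empty. */
static void toy_up_prepare(struct toy_cpu *cpu)
{
	cpu->thread_created = 1;
}

/* "CPU_ONLINE" in the model: wake the thread, which drains the list.
 * Returns the number of tasklets run, or -1 if no thread was created. */
static int toy_online(struct toy_cpu *cpu)
{
	int n = 0;

	if (!cpu->thread_created)
		return -1;

	while (cpu->list != NULL) {
		struct toy_tasklet *t = cpu->list;

		cpu->list = t->next;
		t->next = NULL;
		t->ran = 1;
		n++;
	}
	return n;
}

/* Scenario: two tasklets are queued before the thread is created.
 * Returns 0 if both ran after the online step and nothing is left. */
int toy_early_tasklet_scenario(void)
{
	struct toy_cpu cpu = { NULL, 0 };
	struct toy_tasklet a = { NULL, 0 };
	struct toy_tasklet b = { NULL, 0 };
	int drained;

	toy_schedule(&cpu, &a);		/* early: no thread yet */
	toy_schedule(&cpu, &b);
	toy_up_prepare(&cpu);		/* list is non-empty here */
	drained = toy_online(&cpu);

	return (drained == 2 && a.ran && b.ran && cpu.list == NULL) ? 0 : 1;
}
```

In the model, the only consequence of the early scheduling is that the
pending list is non-empty when the prepare step runs, which is exactly
the condition the removed BUG_ON() lines used to trap.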

2006-11-30 12:29:30

by S.Çağlar Onur

Subject: Re: Fix for OpenSUSE kernel bug (was Re: [Opps] Invalid opcode)

Hi;

On Thu, 30 Nov 2006 at 08:44, Zachary Amsden wrote:
> I'm proposing this as a fix for your bug. Having tasklets scheduled
> before softirqd gets to run might be somewhat backwards, but there is
> nothing I can find wrong about it from a correctness point of view.
> Better to boot the kernel even when compiled with bug checking on, I think.
>
> This bug started becoming apparent in 2.6.18 because of some rework with
> the CPU hotplug code, but in theory, it exists at least all the way back
> to 2.6.10, which is as far as I looked backwards in time.

I can no longer reproduce that oops with 2.6.18.4 + your patch, so at
least it works for me :), thanks.

Cheers
--
S.Çağlar Onur <[email protected]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

