2006-12-21 08:04:11

by Zhang, Yanmin

Subject: RE: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine

>>-----Original Message-----
>>From: Andrew Morton [mailto:[email protected]]
>>Sent: 2006-12-20 18:38
>>To: Chuck Ebbert
>>Cc: Yinghai Lu; [email protected]; [email protected]; [email protected]; [email protected]; [email protected];
>>Eric W. Biederman; Zhang, Yanmin
>>Subject: Re: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine
>>
>>On Wed, 20 Dec 2006 04:59:19 -0500
>>Chuck Ebbert <[email protected]> wrote:
>>
>>> > On 12/19/06, Chuck Ebbert <[email protected]> wrote:
>>> > > So an external interrupt occurred, the system tried to use interrupt
>>> > > descriptor #39 decimal (irq 7), but the descriptor was invalid.
>>> >
>>> > but the irq is disabled at that time.
>>> >
>>> > can you use attached diff to verify if the irq is enable somehow?
>>>
>>> But it seems interrupts are on--look at the flags:
>>>
>>> RSP: 0018:ffffffff803cdf68 EFLAGS: 00010246
>>>
>>
>>down_write()->__down_write()->__down_write_nested()->spin_unlock_irq()->dead
>>
>>Could someone please test this?
I couldn't reproduce it on my EM64T machine. I instrumented function start_kernel and
didn't find irq was enabled before calling init_IRQ. It'll be better if the reporter could
instrument function start_kernel to capture which function enables irq.


2006-12-21 20:17:24

by Andrew Morton

Subject: Re: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine

On Thu, 21 Dec 2006 20:52:40 +0100
Ard -kwaak- van Breemen <[email protected]> wrote:

> Hello,
>
> On Thu, Dec 21, 2006 at 04:04:04PM +0800, Zhang, Yanmin wrote:
> > I couldn't reproduce it on my EM64T machine. I instrumented function start_kernel and
> > didn't find irq was enabled before calling init_IRQ. It'll be better if the reporter could
> > instrument function start_kernel to capture which function enables irq.
> Just diving into the sources.
> Is that something like:
> if (!irqs_disabled()) printk("irqs are enabled\n");
>
> (At that moment it might have crashed already.. :-)).
>
> I don't see the complete context yet, but I hope the irq is
> triggered after the irq is somehow enabled.
>
> BTW: the panic occurs on half of my boards on tyan S2891 with 2
> opterons, of which the only difference seems to be the purchase
> date (and hence probably the motherboard revisions). (Haven't got
> time yet to pull them out of the rack and compare the
> motherboards).

Please, I'm still waiting for someone to tell me whether this "fixes" it:


--- a/lib/rwsem-spinlock.c~down_write-preserve-local-irqs
+++ a/lib/rwsem-spinlock.c
@@ -195,13 +195,14 @@ void fastcall __sched __down_write_neste
 {
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk;
+	unsigned long flags;
 
-	spin_lock_irq(&sem->wait_lock);
+	spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (sem->activity == 0 && list_empty(&sem->wait_list)) {
 		/* granted */
 		sem->activity = -1;
-		spin_unlock_irq(&sem->wait_lock);
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 		goto out;
 	}
 
@@ -216,7 +217,7 @@ void fastcall __sched __down_write_neste
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we don't need to touch the semaphore struct anymore */
-	spin_unlock_irq(&sem->wait_lock);
+	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	/* wait to be given the lock */
 	for (;;) {
_
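
For readers following the patch above, here is a minimal userspace sketch of
the difference it relies on: the _irq unlock variant enables interrupts
unconditionally, while the irqsave/irqrestore pair puts back whatever state
the caller had. This is only a model, not kernel code. The irqs_on flag and
every model_* and down_write_* name below are invented for illustration.

```c
#include <stdbool.h>

/* Userspace model only: one flag stands in for the CPU interrupt-enable bit. */
static bool irqs_on = false;

/* Models spin_lock_irq()/spin_unlock_irq(): unlock enables unconditionally. */
static void model_lock_irq(void)   { irqs_on = false; }
static void model_unlock_irq(void) { irqs_on = true; }

/* Models spin_lock_irqsave()/spin_unlock_irqrestore(): prior state is kept. */
static void model_lock_irqsave(bool *flags) { *flags = irqs_on; irqs_on = false; }
static void model_unlock_irqrestore(bool flags) { irqs_on = flags; }

/* Uncontended write lock before the patch; returns the resulting IRQ state. */
static bool down_write_old(void)
{
    irqs_on = false;            /* early boot: interrupts start out disabled */
    model_lock_irq();
    model_unlock_irq();         /* side effect: interrupts are now enabled */
    return irqs_on;
}

/* Uncontended write lock after the patch; returns the resulting IRQ state. */
static bool down_write_patched(void)
{
    bool flags;

    irqs_on = false;            /* early boot: interrupts start out disabled */
    model_lock_irqsave(&flags);
    model_unlock_irqrestore(flags);
    return irqs_on;             /* still disabled, as the caller left them */
}
```

In this model, down_write_old() leaves interrupts enabled for a caller that
entered with them disabled, and down_write_patched() does not.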

2006-12-21 20:26:42

by Ard van Breemen

Subject: Re: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine

Hello,

On Thu, Dec 21, 2006 at 04:04:04PM +0800, Zhang, Yanmin wrote:
> I couldn't reproduce it on my EM64T machine. I instrumented function start_kernel and
> didn't find irq was enabled before calling init_IRQ. It'll be better if the reporter could
> instrument function start_kernel to capture which function enables irq.
Just diving into the sources.
Is that something like:
if (!irqs_disabled()) printk("irqs are enabled\n");

(At that moment it might have crashed already.. :-)).

I don't see the complete context yet, but I hope the irq is
triggered after the irq is somehow enabled.

BTW: the panic occurs on half of my boards on tyan S2891 with 2
opterons, of which the only difference seems to be the purchase
date (and hence probably the motherboard revisions). (Haven't got
time yet to pull them out of the rack and compare the
motherboards).


--
program signature;
begin { telegraaf.com
} writeln("<[email protected]> TEM2");
end
.

2006-12-21 21:05:43

by Ard van Breemen

Subject: Re: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine

On Thu, Dec 21, 2006 at 04:04:04PM +0800, Zhang, Yanmin wrote:
> I couldn't reproduce it on my EM64T machine. I instrumented function start_kernel and
> didn't find irq was enabled before calling init_IRQ. It'll be better if the reporter could
> instrument function start_kernel to capture which function enables irq.

Editing init/main.c:
	preempt_disable();
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	printk("BLAAT17");
	build_all_zonelists();
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	printk("BLAAT18");
	page_alloc_init();
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	printk("BLAAT19");
	printk(KERN_NOTICE "Kernel command line: %s\n", saved_command_line);
	parse_early_param();
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	printk("BLAAT20");
	parse_args("Booting kernel", command_line, __start___param,
		   __stop___param - __start___param,
		   &unknown_bootoption);
	printk("BLAAT21");
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	sort_main_extable();
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	printk("BLAAT22");
	trap_init();
	if (!irqs_disabled())
		printk("start_kernel(): bug: interrupts were enabled early\n");
	printk("BLAAT23");

Results in:
Allocating PCI resources starting at 88000000 (gap: 80000000:60000000)
BLAAT12BLAAT13<6>PERCPU: Allocating 32960 bytes of per cpu data
BLAAT14BLAAT15BLAAT16BLAAT17Built 2 zonelists. Total pages: 1032635
BLAAT18BLAAT19<5>Kernel command line: console=tty0 console=ttyS0,115200 hdb=noprobe hdc=noprobe hdd=noprobe root=/dev/md0 ro panic=30 earlyprintk=serial,ttyS0,115200
BLAAT20<6>ide_setup: hdb=noprobe
ide_setup: hdc=noprobe
ide_setup: hdd=noprobe
BLAAT21start_kernel(): bug: interrupts were enabled early
start_kernel(): bug: interrupts were enabled early
BLAAT22Initializing CPU#0

Hmmm, that actually doesn't make sense to me (unless parse_args is able to enable irqs).
--
program signature;
begin { telegraaf.com
} writeln("<[email protected]> TEM2");
end
.
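
The repeated check-and-print pairs in the instrumented start_kernel() above
can also be expressed as a walk over a table of init steps that stops at the
first step leaving interrupts enabled. What follows is a userspace sketch
under stated assumptions, not something that can be dropped into init/main.c:
the irqs_on flag, struct init_step, the model_* stubs and
first_offending_step() are all hypothetical, and the stub for parse_args
simply assumes the behaviour seen in the log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace model only: one flag stands in for the CPU interrupt-enable bit. */
static bool irqs_on = false;

struct init_step {
    const char *name;
    void (*fn)(void);
};

/* Hypothetical stand-ins for the real init calls. */
static void model_build_all_zonelists(void) { }
static void model_parse_args(void) { irqs_on = true; }  /* assumed culprit */
static void model_trap_init(void) { }

static const struct init_step steps[] = {
    { "build_all_zonelists", model_build_all_zonelists },
    { "parse_args",          model_parse_args },
    { "trap_init",           model_trap_init },
};

/* Returns the index of the first step that leaves IRQs enabled, or -1. */
static int first_offending_step(void)
{
    irqs_on = false;            /* early boot: interrupts start out disabled */
    for (size_t i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
        steps[i].fn();
        if (irqs_on) {
            printf("bug: interrupts enabled after %s()\n", steps[i].name);
            return (int)i;
        }
    }
    return -1;
}
```

Stopping at the first offender avoids the repeated warning from every later
checkpoint once interrupts have been turned on.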

2006-12-22 18:42:55

by Ard van Breemen

Subject: Re: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine

Hello,
On Thu, Dec 21, 2006 at 04:04:04PM +0800, Zhang, Yanmin wrote:
> I couldn't reproduce it on my EM64T machine. I instrumented function start_kernel and
> didn't find irq was enabled before calling init_IRQ. It'll be better if the reporter could
> instrument function start_kernel to capture which function enables irq.

I can confirm this is a *GENERIC* X86_64 problem:
----
Kernel command line: console=tty0 console=ttyS0,115200 hdb=noprobe root=/dev/md0
init/main.c start_kernel(): interrupts were disabled@525
ide_setup: hdb=noprobe
init/main.c start_kernel(): interrupts were enabled@529
...
start_kernel(): bug: interrupts were enabled early
----
This is on a Dell 1950 with Core 2 Duo processors.

You need IDE compiled in and IDE options set on the command line to get the
irqs enabled, and then a setup that has an irq pending before the irq
controller gets initialized to get the panic. The Dell 1950 does not panic; the
kernel merely warns.

I am pretty sure the i386 tree has the same problem but I haven't checked yet.
Anyway: the panic is just a way of noticing. The bug is that irq's are enabled
before the irq controller is set up.

But to make the ide_setup/irq bug go away, I think it might be an acceptable
solution to just disable the irqs again after parse_args, and to wait
until the SATA tree takes over from the IDE tree.

--
program signature;
begin { telegraaf.com
} writeln("<[email protected]> TEM2");
end
.
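
The workaround proposed in the message above (disable irqs again right after
parse_args) can be sketched in a userspace model as follows. This is an
illustration under stated assumptions, not a tested kernel patch: the two
flags, the model_* helpers and boot() are invented names, and the model
assumes, as in the report, that an interrupt is already pending before the
controller is initialized.

```c
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only: flags stand in for CPU and controller state. */
static bool irqs_on = false;
static bool irq_controller_ready = false;

static void model_parse_args(void)        { irqs_on = true; }   /* observed side effect */
static void model_local_irq_disable(void) { irqs_on = false; }
static void model_init_IRQ(void)          { irq_controller_ready = true; }

/* A pending interrupt is fatal if it is delivered before the controller is set up. */
static bool pending_irq_is_fatal(void)
{
    return irqs_on && !irq_controller_ready;
}

/* Returns true if the modelled boot would hit the early-interrupt panic. */
static bool boot(bool apply_workaround)
{
    bool fatal;

    irqs_on = false;
    irq_controller_ready = false;

    model_parse_args();
    if (apply_workaround && irqs_on) {
        printf("start_kernel(): bug: interrupts were enabled early, disabling again\n");
        model_local_irq_disable();
    }

    fatal = pending_irq_is_fatal();     /* window before init_IRQ() runs */
    model_init_IRQ();
    return fatal;
}
```

In this model boot(false) reports the fatal window and boot(true) does not.
The workaround only hides the symptom; whatever enabled interrupts during
parse_args is still there.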

2006-12-22 19:37:42

by Stefano Takekawa

Subject: Re: [Bug 7505] Linux-2.6.18 fails to boot on AMD64 machine

> I am pretty sure the i386 tree has the same problem but I haven't checked yet.
> Anyway: the panic is just a way of noticing. The bug is that irq's are enabled
> before the irq controller is set up.

A very similar i386 Linux installation works fine on my laptop; that
i386 kernel has never had a problem.

--
Stefano Takekawa
[email protected]

Frank: And why do days get longer in the summer?
Ernest: Because heat makes things expand!