2000-11-15 08:24:00

by Rogier Wolff

Subject: 2.4. continues after Aieee...


Shouldn't the system be "halted" after an "Aiee, killing interrupt
handler"?


Modem status change from 0x63 to 0xf3
Unable to handle kernel NULL pointer dereference at virtual address 00000629
printing eip:
c4854fcc
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c4854fcc>]
EFLAGS: 00010002
eax: 00000620 ebx: c1e80000 ecx: c1f28000 edx: 00000000
esi: c2749800 edi: 000000f3 ebp: c3ba6000 esp: c26d7dc0
ds: 0018 es: 0018 ss: 0018
Process agetty (pid: 299, stackpage=c26d7000)
Stack: c487f3e3 c3ba6578 c487f3e2 00000212 00000145 00010082 c487f3e9 00000246
c3ba6578 c487f3e8 c4855603 c1e80000 00000002 c3ba6000 c487f3e2 c3ba6000
0000000b c3ba6000 c26d7eb4 c0274400 00000002 0002001d c4859d8f c1e80000
Call Trace: [<c487f3e3>] [<c487f3e2>] [<c487f3e9>] [<c487f3e8>] [<c4855603>] [<c487f3e2>] [<c4859d8f>]
[<c484f358>] [<c484f471>] [<c010b681>] [<c010b7f2>] [<c010a4e0>] [<c0116ce9>] [<c0122a4d>] [<c01233ed>]
[<c0123683>] [<c01235bc>] [<c014c87c>] [<c012e032>] [<c010a423>]
Code: f6 40 09 08 0f 85 22 01 00 00 8b 86 bc 00 00 00 a8 06 0f 84
Aiee, killing interrupt handler
Scheduling in interrupt
kernel BUG at sched.c:692!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c0116019>]
EFLAGS: 00010292
eax: 0000001b ebx: 00000000 ecx: c1f28000 edx: 00000000
esi: 00000000 edi: 0000000b ebp: c26d7cb8 esp: c26d7c68
ds: 0018 es: 0018 ss: 0018
Process agetty (pid: 299, stackpage=c26d7000)
Stack: c01eb041 c01eb216 000002b4 c1172160 c26d6000 0000000b 00000282 c26d6000
00000020 00000086 00000000 c3fca000 c26d6000 c26d6000 c011a9cf c26d6000
c1172160 00000000 c26d6000 00000629 00000629 c011abca 00000000 00000000
Call Trace: [<c01eb041>] [<c01eb216>] [<c011a9cf>] [<c011abca>] [<c0111a88>] [<c010a956>] [<c0111da6>]
[<c01ea15e>] [<c0111a88>] [<c010e586>] [<c010b681>] [<c01e16c1>] [<c01e16c1>] [<c0188b92>] [<c010a564>]
[<c4854fcc>] [<c487f3e3>] [<c487f3e2>] [<c487f3e9>] [<c487f3e8>] [<c4855603>] [<c487f3e2>] [<c4859d8f>]
[<c484f358>] [<c484f471>] [<c010b681>] [<c010b7f2>] [<c010a4e0>] [<c0116ce9>] [<c0122a4d>] [<c01233ed>]
[<c0123683>] [<c01235bc>] [<c014c87c>] [<c012e032>] [<c010a423>]
Code: 0f 0b 90 8d 65 bc 5b 5e 5f 89 ec 5d c3 89 f6 55 89 e5 83 ec
Aiee, killing interrupt handler
Scheduling in interrupt
kernel BUG at sched.c:692!
invalid operand: 0000


After this, the call trace becomes longer and longer, but the system
keeps on oopsing...

Roger.




--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* Common sense is the collection of *
****** prejudices acquired by age eighteen. -- Albert Einstein ********


2000-11-15 16:47:10

by Dennis

Subject: Re: 2.4. continues after Aieee...

At 02:53 AM 11/15/2000, Rogier Wolff wrote:

>Shouldn't the system be "halted" after an "Aiee, killing interrupt
>handler"?
>

This brings up another question. Has there been any work done to force Linux
to reboot on all panics? Linux's propensity to crash drivers (say the
network card driver) and leave the system running makes Linux unusable in
unattended environments, as the machine is functionally dead.

A simple switch that forces a reboot on panic would do much to alleviate the
problem.

DB

2000-11-15 17:01:21

by Rogier Wolff

Subject: Re: 2.4. continues after Aieee...

Dennis wrote:
> At 02:53 AM 11/15/2000, Rogier Wolff wrote:
>
> >Shouldn't the system be "halted" after an "Aiee, killing interrupt
> >handler"?
> >
>
> This brings another question. Has there been any work done to force linux
> to reboot on all panics? Linux's propensity to crash drivers (say the

You already have the option to say what happens on panic.
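
For reference, the option meant here is presumably the kernel.panic
setting (the panic= boot parameter, also exposed as
/proc/sys/kernel/panic): the number of seconds the kernel waits after a
panic before rebooting, with 0 meaning it waits forever. Below is a
minimal sketch of setting it from user space. The helper names are
hypothetical, and writing the real file requires root.

```python
PANIC_SYSCTL = "/proc/sys/kernel/panic"  # standard location of the kernel.panic sysctl


def set_panic_timeout(seconds, path=PANIC_SYSCTL):
    """Write the reboot-after-panic delay in seconds (0 = do not reboot)."""
    with open(path, "w") as f:
        f.write(f"{int(seconds)}\n")


def get_panic_timeout(path=PANIC_SYSCTL):
    """Read back the currently configured delay."""
    with open(path) as f:
        return int(f.read().strip())


# Example (as root): reboot 30 seconds after any panic.
# set_panic_timeout(30)
```

Note that this only covers real panics; an oops that the kernel
survives never reaches this path.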

> network card driver) and leave the system running make linux unusable in
> unattended environments as the machine is functionally dead.

Which doesn't help in this case, as your network card COULD be dead,
while the system simply hasn't crashed....

Roger.




2000-11-16 05:18:22

by David Feuer

Subject: Re: 2.4. continues after Aieee...

At 05:30 PM 11/15/2000 +0100, Rogier Wolff wrote:

> > network card driver) and leave the system running make linux unusable in
> > unattended environments as the machine is functionally dead.
>
>Which doesn't help in this case, as your network card COULD be dead,
>while the system simply hasn't crashed....

Yeah, but it doesn't matter. The system is no more useful running with a
dead network card than it is rebooting itself. Just make sure that it doesn't
reboot itself more than N times in M hours, and you'll be fine... The
network admin needs to be paged in any case. The network card COULD be
dead, in which case the administrator needs to replace it. Otherwise, a
reboot could solve the problem.
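
The "no more than N times in M hours" rule could be sketched as a
sliding-window limiter. This is only an illustration: the class name
and interface are made up here, and a real version would have to keep
its history on disk, since it must survive the very reboots it counts.

```python
from collections import deque


class RebootLimiter:
    """Allow at most max_reboots reboots within any window_hours window."""

    def __init__(self, max_reboots, window_hours):
        self.max_reboots = max_reboots
        self.window_seconds = window_hours * 3600
        self.history = deque()  # timestamps (seconds) of allowed reboots

    def allow(self, now):
        """Return True and record the reboot if it is within the limit."""
        # Forget reboots that have fallen out of the window.
        while self.history and now - self.history[0] >= self.window_seconds:
            self.history.popleft()
        if len(self.history) >= self.max_reboots:
            return False  # too many recent reboots: stay up and page the admin
        self.history.append(now)
        return True
```

With N=2 and M=1, a third reboot request inside the same hour is
refused, and requests are accepted again once the oldest reboot is
more than an hour old.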

--
This message has been brought to you by the letter alpha and the number pi.
Open Source: Think locally, act globally.
David Feuer
[email protected]

2000-11-16 15:33:34

by Russell King

Subject: Re: 2.4. continues after Aieee...

Rogier Wolff wrote:
> Dennis wrote:
> > network card driver) and leave the system running make linux unusable in
> > unattended environments as the machine is functionally dead.
>
> Which doesn't help in this case, as your network card COULD be dead,
> while the system simply hasn't crashed....

Not every case causes a panic either. This week, I had an instance of
an i686 box lock solid with a DFE-530TX net card. Rebooting/power
cycling it didn't recover it (despite it working for the past month
without any problems). It only started working again after I moved
it into a different PCI slot.

I've seen a couple of instances now on totally different hardware where
it is possible to lock a PCI bus solid by improper connections on some
of the PCI bus lines, so a faulty PCI socket seems to be the most likely
cause.

In this case, a "panic" doesn't help you; the machine experiences a
hardware lockup. To catch these, you'd need a hardware watchdog.

What I'm basically saying is that there is only a limited amount that
Linux (or any OS) can do against these types of hardware failure. If
you need better protection, try a hardware watchdog with user-space
policy implementations.
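
A rough sketch of what such a user-space policy might look like,
assuming the usual Linux watchdog device interface (a process keeps
writing to /dev/watchdog, and the hardware resets the machine when the
writes stop). The function name, the health_check callback and the
interval are all hypothetical.

```python
import time


def pet_while_healthy(wd, health_check, interval=10.0, max_pets=None,
                      sleep=time.sleep):
    """Write to the watchdog device while health_check() returns True.

    wd is an already-open binary file object for the watchdog device.
    Returns the number of writes made before the loop stopped.
    """
    pets = 0
    while max_pets is None or pets < max_pets:
        if not health_check():
            break  # stop writing; the hardware timer is left to expire
        wd.write(b"\0")
        wd.flush()
        pets += 1
        sleep(interval)
    return pets


# Example (as root), with network_is_up being your own policy function:
# wd = open("/dev/watchdog", "wb", buffering=0)
# pet_while_healthy(wd, network_is_up)
```

The caller keeps the device open after the loop stops, since whether
closing it disarms the timer depends on the driver and how it was
configured.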
_____
|_____| ------------------------------------------------- ---+---+-
| | Russell King [email protected] --- ---
| | | | http://www.arm.linux.org.uk/personal/aboutme.html / / |
| +-+-+ --- -+-
/ | THE developer of ARM Linux |+| /|\
/ | | | --- |
+-+-+ ------------------------------------------------- /\\\ |

2000-11-16 16:05:33

by Dennis

Subject: Re: 2.4. continues after Aieee...


>
>Not every case causes a panic either. This week, I had an instance of
>an i686 box lock solid with a DFE-530TX net card. Rebooting/power
>cycling it didn't recover it (despite it working for the past month
>without any problems). It only started working again after I moved
>it into a different PCI slot.
>
>I've seen a couple of instances now on totally different hardware where
>it is possible to lock a PCI bus solid by improper connections on some
>of the PCI bus lines, so a faulty PCI socket seem to be the most likely
>cause.


There's nothing that software can do about a PCI bus lockup. You need a
hardware watchdog to reboot the system for this type of failure.

PCI has a very tight spec, and running a card (say on an extender) or with
another card that has too many loads can cause a bus failure. If you have
more than 4 cards on the bus you are out of spec, for example.

But that doesn't change the panic issue. If you have hardware problems you
can't expect any OS to help you; you need new hardware.

Dennis


2000-11-16 16:41:42

by Russell King

Subject: Re: 2.4. continues after Aieee...

Dennis writes:
> >Not every case causes a panic either. This week, I had an instance of
> >an i686 box lock solid with a DFE-530TX net card. Rebooting/power
> >cycling it didn't recover it (despite it working for the past month
> >without any problems). It only started working again after I moved
> >it into a different PCI slot.
> >
> >I've seen a couple of instances now on totally different hardware where
> >it is possible to lock a PCI bus solid by improper connections on some
> >of the PCI bus lines, so a faulty PCI socket seem to be the most likely
> >cause.
>
>
> theres nothing that software can do with a pci bus lockup. You need a
> hardware watchdog to reboot the system for this type of failure.

If you read on, you'll discover I did in fact say this.