2002-09-30 12:08:26

by Federico Sevilla III

Subject: kernel panic "killing interrupt handler" and kernel BUG at sched.c:468

Hi everyone,

On our server that had been running for 55 days with this 2.4.19-xfs
kernel (XFS CVS snapshot 20020809 patched with RML's preempt patches for
2.4.19-rc3 and sys-magic 20020314 from Randy Dunlap, built using GCC
3.1.1 running Debian GNU/Linux), I hit a kernel panic in the process
running the distributed-net client. I had been running the
distributed-net client -- and everything else on the server -- with no
significant changes recently. The server wasn't under any significantly
different load, either.

I copied the oops by hand onto another computer and am attaching a
ksymoops output of that as kernel-panic.out. I rebooted and a few
minutes after all the initialization had completed I hit another kernel
panic, again because of the distributed-net client process. The oops
(passed through ksymoops) is attached as kernel-panic-2.out.

After copying the oops message, I attempted to sync the disks using the
(Alt + SysRq + S) key combination and after the sync messages I hit a
kernel BUG at sched.c:568. In my sched.c (different from the XFS tree
only because of RML's preempt patch) line 568 is in the "asmlinkage void
schedule(void)" function. The oops (passed through ksymoops) is attached
as kernel-bug.out.

Some other notes that may be significant to mention:

- system is an Intel Pentium III with 512MB RAM and a 3ware 6400 IDE
  RAID controller,
- system has one small ext2 partition for /boot, one ReiserFS
  partition for Squid cache, and 5 XFS partitions,
- system is an NFS server, with NFSv3 enabled in the kernel and
  running nfs-kernel-server 1.0.2,
- system is not exclusively an NFS server; it is also a Samba, mail,
  and IRC server, and runs lm-sensors,
- this happened during a lull in the load because everyone was on
  their way home at the end of our work day.

I am recompiling the kernel now, using a current CVS snapshot of the XFS
tree, and using Debian Sid's current default gcc (2.95.4 20011002)
instead of gcc 3.1.1 like before, and without RML's preempt patch (the
sys-magic patch does not touch sched.c and probably didn't have anything
to do with this). I turned off distributed-net as soon as I rebooted
this third time, and the system's alive so far and was able to recompile
the kernel. I will turn it back on when I boot with the new kernel and
will send a follow-up if the kernel panics again.

Pointers as to what probably caused this are welcome. If this is a "new"
issue I hope the decoded oops messages will be of help. Thank you everyone
for your time.

--> Jijo

--
Federico Sevilla III : http://jijo.free.net.ph
Network Administrator : The Leather Collection, Inc.
GnuPG Key ID : 0x93B746BE



2002-09-30 16:24:34

by Federico Sevilla III

Subject: Re: kernel panic "killing interrupt handler" and kernel BUG at sched.c:468

Hi everyone,

On Mon, Sep 30, 2002 at 08:13:24PM +0800, Federico Sevilla III wrote:
> On our server that had been running for 55 days with this 2.4.19-xfs
> kernel (XFS CVS snapshot 20020809 patched with RML's preempt patches
> for 2.4.19-rc3 and sys-magic 20020314 from Randy Dunlap, built using
> GCC 3.1.1 running Debian GNU/Linux)

And then after approximately 1.5 hours on 2.4.19-xfs (XFS CVS snapshot
20020930 patched only with Randy Dunlap's sys-magic 20020314, built
using gcc 2.95.4 running Debian GNU/Linux).

> I hit a kernel panic in the process running the distributed-net
> client.

Now in the process running syslogd. Attached kernel-panic-3.out is the
oops.

> After copying the oops message, I attempted to sync the disks using
> the (Alt + SysRq + S) key combination and after the sync messages I
> hit a kernel BUG at sched.c:568.

I did this, as well, and hit a kernel BUG at sched.c:566. My sched.c is
now exactly the same as the one in the XFS tree (which I think is the
same as the one in vanilla 2.4.19). Oops is attached as
kernel-bug-2.out.

> Pointers as to what probably caused this are welcome. If this is a
> "new" issue I hope the decoded oops messages will be of help. Thank you
> everyone for your time.

System has been up for two hours so far and is thankfully alive. I hope
someone can point me to what the probable cause of this problem is. I
did an xfs_check -- which is from the latest 2.3.1 package -- and all my
XFS partitions are okay. I will be looking for a memory scanning tool
shortly to make sure I don't suddenly have bad RAM, although the problem
doesn't seem to be that.

Thanks again to everyone for your time.

--> Jijo

--
Federico Sevilla III : http://jijo.free.net.ph
Network Administrator : The Leather Collection, Inc.
GnuPG Key ID : 0x93B746BE


Attachments:
(No filename) (1.77 kB)
kernel-panic-3.out (2.81 kB)
kernel-bug-2.out (3.60 kB)