LinuxLists.cc - 2.4.18-24 SMP Machine stuck in zombie state after kernel Oops

2003-07-16 07:39:54

Subject: 2.4.18-24 SMP Machine stuck in zombie state after kernel Oops

Hi,

Tried to find information about a kernel OOPS I've seen a lot of times
already on 4 different machines - but nothing seems to be said about this in
the list archives or anywhere else for that matter.

We are running 2.4.18-24 on SMP machines with 2CPUs and hyperthreading
(SuperMicro Xeon servers) and doing heavy IO to disk and networking. (Qlogic
HBAs and Intel e1000 NICs are used)

At some point the machine oopses (no scenario except heavy nfs-server like
load):

The oops doesn't appear in logs or on the console, but I've been able to use
the diagnostic keys to get the following information:

right ALT+Scroll lock

CPU 0 : swapper

[<c0106f32>] (0xc0323fc4)) default_idle
[<c0105000>] (0xc0323fd4)) empty_zero_page

CPU 1 : swapper

(<c0106f32>) (0xc 6597fb0) - default_idle
(<c011d29b>) (0xc 6597fd0) - out_of_line_bug
(<c011d449>) (8xc 659ffc) - printk

CPU 3 : swapper

[<c0106f32>] (0xc8257fb0)) default_idle
[<c011d449>][0xc8257fd0)) printk

CTRL +Scroll lock

XINETD
[<c0108dc0>]sys_sigaltstack
[<c01151b1>]flush_tlb_page
[<c0115140>]flush_tlb_page
[<c014086e>]shmem_file_setup
[<c0131109>]do_generic_file_read
[<c013121e>]generic_file_read
[<c01310b0>]do_generic_file_read
[<c0142aa6>]sys_read
[<c0108ccf>]sys_sigaltstack

CROND

[<c01198f1>]wait_for_completion
[<c011c47a>]do_fork
[<c0107718>]dump_thread
[<c0108ccf>]sys_sigaltstack

After the oops networking stack continues to function, some running daemons
continue to work (I'm seeing network traffic from the machine which
indicates that clearly), but login into the node is not possible via
console, ssh, rsh, and the majority of the application processes are dead.

Any information / pointers will be appreciated.

If any information is missing or anything I should do to help analyze next
time it happens tell me as well.

Thanks,

--
Yuval Yeret
Exanet
[email protected]
http://www.exanet.com
Tel. 972-9-9717782
Fax. 972-9-9717778

_________________________________________________________________
The new MSN 8: smart spam protection and 2 months FREE*
http://join.msn.com/?page=features/junkmail

2003-07-29 09:47:38

by yuval yeret

[permalink] [raw]

Subject: RE: 2.4.18-24 SMP Machine stuck in zombie state after kernel Oops

>Tried to find information about a kernel OOPS I've seen a lot of times
>already on 4 different machines - >but nothing seems to be said about this
>in the list archives or anywhere else for that matter.

>We are running 2.4.18-24 on SMP machines with 2CPUs and hyperthreading
>(SuperMicro Xeon >servers) and doing heavy IO to disk and networking.
>(Qlogic HBAs and Intel e1000 NICs are used)
Actually this time it happened on machines with Emulex HBAs.

>At some point the machine oopses (no scenario except heavy nfs-server like
>load):
.....snipped...
>After the oops networking stack continues to function, some running daemons
>continue to work (I'm >seeing network traffic from the machine which
>indicates that clearly), but login into the node is not >possible via
>console, ssh, rsh, and the majority of the application processes are dead.

Since then I tried to reproduce the problem with devfs disabled, and after
some time found a pattern that reproduces this scenario quite consistently.

This time I've been able to see the output of the magic keys in the log:

===============================
Magic keys show:
Jul 29 11:00:17 node1 kernel: Pid: 0, comm: swapper
Jul 29 11:00:17 node1 kernel: EIP: 0010:[default_idle+41/64] CPU: 2
Jul 29 11:00:17 node1 kernel: EIP: 0010:[<c0106e89>] CPU: 2
Jul 29 11:00:17 node1 kernel: EIP is at (2.4.18-24exa)
Jul 29 11:00:17 node1 kernel: EFLAGS: 00000246 Not tainted
Jul 29 11:00:17 node1 kernel: EAX: 00000000 EBX: c0106e60 ECX: 00000032 EDX:
c4af6000
Jul 29 11:00:17 node1 kernel: ESI: c4af6000 EDI: c4af6000 EBP: c0106e60 DS:
0018 ES: 0018
Jul 29 11:00:17 node1 kernel: CR0: 8005003b CR2: 4012faa0 CR3: 33b6a340 CR4:
000006f0
Jul 29 11:00:17 node1 kernel: Call Trace: [cpu_idle+50/80] (0xc4af7fb0))
Jul 29 11:00:17 node1 kernel: Call Trace: [<c0106f02>] (0xc4af7fb0))
Jul 29 11:00:17 node1 kernel: [printk+297/320] (0xc4af7fd0))
Jul 29 11:00:17 node1 kernel: [<c011d409>] (0xc4af7fd0))

Jul 29 11:00:20 node1 kernel: Pid: 0, comm: swapper
Jul 29 11:00:20 node1 kernel: EIP: 0010:[default_idle+41/64] CPU: 0
Jul 29 11:00:20 node1 kernel: EIP: 0010:[<c0106e89>] CPU: 0
Jul 29 11:00:20 node1 kernel: EIP is at (2.4.18-24exa)
Jul 29 11:00:20 node1 kernel: EFLAGS: 00000246 Not tainted
Jul 29 11:00:20 node1 kernel: EAX: 00000000 EBX: c0106e60 ECX: 00000032 EDX:
c031c000
Jul 29 11:00:20 node1 kernel: ESI: c031c000 EDI: c031c000 EBP: c0106e60 DS:
0018 ES: 0018
Jul 29 11:00:20 node1 kernel: CR0: 8005003b CR2: 40048794 CR3: 36963c80 CR4:
000006f0
Jul 29 11:00:20 node1 kernel: Call Trace: [cpu_idle+50/80] (0xc031dfc4))
Jul 29 11:00:20 node1 kernel: [<c0106f02>] (0xc031dfc4))
Jul 29 11:00:20 node1 kernel: [_stext+0/80] (0xc031dfd0))
Jul 29 11:00:20 node1 kernel: [<c0105000>] (0xc031dfd0))
Jul 29 11:00:20 node1 kernel:

Any information / pointers will be appreciated.

If any information is missing or anything I should do to help analyze next
time it happens tell me as well.

Thanks,

--
Yuval Yeret
Exanet
[email protected]
http://www.exanet.com
Tel. 972-9-9717782
Fax. 972-9-9717778

_________________________________________________________________
The new MSN 8: advanced junk mail protection and 2 months FREE*
http://join.msn.com/?page=features/junkmail