2004-09-17 14:07:05

by Thierry Coutelier

Subject: Freeze on 2.4 kernels.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Every few weeks (sometimes after 2 days, often after 3 weeks, and
sometimes only after 9 weeks) our kernel freezes: nothing appears on the
screen or the serial console apart from some VJ decompression errors,
which we see at all times, and even Num-Lock does not respond. We tried
enabling the SysRq keys, but those do not work either.

We are using Linux boxes to offer Satellite Internet.
We still use RedHat 7.[23] and 2.4 kernels.

The system works using rp-l2tp and/or pptpd with pppd.
On the outgoing interface (the one that sends traffic to the
satellite) we were using CBQ and now use the HTB queuing discipline.

The kernels range from 2.4.6 to 2.4.25 with some modifications
(tcp_input). We tried with the standard kernel with the only
change that the dev_alloc_name has been changed to support
up to 900 names.


The hardware is Dell PowerEdge with Perc2 or Perc3. We tried with HP
servers and have the same problem. We tried different firmware releases
for the Perc cards and still no change.

The NICs are mostly Intel EEpro 100. We tried both the Intel and the
community drivers, with no better results.

The problem may happen more often (every 2 to 3 days) when we
simulate a lot of ppp connections/disconnections (80 users/minute),
but in some cases it hangs even without many users.

The platforms we run have between 25 and 200 simultaneous connections.
They have single, dual, or even quad CPUs, and between 512 Mbytes and
4 Gbytes of RAM.

We could not detect any parameter (load, memory, swap ...) that rises
before the freeze.
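
One way to look for such a trend is to append a timestamped sample to a
log file every few seconds and force each line to disk, so that the last
samples before a freeze survive the reboot. The following is only a
minimal sketch in Python 3, not something that was run on these boxes:
the function names, the log format and the 5 second interval are
assumptions made for illustration. It reads /proc/loadavg and
/proc/meminfo, which exist on 2.4 kernels.

```python
import os
import time


def parse_meminfo(text):
    """Map keys such as 'MemFree' to the first integer on 'Key: value' lines."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines and anything whose first field is not a number.
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def format_sample(now, loadavg_text, meminfo_text):
    """Build one log line from the raw contents of the two /proc files."""
    load1 = loadavg_text.split()[0]
    mem = parse_meminfo(meminfo_text)
    return "%d load1=%s memfree_kb=%s swapfree_kb=%s" % (
        now, load1, mem.get("MemFree"), mem.get("SwapFree"))


def log_forever(path, interval=5):
    """Append one sample every `interval` seconds and fsync after each line."""
    with open(path, "a") as log:
        while True:
            with open("/proc/loadavg") as f:
                loadavg_text = f.read()
            with open("/proc/meminfo") as f:
                meminfo_text = f.read()
            log.write(format_sample(time.time(), loadavg_text, meminfo_text) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next sleep
            time.sleep(interval)
```

Calling log_forever() with a path on a local disk (the path is up to the
operator) leaves a file whose last lines show the state shortly before
the hang.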

Could anyone give me a hint as to what more to do or test?
Where could the problem be?


Here are some stack traces that we were able to get on a serial console:




Unable to handle kernel paging request at virtual address 5f47534d
 printing eip:
c0119923
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0119923>] Not tainted
EFLAGS: 00010046
eax: 00000000 ebx: 00000000 ecx: 00000001 edx: 5f47534d
esi: dfff6000 edi: 00000001 ebp: d0939fbc esp: d0939f90
ds: 0018 es: 0018 ss: 0018
Process ip-up (pid: 27133, stackpage=d0939000)
Stack: 00000000 c0306438 00000000 5f47534d 00000000 d0938000 fffffc18 c0374ce0
       d0938000 00000001 00000006 bfffe148 c010932d 00000000 082a73a4 082a71b8
       00000001 00000006 bfffe148 082a71c4 0000002b 0000002b ffffff00 08096305
Call Trace: [<c010932d>]

Code: 8b 02 89 45 e0 0f 18 00 81 fa 20 6e 30 c0 0f 85 79 ff ff ff


After ksymoops analysis :

ksymoops 2.4.1 on i686 2.4.25. Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.25/ (default)
     -m /boot/System.map-2.4.25 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Warning (compare_maps): ip_conntrack symbol
GPLONLY_ip_conntrack_expect_find_get not found in
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o. Ignoring
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): ip_conntrack symbol GPLONLY_ip_conntrack_expect_put
not found in /lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o.
Ignoring /lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): ip_conntrack symbol GPLONLY_ip_conntrack_find_get
not found in /lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o.
Ignoring /lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): ip_conntrack symbol GPLONLY_ip_conntrack_put not
found in /lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o.
Ignoring /lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol ip_conntrack_destroyed ,
ip_conntrack says e0ed4b78,
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o says
e0ed42e4. Ignoring
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol ip_conntrack_hash ,
ip_conntrack says e0ed4b90,
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o says
e0ed42fc. Ignoring
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol ip_conntrack_htable_size ,
ip_conntrack says e0ed4b7c,
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o says
e0ed42e8. Ignoring
/lib/modules/2.4.25/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol usb_devfs_handle , usbcore says
e0d4e274, /lib/modules/2.4.25/kernel/drivers/usb/usbcore.o says e0d4dcd4.
Ignoring /lib/modules/2.4.25/kernel/drivers/usb/usbcore.o entry
Unable to handle kernel paging request at virtual address 5f47534d
c0119923
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0119923>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: 00000000 ebx: 00000000 ecx: 00000001 edx: 5f47534d
esi: dfff6000 edi: 00000001 ebp: d0939fbc esp: d0939f90
ds: 0018 es: 0018 ss: 0018
Process ip-up (pid: 27133, stackpage=d0939000)
Stack: 00000000 c0306438 00000000 5f47534d 00000000 d0938000 fffffc18 c0374ce0
       d0938000 00000001 00000006 bfffe148 c010932d 00000000 082a73a4 082a71b8
       00000001 00000006 bfffe148 082a71c4 0000002b 0000002b ffffff00 08096305
Call Trace: [<c010932d>]
Code: 8b 02 89 45 e0 0f 18 00 81 fa 20 6e 30 c0 0f 85 79 ff ff ff


|>>>EIP; c0119923 <schedule+173/4c0> <=====

Trace; c010932d <reschedule+5/c>
Code; c0119923 <schedule+173/4c0>
00000000 <_EIP>:
Code; c0119923 <schedule+173/4c0> <=====
   0:   8b 02                     mov    (%edx),%eax   <=====
Code; c0119925 <schedule+175/4c0>
   2:   89 45 e0                  mov    %eax,0xffffffe0(%ebp)
Code; c0119928 <schedule+178/4c0>
   5:   0f 18 00                  prefetchnta (%eax)
Code; c011992b <schedule+17b/4c0>
   8:   81 fa 20 6e 30 c0         cmp    $0xc0306e20,%edx
Code; c0119931 <schedule+181/4c0>
   e:   0f 85 79 ff ff ff         jne    ffffff8d <_EIP+0xffffff8d> c01198b0 <schedule+100/4c0>


9 warnings issued. Results may not be reliable.

- ----

ksymoops 2.4.1 on i686 2.4.25-SES. Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.25-SES/ (default)
     -m /boot/System.map-2.4.25-SES (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Warning (compare_maps): ip_conntrack symbol
GPLONLY_ip_conntrack_expect_find_get not found in
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o. Ignoring
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): ip_conntrack symbol GPLONLY_ip_conntrack_expect_put
not found in
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o. Ignoring
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): ip_conntrack symbol GPLONLY_ip_conntrack_find_get
not found in
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o. Ignoring
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): ip_conntrack symbol GPLONLY_ip_conntrack_put not
found in /lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o.
Ignoring /lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o
entry
Warning (compare_maps): mismatch on symbol ip_conntrack_destroyed ,
ip_conntrack says e0edab98,
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o says
e0eda304. Ignoring
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol ip_conntrack_hash ,
ip_conntrack says e0edabb0,
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o says
e0eda31c. Ignoring
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol ip_conntrack_htable_size ,
ip_conntrack says e0edab9c,
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o says
e0eda308. Ignoring
/lib/modules/2.4.25-SES/kernel/net/ipv4/netfilter/ip_conntrack.o entry
Warning (compare_maps): mismatch on symbol my_classid , sch_miq says
e0ed1a20, /lib/modules/2.4.25-SES/kernel/net/sched/sch_miq.o says e0ed19a0.
Ignoring /lib/modules/2.4.25-SES/kernel/net/sched/sch_miq.o entry
Warning (compare_maps): mismatch on symbol usb_devfs_handle , usbcore says
e0d4e294, /lib/modules/2.4.25-SES/kernel/drivers/usb/usbcore.o says
e0d4dcf4. Ignoring /lib/modules/2.4.25-SES/kernel/drivers/usb/usbcore.o
entry
Unable to handle kernel NULL pointer dereference at virtual address 00000000
c0119923
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0119923>] Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: 00000000 ebx: 083ca488 ecx: 00000001 edx: 00000000
esi: dfff6000 edi: 00000001 ebp: cc8a1f98 esp: cc8a1f6c
ds: 0018 es: 0018 ss: 0018
Process tbectrld (pid: 21783, stackpage=cc8a1000)
Stack: 00000082 cb1b44c0 cc8a0000 00000000 c01209dc cc8a0000 fffffc18 c0376ce0
       c160c220 dfffb200 cc8a0000 00000000 c0120e4d cc8a0000 c160c220 cc8a0000
       40170c44 00000000 bffffcd8 c0120fc3 00000000 c010927f 00000000 00001000
Call Trace: [<c01209dc>] [<c0120e4d>] [<c0120fc3>] [<c010927f>]
Code: 8b 02 89 45 e0 0f 18 00 81 fa a0 7f 30 c0 0f 85 79 ff ff ff


|>>>EIP; c0119923 <schedule+173/4c0> <=====

Trace; c01209dc <exit_notify+dc/360>
Trace; c0120e4d <do_exit+1ed/330>
Trace; c0120fc3 <sys_exit+13/20>
Trace; c010927f <system_call+33/38>
Code; c0119923 <schedule+173/4c0>
00000000 <_EIP>:
Code; c0119923 <schedule+173/4c0> <=====
   0:   8b 02                     mov    (%edx),%eax   <=====
Code; c0119925 <schedule+175/4c0>
   2:   89 45 e0                  mov    %eax,0xffffffe0(%ebp)
Code; c0119928 <schedule+178/4c0>
   5:   0f 18 00                  prefetchnta (%eax)
Code; c011992b <schedule+17b/4c0>
   8:   81 fa a0 7f 30 c0         cmp    $0xc0307fa0,%edx
Code; c0119931 <schedule+181/4c0>
   e:   0f 85 79 ff ff ff         jne    ffffff8d <_EIP+0xffffff8d> c01198b0 <schedule+100/4c0>


10 warnings issued. Results may not be reliable.


- --
Thierry Coutelier Président LiLux asbl
7, Rue Jacques Sturm L-2556 Luxembourg
Office:+352 710725 608 Home:+352 406776
http://www.lilux.lu/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFBSu6SPOfrcNNQX7oRAge+AJ9fdYdf0/AxEbDdd/LGaJBs0BU28wCfX9ja
FCZr3mX4ox6hHinkcRMXUy8=
=q7bP
-----END PGP SIGNATURE-----
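
A quick check on the first oops in the report above (faulting virtual
address 5f47534d, which is also the value in edx) is to see whether the
bad pointer decodes as printable text. If it does, that can hint at a
pointer being overwritten with string data, although on its own it does
not say what did the overwriting. A minimal sketch in Python 3, assuming
a 32-bit little-endian machine as in the i686 traces; the helper name is
hypothetical:

```python
import struct


def pointer_as_text(value):
    """Return the 4 bytes of a 32-bit little-endian value as ASCII text,
    or None if any byte is not a printable ASCII character."""
    raw = struct.pack("<I", value)
    if all(0x20 <= b < 0x7f for b in raw):
        return raw.decode("ascii")
    return None


print(pointer_as_text(0x5f47534d))  # MSG_
print(pointer_as_text(0xc0306e20))  # None (the address compared against edx)
```

The bytes 4d 53 47 5f spell "MSG_" in memory order. The second oops has
edx = 00000000, which does not decode as text.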


2004-09-20 04:46:13

by Jussi Hamalainen

Subject: Re: Freeze on 2.4 kernels.

On Fri, 17 Sep 2004, Thierry Coutelier wrote:

> The kernels range from 2.4.6 to 2.4.25 with some modifications
> (tcp_input). We tried with the standard kernel with the only
> change that the dev_alloc_name has been changed to support
> up to 900 names.
>
>
> The Hardware are Dell PowerEdge with Perc2 or Perc3. We tried with HP
> servers and have the same problem. We tried different firmware releases
> for the Perc cards and still no change.

I think I might be experiencing the same problem here with dual-p3
1.4GHz PE2550 boxes without PERC. We have a bunch of them doing SMTP
and webmail and every now and then one of them freezes for no
apparent reason. I don't get _anything_ on the console and nothing in
the logs. Haven't tried serial console though.

This isn't a big problem for me since this happens randomly about
every 9 weeks or so. Since the boxes are in redundant pairs I've just
shrugged it off as being a general case of piece-of-crap PC hardware.
I just thought I should add my two cents' worth...

I've gone through kernels 2.4.16 to 2.4.26 with and without various
patches and nothing seems to make a difference.

--
-=[ Count Zero / TBH - Jussi Hämäläinen - email [email protected] ]=-

2004-09-20 12:45:19

by Mikael Pettersson

Subject: Re: Freeze on 2.4 kernels.

On Mon, 20 Sep 2004 07:46:10 +0300 (EEST), Jussi Hamalainen wrote:
>> The Hardware are Dell PowerEdge with Perc2 or Perc3. We tried with HP
>> servers and have the same problem. We tried different firmware releases
>> for the Perc cards and still no change.
>
>I think I might be experiencing the same problem here with dual-p3
>1.4GHz PE2550 boxes without PERC. We have a bunch of them doing SMTP
>and webmail and every now and then one of them freezes for no
>apparent reason. I don't get _anything_ on the console and nothing in
>the logs. Haven't tried serial console though.
>
>This isn't a big problem for me since this happens randomly about
>every 9 weeks or so. Since the boxes are in redundant pairs I've just
>shrugged it off as being a general case of piece-of-crap PC hardware.
>I just thought I should add my two cents' worth...

Our PE2650 (dual HT Xeons) used to have frequent lockup problems.
I was asked to look for an NMI watchdog trace, so I enabled the
I/O-APIC watchdog (nmi_watchdog=1). Since then (a year and a half ago)
the box has been rock solid. Currently running FC2 user-space with
the RHEL3 2.4.21-20 kernel.

/Mikael
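
After booting with nmi_watchdog=1, one way to confirm that the watchdog
is actually running is to check that the NMI counters in
/proc/interrupts increase on every CPU. A minimal sketch in Python 3;
the function names and the 10 second interval are assumptions made for
illustration, and the parser only relies on the line that starts with
"NMI:".

```python
import time


def nmi_counts(proc_interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts text."""
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only numeric fields; newer kernels append a description.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def watchdog_ticking(before, after):
    """True if every CPU's NMI counter increased between two snapshots."""
    return (bool(before) and len(before) == len(after)
            and all(b < a for b, a in zip(before, after)))


def check_live(interval=10):
    """Take two snapshots of /proc/interrupts `interval` seconds apart."""
    with open("/proc/interrupts") as f:
        before = nmi_counts(f.read())
    time.sleep(interval)
    with open("/proc/interrupts") as f:
        after = nmi_counts(f.read())
    return watchdog_ticking(before, after)
```

If check_live() returns False, the counters are not moving on at least
one CPU and the watchdog is probably not active there.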