LinuxLists.cc - PROBLEM: crashing kernels

2000-11-25 18:48:50

Subject: PROBLEM: crashing kernels

[1.]
Kernels 2.2.14, 2.2.17, 2.4.0-test11 crash with various errors
(I know this is too simple, but this is what I could say in one line)

[2.]
We're running quite a busy site and updating our server hardware quite
often. For a while now we've been having many problems that usually end
in a crash (meaning we need to press the hardware reset to reboot).

We ran kernel 2.2.14 for a long while, but after replacing the
mainboard and the ethernet cards, the eepro100 driver (which was actually
compiled into the kernel, not built as a module) kept giving these
errors:

eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
eth0: Trying to restart the transmitter...

In these cases we couldn't do anything other than restart the system
(yeah, I know it's quite a window$ solution, but ifconfig eth0 down/up
didn't help either) by pressing ctrl-alt-del. This problem occurred
about once a day.

Ok, we thought maybe the driver isn't correct yet, and we were a
little out of date with kernel 2.2.14 anyway, so we decided to try a newer
one, first 2.2.16. This kernel didn't last long, because it also
crashed with errors like:
Oct 1 23:09:17 luna kernel: eth0: can't fill rx buffer (force 1)!
Oct 1 23:09:17 luna kernel: eth0: Tx ring dump, Tx queue 6270428 / 6270428:
Oct 1 23:09:17 luna kernel: eth0: 0 200ca000.
Oct 1 23:09:17 luna kernel: eth0: 1 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 2 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 3 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 4 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 5 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 6 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 7 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 8 200ca000.
Oct 1 23:09:17 luna kernel: eth0: 9 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 10 000ca000.
Oct 1 23:09:17 luna kernel: eth0: Printing Rx ring (next to receive into 4276039, dirty index 4

but in these cases there was no way to do a clean reboot. We
don't much like fsck, so we tried 2.2.17. It had the same problems
as above. We also tried increasing the TX_RING_SIZE and RX_RING_SIZE
values, but this didn't help either.

We were told that these Intel EtherExpress Pro cards have
some bugs, so we got another driver from Intel, called e100,
went back to the good old 2.2.14 kernel, and tried this module.
Wow, a new type of error message:

kernel: e100 - Intel(R) 82559 Fast Ethernet LAN on Motherboard
kernel: eth0: Mem:0xf4102000 IRQ:21 Speed:100 Mbps Dx:Full
kernel: Hardware receive checksums enabled
...
kernel: eth0: rx_srv: no buffers left!!!

Ok, the module supports changing the buffer sizes, so we tried increasing
them, but it didn't help.

After asking around we decided to throw out the EtherExpress Pros and
switch to something that other people use. We couldn't do that completely,
because one of these cards is integrated into the mainboard, so we couldn't
get rid of it.

We also needed to make some other changes (replacing the Mylex card with a
Mylex AcceleRAID 352), so we replaced one of the EtherExpresses with a 3Com
Tornado at the same time. As the kernel says, it's a:
kernel: eth0: 3Com 3c905C Tornado at 0x2800, 00:01:02:b7:94:4b, IRQ 18
kernel: 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
kernel: MII transceiver found at address 24, status 782d.
kernel: Enabling bus-master transmits and whole-frame receives.

We needed to switch to kernel 2.2.17 again, because it is the
only one that supports the Mylex 352.

This was the point where everything went downhill. We have 1GB of RAM in
the computer, but sometimes we run into swap. This usually led to a
server crash: the machine answered pings, but none of the ports answered.
On the console and in kern.log we got errors like this:
Nov 20 01:06:47 luna kernel: VM: do_try_to_free_pages failed for apache...
You could switch consoles, but nothing responded. Such things had
happened before with other kernels, but there we simply took the
ethernet down, the apache processes exited after some minutes, and I could
log in on the console. But with 2.2.17 that doesn't help anymore; the only
thing we could do was reset :( Then of course 2-3 hours of fsck... nice...
We tried increasing the amount of swap (originally we had just 256MB, we
increased this to 1.7GB). The same shit happened again, and usually really
fast: often while I was working on the computer my ssh session suddenly
stopped, and by then it was too late...
I checked the kernel source and saw that vmscan.c changed
between 2.2.14 and 2.2.17. Ok, maybe the author made some
mistake... But we couldn't go back to older kernels (because of the Mylex
card), so the only possibility was to go forward: we compiled the
2.4.0-test11 kernel. It could also be useful because of khttpd; at
least we could free up some memory used by apache.

The kernel compiled and the system booted up. The first three or four hours
went without errors, and it also seemed a little faster. When the number
of apache processes began to grow, I decided to try khttpd. I
configured it and started it on port 8090 (8080 was already used by
another apache). Cool, it worked. I redirected some of the requests from
apache to khttpd. This worked fine too. Well, after a while it
got a little slow (even on the fast local net it took around 3-4 seconds
until the port opened), but it was working, and the memory usage went
down. The only scary thing was that the server's ping got worse: the
packet loss had been growing since I started using khttpd. Ok, I thought,
it's not so bad, the next day I'll look at the statistics and decide whether
to use khttpd or not, now I can go to sleep... Of course during the night
the whole thing crashed: black screen, no console, no keyboard LEDs,
nothing, just hardware reset and fsck.
I thought khttpd was guilty and decided not to use it anymore. Next morning
it crashed again, now without khttpd, without high load, without high
memory usage; just the 3Com driver said:
kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
kernel: Flags; bus-master 1, full 0; dirty 112(0) current 112(0).
kernel: Transmit list 00000000 vs. f20ac200.
kernel: 0: @f20ac200 length 80000036 status 00010036
kernel: 1: @f20ac210 length 80000182 status 00010182
kernel: 2: @f20ac220 length 80000036 status 00010036
kernel: 3: @f20ac230 length 8000004a status 0001004a
kernel: 4: @f20ac240 length 80000337 status 00010337
kernel: 5: @f20ac250 length 80000036 status 00010036
kernel: 6: @f20ac260 length 800005ea status 000105ea
kernel: 7: @f20ac270 length 800005d3 status 000105d3
kernel: 8: @f20ac280 length 80000042 status 00010042
kernel: 9: @f20ac290 length 800005ea status 000105ea
kernel: 10: @f20ac2a0 length 800005e2 status 000105e2
kernel: 11: @f20ac2b0 length 800005ea status 000105ea
kernel: 12: @f20ac2c0 length 800005ea status 000105ea
kernel: 13: @f20ac2d0 length 8000024e status 0001024e
kernel: 14: @f20ac2e0 length 800005ea status 800105ea
kernel: 15: @f20ac2f0 length 80000042 status 80010042
luna kernel: eth0: Resetting the Tx ring pointer.

This wasn't too bad, rmmod and insmod helped. Some hours later the
black crash again, but again with no high load. In the afternoon another
crash. In the night again... etc...

We tried almost everything with the hardware:
changed the network cards,
changed the mainboard,
changed the Mylex card,
currently we're working on replacing the memory, but it isn't easy to
find another 1GB at this hour on the weekend...

[3.]
kernel, eepro100, e100, khttpd

[4.]
[5.]
-

[6.]
We tried to find a way to trigger the problems reliably, but couldn't find any :(

[7.]
[7.1]
Software (the 2.4.0 kernel)
-- Versions installed: (if some fields are empty or look
-- unusual then possibly you have very old versions)
Linux luna 2.4.0-test11 #2 SMP Thu Nov 23 16:27:38 CET 2000 i686 unknown
Kernel modules 2.3.11
Gnu C 2.95.2
Gnu Make 3.79
Binutils 2.9.5.0.46
Linux C Library 2.1.95
Dynamic linker ldd (GNU libc) 2.1.95
Procps 2.0.6
Mount 2.10f
Net-tools 2.05
Kbd 0.99
Sh-utils 2.0i
Modules Loaded eepro100 3c59x

[7.2]
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 1
cpu MHz : 596.000928
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse
bogomips : 1189.48

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 1
cpu MHz : 596.000928
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 3
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse
bogomips : 1192.76

[7.3]
modules:
eepro100 17076 1
3c59x 22900 1

[7.4]
Scsi:
Attached devices: none
(usually... sometimes one hard drive, but we removed it since we've been
experiencing the problems)
[7.5]
/proc/meminfo:
total: used: free: shared: buffers: cached:
Mem: 1053315072 1049874432 3440640 0 20615168 566947840
Swap: 1784807424 0 1784807424
MemTotal: 1028628 kB
MemFree: 3360 kB
MemShared: 0 kB
Buffers: 20132 kB
Cached: 553660 kB
Active: 263572 kB
Inact_dirty: 302560 kB
Inact_clean: 7660 kB
Inact_target: 6296 kB
HighTotal: 131008 kB
HighFree: 1484 kB
LowTotal: 897620 kB
LowFree: 1876 kB
SwapTotal: 1742976 kB
SwapFree: 1742976 kB
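
As a rough reading of the numbers above: most of the memory reported as
used appears to be buffers and page cache, and no swap is in use. The
following is a minimal sketch in Python, not part of the original report.
The helper names parse_meminfo and reclaimable_kb are hypothetical, and the
values are copied from the listing. The sum is only an upper bound on what
could be freed, since part of the cache may have to be written back first.

```python
# Values copied from the /proc/meminfo listing above (kB).
MEMINFO = """\
MemTotal: 1028628 kB
MemFree: 3360 kB
Buffers: 20132 kB
Cached: 553660 kB
SwapTotal: 1742976 kB
SwapFree: 1742976 kB
"""


def parse_meminfo(text):
    """Parse 'Key: value kB' lines into a dict of integers (kB)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip lines without a colon or without a leading number.
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def reclaimable_kb(fields):
    """Upper bound on memory that is free or held by buffers/cache (kB)."""
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]


info = parse_meminfo(MEMINFO)
print("free + buffers + cached:", reclaimable_kb(info), "kB")   # 577152 kB
print("swap in use:", info["SwapTotal"] - info["SwapFree"], "kB")  # 0 kB
```

So of the roughly 1GB total, about 577152 kB is free, buffers or cache at
the time of this snapshot, which was taken without the crash in progress.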

/proc/partitions:
major minor #blocks name

3 0 40020624 hda
3 1 2048256 hda1
3 2 498015 hda2
3 3 498015 hda3
3 4 1 hda4
3 5 497983 hda5
48 0 143302656 rd/c0d0
48 1 5124703 rd/c0d0p1
48 2 5124735 rd/c0d0p2
48 3 133050330 rd/c0d0p3
48 8 143441920 rd/c0d1
48 9 2000061 rd/c0d1p1
48 10 2996122 rd/c0d1p2
48 11 249007 rd/c0d1p3
48 12 1 rd/c0d1p4
48 13 995998 rd/c0d1p5
48 14 137195068 rd/c0d1p6

/proc/swaps:
Filename Type Size Used Priority
/dev/rd/c0d1p3 partition 248996 0 100
/dev/hda2 partition 498004 0 3
/dev/hda3 partition 498004 0 2
/dev/hda5 partition 497972 0 1

/proc/rd/c0/initial_status:
***** DAC960 RAID Driver Version 2.4.8 of 19 August 2000 *****
Copyright 1998-2000 by Leonard N. Zubkoff <[email protected]>
Configuring Mylex AcceleRAID 352 PCI RAID Controller
Firmware Version: 6.00-01, Channels: 2, Memory Size: 32MB
PCI Bus: 0, Device: 13, Function: 1, I/O Address: Unassigned
PCI Address: 0xF4106000 mapped at 0xF8800000, IRQ Channel: 17
Controller Queue Depth: 512, Maximum Blocks per Command: 2048
Driver Queue Depth: 511, Scatter/Gather Limit: 128 of 257 Segments
Physical Devices:
0:0 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009450931
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:1 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009354185
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:2 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009452126
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:3 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009355037
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:4 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009451877
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:6 Vendor: ESG-SHV Model: SCA HSBP M13 Revision: 0.02
Asynchronous
0:7 Vendor: MYLEX Model: AcceleRAID 352 Revision: 0600
Wide Synchronous at 160 MB/sec
Serial Number:
1:0 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8X278
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:1 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0230
Wide Synchronous at 40 MB/sec
Serial Number: 143026950130
Disk Status: Online, 71688192 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:2 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8Y122
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:3 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8R057
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:4 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8P717
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:6 Vendor: ESG-SHV Model: SCA HSBP M13 Revision: 0.02
Asynchronous
1:7 Vendor: MYLEX Model: AcceleRAID 352 Revision: 0600
Wide Synchronous at 160 MB/sec
Serial Number:
Logical Drives:
/dev/rd/c0d0: RAID-5, Online, 286605312 blocks
Logical Device Initialized, BIOS Geometry: 255/63
Stripe Size: 64KB, Segment Size: 8KB
Read Cache Disabled, Write Cache Disabled
/dev/rd/c0d1: RAID-5, Online, 286883840 blocks
Logical Device Initialized, BIOS Geometry: 255/63
Stripe Size: 64KB, Segment Size: 8KB
Read Cache Disabled, Write Cache Disabled
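
The logical drive sizes above are consistent with the usual RAID-5 rule of
thumb: usable capacity is (number of members - 1) times the smallest member.
A small sketch in Python, not part of the original report. The function name
raid5_usable_blocks is hypothetical, and the assignment of the five disks on
each channel to one logical drive is an assumption (the status output does
not state it); the block counts are copied from the listing.

```python
def raid5_usable_blocks(member_blocks):
    """Usable blocks of a RAID-5 set: (n - 1) * smallest member."""
    if len(member_blocks) < 3:
        raise ValueError("RAID-5 needs at least three members")
    return (len(member_blocks) - 1) * min(member_blocks)


# Channel 0: five QUANTUM disks of 71720960 blocks each.
channel0 = [71720960] * 5
# Channel 1: four IBM disks (71651328) and one QUANTUM disk (71688192).
channel1 = [71651328, 71688192, 71651328, 71651328, 71651328]

print(raid5_usable_blocks(channel0))  # 286883840, the size of /dev/rd/c0d1
print(raid5_usable_blocks(channel1))  # 286605312, the size of /dev/rd/c0d0

# /proc/partitions counts 1 KiB blocks, the controller 512-byte blocks.
print(raid5_usable_blocks(channel1) // 2)  # 143302656, as listed for rd/c0d0
```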

[8.]
Our server is used mainly as a web server. It has a heavy load, and
because of this we needed to make a few small modifications:
in /etc/initscript we did a ulimit -u 2048 so that users can
have more than 256 processes. This was important for apache; currently
even 1100 parallel apache processes sometimes aren't enough.
For this we also had to change NR_TASKS in include/linux/tasks.h
to 2048 and MAX_TASKS_PER_USER to NR_TASKS-512 (note: this
second change didn't help; only the ulimit allowed more processes per user).
Because the kernel sometimes ran out of open files, we also increase
file-max, inode-max and super-max in the /proc/sys/fs directory
at boot time.
These modifications may seem weird at first glance, but don't forget
that we have plenty of RAM, and the CPU usage usually stays around
30-40% even at the highest load (ok, only as long as we don't need to use
swap, but we try to avoid that whenever possible)
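
For readers who want to inspect the same limits from a script, here is a
minimal sketch in Python using the standard resource module. It is not what
the report used (that was ulimit -u 2048 in /etc/initscript, which sets both
the soft and the hard limit); this sketch only adjusts the soft limit of the
calling process. The helper names are hypothetical, and 2048 is the only
value taken from the report.

```python
import resource


def clamp_to_hard_limit(wanted, hard):
    """Return `wanted`, lowered to `hard` when the hard limit is finite and smaller."""
    if hard != resource.RLIM_INFINITY and wanted > hard:
        return hard
    return wanted


def set_nproc_soft_limit(wanted=2048):
    """Set this process's RLIMIT_NPROC soft limit; the hard limit is unchanged."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    new_soft = clamp_to_hard_limit(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NPROC, (new_soft, hard))
    return new_soft


def read_fs_limit(name, base="/proc/sys/fs"):
    """Read the first integer from an entry such as 'file-max' under `base`."""
    with open(f"{base}/{name}") as handle:
        return int(handle.read().split()[0])
```

Calling read_fs_limit("file-max") before and after the boot-time change is
one way to confirm that the new value actually took effect.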

[9.]
Ok, I know this is quite long for a bug report, but I hope it helps
to find the problem if I tell everything from the beginning. Please don't
suggest solutions like: use another computer too, split up your system,
etc., because we're working on that too. Any other ideas, hints, questions
are welcome. The main problem is that the computer crashes every 3-4
hours, and in the last few days it has spent more time on fsck than on
working.

If you need more detailed information, log files, etc., I'd be happy to
help.

Thanks in advance

Attila Nagy

+--------------------------------------------+
| Attila Nagy |
| mailto:[email protected] |
+--------------------------------------------+

2000-11-25 18:55:40

by Alan

Subject: Re: PROBLEM: crashing kernels

Subject: readonly /proc/sys/vm/freepages (was: Re: PROBLEM: crashing kernels)
