2000-11-25 18:48:50

by Mr. Big

Subject: PROBLEM: crashing kernels


[1.]
Kernels 2.2.14, 2.2.17, 2.4.0-test11 crash with various errors
(I know this is too simple, but this is what I could say in one line)

[2.]
We're running quite a busy site, and we update our servers' hardware quite
often. For a while now we've been experiencing many problems, which usually
end in a crash (meaning we need to press the hardware reset to reboot).

We ran kernel 2.2.14 for a long while, but after replacing the
mainboard and the Ethernet cards, the eepro100 driver (which was actually
compiled into the kernel, not built as a module) always gave these
errors:

eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
eth0: Trying to restart the transmitter...

In these cases we couldn't do anything other than restart the system
(yeah, I know it's quite a window$ solution, but ifconfig eth0 down/up
didn't help either) by pressing ctrl-alt-del. This problem occurred
almost once a day.

Ok, we thought maybe the driver isn't correct yet, and since we were a
little out of date with kernel 2.2.14 anyway, we decided to try a newer
one, first 2.2.16. This kernel didn't last long, because it also
crashed with errors like:
Oct 1 23:09:17 luna kernel: eth0: can't fill rx buffer (force 1)!
Oct 1 23:09:17 luna kernel: eth0: Tx ring dump, Tx queue 6270428 /
6270428:
Oct 1 23:09:17 luna kernel: eth0: 0 200ca000.
Oct 1 23:09:17 luna kernel: eth0: 1 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 2 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 3 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 4 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 5 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 6 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 7 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 8 200ca000.
Oct 1 23:09:17 luna kernel: eth0: 9 000ca000.
Oct 1 23:09:17 luna kernel: eth0: 10 000ca000.
Oct 1 23:09:17 luna kernel: eth0: Printing Rx ring (next to receive into
427603
9, dirty index 4

but in these cases there was no possibility of a clean reboot. We
don't much like fsck, so we tried 2.2.17. Same troubles as above.
We also tried increasing the TX_RING_SIZE and RX_RING_SIZE
values, but this didn't help either.

We got the information that these Intel EtherExpress Pro cards have
some bugs, so we got another driver from Intel, called e100, went
back to the good old 2.2.14 kernel, and tried this module.
Wow, a new type of error message:

kernel: e100 - Intel(R) 82559 Fast Ethernet LAN on Motherboard
kernel: eth0: Mem:0xf4102000 IRQ:21 Speed:100 Mbps Dx:Full
kernel: Hardware receive checksums enabled
...
kernel: eth0: rx_srv: no buffers left!!!

Ok, the module supports changing the buffer sizes; we tried increasing
them, but it didn't help.

After asking around we decided to throw out the EtherExpress Pros and
change to something that other people use. We couldn't do that completely,
because one of these cards is integrated into the mainboard, so we couldn't
get rid of it.

We also needed to make some other changes (replacing the Mylex card with a
Mylex AcceleRAID 352), so we replaced one of the EtherExpresses with a 3Com
Tornado at the same time. As the kernel says, it's a:
kernel: eth0: 3Com 3c905C Tornado at 0x2800, 00:01:02:b7:94:4b, IRQ 18
kernel: 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
kernel: MII transceiver found at address 24, status 782d.
kernel: Enabling bus-master transmits and whole-frame receives.

We needed to change to kernel 2.2.17 again, because it is the
only one that supports the Mylex 352.

This was the point where everything went downhill. We have 1GB of RAM in
the computer, but sometimes we run into swap. This usually led to a
server crash: the machine answered pings, but none of the ports responded.
On the console and in kern.log we got errors like these:
Nov 20 01:06:47 luna kernel: VM: do_try_to_free_pages failed for apache...
You could switch consoles, but nothing responded. Such things
happened before with other kernels, but there we simply took down the
Ethernet interface, the apache processes exited after a few minutes, and I
could log in on the console. But with 2.2.17 that doesn't help anymore; the
only thing we could do was reset :( Then of course 2-3 hours of fsck... nice...
We tried increasing the amount of swap (originally we had just 256MB; we
increased this to 1.7GB). Same shit happened again, and usually really
fast: often while I was working on the computer my ssh session suddenly
stopped, and by then it was too late...
I checked the kernel source and saw that vmscan.c was changed
between 2.2.14 and 2.2.17. Ok, maybe the author made some
mistake... But we couldn't go back to older kernels (because of the Mylex
card), so the only possibility was to go forward: we compiled the
2.4.0-test11 kernel. It could also be useful because of khttpd; at
least we could free up some memory used by apache.

The kernel compiled and the system booted. The first three or four hours
were without errors, and it also seemed a little faster. When the number
of apache processes began to grow, I decided to try khttpd. I
configured it and started it on port 8090 (8080 was in use by
another apache). Cool, it worked. I redirected some of the queries from
apache to khttpd. This worked fine too. Well, after a while it got
a little slow (even on the fast local network it took 3-4 seconds
until the port opened), but it was working, and the memory usage went
down. The only scary thing was that the ping times of the server got worse,
and the packet loss had been growing since I started using khttpd. Ok, I
thought, it's not so bad; the next day I'll look at the statistics and
decide whether to use khttpd or not, and now I can go to sleep... Of course
during the night the whole thing crashed: black screen, no console, no
keyboard LEDs, nothing, just hardware reset and fsck.
I thought khttpd was guilty, so I stopped using it. The next morning
it crashed again, now without khttpd, without high load, without high
memory usage; the 3Com driver just said:
kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another
device?
kernel: Flags; bus-master 1, full 0; dirty 112(0) current 112(0).
kernel: Transmit list 00000000 vs. f20ac200.
kernel: 0: @f20ac200 length 80000036 status 00010036
kernel: 1: @f20ac210 length 80000182 status 00010182
kernel: 2: @f20ac220 length 80000036 status 00010036
kernel: 3: @f20ac230 length 8000004a status 0001004a
kernel: 4: @f20ac240 length 80000337 status 00010337
kernel: 5: @f20ac250 length 80000036 status 00010036
kernel: 6: @f20ac260 length 800005ea status 000105ea
kernel: 7: @f20ac270 length 800005d3 status 000105d3
kernel: 8: @f20ac280 length 80000042 status 00010042
kernel: 9: @f20ac290 length 800005ea status 000105ea
kernel: 10: @f20ac2a0 length 800005e2 status 000105e2
kernel: 11: @f20ac2b0 length 800005ea status 000105ea
kernel: 12: @f20ac2c0 length 800005ea status 000105ea
kernel: 13: @f20ac2d0 length 8000024e status 0001024e
kernel: 14: @f20ac2e0 length 800005ea status 800105ea
kernel: 15: @f20ac2f0 length 80000042 status 80010042
luna kernel: eth0: Resetting the Tx ring pointer.

This wasn't too bad; rmmod and insmod helped. Some hours later the
black crash again, but again with no high load. In the afternoon, a crash
again. During the night, again... etc...

We tried almost everything with the hardware:
changed the network cards,
changed the mainboard,
changed the Mylex card,
currently we're working on changing the memory, but it isn't easy to
find another 1GB at this hour on the weekend...

[3.]
kernel, eepro100, e100, khttpd

[4.]
[5.]
-

[6.]
We kept trying to trigger the problems, but couldn't find any way to do so :(

[7.]
[7.1]
Software (the 2.4.0 kernel)
-- Versions installed: (if some fields are empty or look
-- unusual then possibly you have very old versions)
Linux luna 2.4.0-test11 #2 SMP Thu Nov 23 16:27:38 CET 2000 i686 unknown
Kernel modules 2.3.11
Gnu C 2.95.2
Gnu Make 3.79
Binutils 2.9.5.0.46
Linux C Library 2.1.95
Dynamic linker ldd (GNU libc) 2.1.95
Procps 2.0.6
Mount 2.10f
Net-tools 2.05
Kbd 0.99
Sh-utils 2.0i
Modules Loaded eepro100 3c59x

[7.2]
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 1
cpu MHz : 596.000928
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse
bogomips : 1189.48

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 1
cpu MHz : 596.000928
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 3
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse
bogomips : 1192.76

[7.3]
modules:
eepro100 17076 1
3c59x 22900 1

[7.4]
Scsi:
Attached devices: none
(usually... sometimes one hard drive, but we removed it since we've been
experiencing the problems)
[7.5]
/proc/meminfo:
total: used: free: shared: buffers: cached:
Mem: 1053315072 1049874432 3440640 0 20615168 566947840
Swap: 1784807424 0 1784807424
MemTotal: 1028628 kB
MemFree: 3360 kB
MemShared: 0 kB
Buffers: 20132 kB
Cached: 553660 kB
Active: 263572 kB
Inact_dirty: 302560 kB
Inact_clean: 7660 kB
Inact_target: 6296 kB
HighTotal: 131008 kB
HighFree: 1484 kB
LowTotal: 897620 kB
LowFree: 1876 kB
SwapTotal: 1742976 kB
SwapFree: 1742976 kB
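
As an illustration, the /proc/meminfo figures above can be summarised with a
short Python sketch. It only parses the text as pasted here (the helper name
parse_meminfo is made up for this example) and computes how little low memory
is free and how much swap is in use; it makes no claim about the cause of the
crashes.

```python
import re

# Excerpt of the /proc/meminfo dump above (2.4.0-test11 format, values in kB).
MEMINFO = """\
MemTotal: 1028628 kB
MemFree: 3360 kB
Buffers: 20132 kB
Cached: 553660 kB
HighTotal: 131008 kB
HighFree: 1484 kB
LowTotal: 897620 kB
LowFree: 1876 kB
SwapTotal: 1742976 kB
SwapFree: 1742976 kB
"""


def parse_meminfo(text):
    """Return {field: value_in_kB} for every 'Name: <n> kB' line."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+) kB$", line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


info = parse_meminfo(MEMINFO)

# Share of low (kernel-usable) memory that is still free.
low_free_pct = 100.0 * info["LowFree"] / info["LowTotal"]
# Swap actually in use at the time of the snapshot.
swap_used_kb = info["SwapTotal"] - info["SwapFree"]

print(f"LowFree: {info['LowFree']} kB ({low_free_pct:.2f}% of low memory)")
print(f"Swap used: {swap_used_kb} kB")
```

Almost all memory sits in buffers and page cache in this snapshot, so the
small MemFree value alone does not show memory pressure.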

/proc/partitions:
major minor #blocks name

3 0 40020624 hda
3 1 2048256 hda1
3 2 498015 hda2
3 3 498015 hda3
3 4 1 hda4
3 5 497983 hda5
48 0 143302656 rd/c0d0
48 1 5124703 rd/c0d0p1
48 2 5124735 rd/c0d0p2
48 3 133050330 rd/c0d0p3
48 8 143441920 rd/c0d1
48 9 2000061 rd/c0d1p1
48 10 2996122 rd/c0d1p2
48 11 249007 rd/c0d1p3
48 12 1 rd/c0d1p4
48 13 995998 rd/c0d1p5
48 14 137195068 rd/c0d1p6

/proc/swaps:
Filename Type Size Used Priority
/dev/rd/c0d1p3 partition 248996 0 100
/dev/hda2 partition 498004 0 3
/dev/hda3 partition 498004 0 2
/dev/hda5 partition 497972 0 1

/proc/rd/c0/initial_status:
***** DAC960 RAID Driver Version 2.4.8 of 19 August 2000 *****
Copyright 1998-2000 by Leonard N. Zubkoff <[email protected]>
Configuring Mylex AcceleRAID 352 PCI RAID Controller
Firmware Version: 6.00-01, Channels: 2, Memory Size: 32MB
PCI Bus: 0, Device: 13, Function: 1, I/O Address: Unassigned
PCI Address: 0xF4106000 mapped at 0xF8800000, IRQ Channel: 17
Controller Queue Depth: 512, Maximum Blocks per Command: 2048
Driver Queue Depth: 511, Scatter/Gather Limit: 128 of 257 Segments
Physical Devices:
0:0 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009450931
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:1 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009354185
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:2 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009452126
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:3 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009355037
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:4 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0200
Wide Synchronous at 80 MB/sec
Serial Number: 143009451877
Disk Status: Online, 71720960 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
0:6 Vendor: ESG-SHV Model: SCA HSBP M13 Revision: 0.02
Asynchronous
0:7 Vendor: MYLEX Model: AcceleRAID 352 Revision: 0600
Wide Synchronous at 160 MB/sec
Serial Number:
1:0 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8X278
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:1 Vendor: QUANTUM Model: ATLAS_V_36_SCA Revision: 0230
Wide Synchronous at 40 MB/sec
Serial Number: 143026950130
Disk Status: Online, 71688192 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:2 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8Y122
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:3 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8R057
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:4 Vendor: IBM Model: DDYS-T36950M Revision: S80D
Wide Synchronous at 40 MB/sec
Serial Number: 5FL8P717
Disk Status: Online, 71651328 blocks
Errors - Parity: 0, Soft: 0, Hard: 0, Misc: 1
Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
1:6 Vendor: ESG-SHV Model: SCA HSBP M13 Revision: 0.02
Asynchronous
1:7 Vendor: MYLEX Model: AcceleRAID 352 Revision: 0600
Wide Synchronous at 160 MB/sec
Serial Number:
Logical Drives:
/dev/rd/c0d0: RAID-5, Online, 286605312 blocks
Logical Device Initialized, BIOS Geometry: 255/63
Stripe Size: 64KB, Segment Size: 8KB
Read Cache Disabled, Write Cache Disabled
/dev/rd/c0d1: RAID-5, Online, 286883840 blocks
Logical Device Initialized, BIOS Geometry: 255/63
Stripe Size: 64KB, Segment Size: 8KB
Read Cache Disabled, Write Cache Disabled



[8.]
Our server is used mainly as a web server. It has a heavy load, and
because of this we needed to make some small modifications:
in /etc/initscript we did a ulimit -u 2048 so that users could
have more than 256 processes. This was important for apache; currently
even 1100 parallel apache processes sometimes aren't enough.
Also for this we had to change NR_TASKS in include/linux/tasks.h
to 2048 and MAX_TASKS_PER_USER to NR_TASKS-512 (note: this
second one didn't help to get more processes per user, only the ulimit did)
Because the kernel sometimes ran out of open files, we also increase
file-max, inode-max and super-max in the /proc/sys/fs directory
at boot time.
These modifications may seem weird at first look, but don't forget
that we have plenty of RAM, and the CPU usage usually stays around
30-40% even at the highest load (ok, only as long as we don't need to use
the swap, but we try to avoid that when possible)
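
A minimal sketch of the boot-time /proc/sys/fs tuning described above, in
Python. The numbers in EXAMPLE_LIMITS are hypothetical (the report does not
say which values were used), and the helper names are made up for this
example. The sketch writes to a scratch directory instead of the real
/proc/sys/fs, so it can be run without root and without changing anything.

```python
import os
import tempfile

# Hypothetical values for illustration only; the report does not say
# which numbers were actually written at boot.
EXAMPLE_LIMITS = {"file-max": 16384, "inode-max": 65536, "super-max": 512}


def set_fs_limits(limits, proc_fs_dir="/proc/sys/fs"):
    """Write each limit to <proc_fs_dir>/<name>, like `echo N > file`."""
    for name, value in limits.items():
        with open(os.path.join(proc_fs_dir, name), "w") as handle:
            handle.write(f"{value}\n")


def read_fs_limits(names, proc_fs_dir="/proc/sys/fs"):
    """Read the named limits back as integers."""
    result = {}
    for name in names:
        with open(os.path.join(proc_fs_dir, name)) as handle:
            result[name] = int(handle.read().split()[0])
    return result


# Dry run against a scratch directory, so the real system is untouched.
with tempfile.TemporaryDirectory() as scratch:
    set_fs_limits(EXAMPLE_LIMITS, scratch)
    readback = read_fs_limits(EXAMPLE_LIMITS, scratch)

print(readback)
```

On a real 2.2 system the same call without the scratch directory would need
root, and the files must exist in that kernel's /proc/sys/fs.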

[9.]
Ok, I know this is quite long for a bug report, but I hope it helps more
in finding the problem if I tell everything from the beginning. Please don't
suggest solutions like: use another computer too, split up your system, etc.,
because we're working on that too. Any other ideas, hints, questions
are welcome. The main problem is that the computer crashes every 3-4
hours, and in the last days it spent more time on fsck than on
working.

If you need more details, log files, etc., I'd be happy to
help.

Thanks in advance

Attila Nagy

+--------------------------------------------+
| Attila Nagy |
| mailto:[email protected] |
+--------------------------------------------+




2000-11-25 18:55:40

by Alan

Subject: Re: PROBLEM: crashing kernels

> benn compiled into the kernel, and not as a module) always gave the
> errors:
>
> eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
> eth0: Trying to restart the transmitter...

Known problem. This one might be fixed in current 2.2.18pre. Some people
see it, some don't

> mistake... But we couldn't go back to oldier kernels (because of the Mylex
> card) so the only possibility is to go forward: we compiled the
> 2.4.0-test11 kernel. It could be usefull also because of the khttpd, at
> least we could free up some memory used by the apache.

You can copy the 2.2.17 updated mylex driver into 2.2.14 and rebuild a kernel
that way. In fact that would be a good test

I'd also be interested to know if 2.2.17 + Rik's vm patch (or + Andrea's
vm patch) is stable. (Rik and Andrea have differing views how to fix it but
both claim they have 8))

2000-11-25 19:15:19

by Mr. Big

Subject: Re: PROBLEM: crashing kernels


> > benn compiled into the kernel, and not as a module) always gave the
> > errors:
> >
> > eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
> > eth0: Trying to restart the transmitter...
>
> Known problem. This one might be fixed in current 2.2.18pre. SOme people
> see it some dont
Ok, we're trying to avoid using that Ethernet card anyway

>
> > mistake... But we couldn't go back to oldier kernels (because of the Mylex
> > card) so the only possibility is to go forward: we compiled the
> > 2.4.0-test11 kernel. It could be usefull also because of the khttpd, at
> > least we could free up some memory used by the apache.
>
> You can copy the 2.2.17 updated mylex driver into 2.2.14 and rebuild a kernel
> that way. In fact that would be a good test

Ok, I'll try. Since I'm not a kernel developer, I have a foolish question:
is it enough to overwrite the .c and .h files in the 2.2.14 source
tree, or do I need to apply other changes too?


+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+

2000-11-25 19:17:59

by Alan

Subject: Re: PROBLEM: crashing kernels

> Ok, I'll try. Since I'm not a kernel developer, I have the fool question,
> wether it is enough to overwrite the .c and .h files in the 2.2.14 source
> tree, or do I need to apply other changes too?

I believe that is all you need to do for that driver

2000-11-25 20:54:44

by Eric W. Biederman

Subject: Re: PROBLEM: crashing kernels

Alan Cox <[email protected]> writes:

> > benn compiled into the kernel, and not as a module) always gave the
> > errors:
> >
> > eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
> > eth0: Trying to restart the transmitter...
>
> Known problem. This one might be fixed in current 2.2.18pre. SOme people
> see it some dont

I have another data point on this problem.
I have seen it most with 2.4.0-test9. But I'll look at 2.2.18pre.
I can trigger this bug fairly reliably by warm booting several times
in a row. With my Linux, warm booting directly into Linux code triggers this
one fairly reliably :) Also putting another NIC in seems to help
trigger it as well.

The 2.4.0-testxxx watchdog seems eventually to handle this case
but it takes 1/2 hour or so to actually kick in and reset the card.

Eric

2000-11-25 21:36:47

by Mr. Big

Subject: Re: PROBLEM: crashing kernels


> Alan Cox <[email protected]> writes:
>
> > > benn compiled into the kernel, and not as a module) always gave the
> > > errors:
> > >
> > > eth0: Transmit timed out: status 0050 0090 at 134704418/134704432
> > > eth0: Trying to restart the transmitter...
> >
> > Known problem. This one might be fixed in current 2.2.18pre. SOme people
> > see it some dont
>
> I have another data point on this problem.
> I have seen it most with 2.4.0-test9. But I'll look at 2.2.18pre.
> I can trigger this bug fairly reliably by warm booting, several times
> in a row. With my linux warm booting directly into linux code triggers this
> one fairly reliably :) Also putting another nick in seems to help
> trigger it as well.

Ok. I won't use that card anymore, and won't compile that part of the code
either. But I still don't know why my 2.4.0-test11 crashes to black:
no console, no keyboard, no logs, nothing... like a power cut.
Just some minutes ago I saw another interesting thing (with the eepro100
driver):
I was logged in through ssh. About 5-10 minutes before the crash the packet
loss jumped to around 30% (the network was ok; the other hosts on the
same network showed 0% loss). Then after a while there was no answer to
ping, but my ssh still worked, and I could even log in. So it seems that
something in the network driver died, and it no longer answered
ICMP requests.
On the console there was nothing again. When the guy pressed ctrl-alt-del,
the normal reboot message came to the ssh terminals, but nothing happened
on the console or on the computer (no hard drive activity, no LEDs on the
keyboard)

For the confusion of the services maybe libc6 could be blamed. But I
still don't understand why the packet loss rose, and why the console wasn't
working.

And a note for Alan: it isn't so simple to hack the Mylex driver from
2.2.17 into 2.2.14... I'm currently trying it...


+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+

2000-11-26 00:20:18

by Andrew Morton

Subject: Re: PROBLEM: crashing kernels

Nice report. Wish they were all like that.

Look:

"Mr. Big" wrote:
>
> I thought that the khttpd is guilty, I won't use it anymore. Next morning
> it crashed again, now without khttpd, without high load, without high
> memory usage, just the 3Com driver said:
> kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is caused by the APIC(s) forgetting how to deliver interrupts
for a particular IRQ. Normally, reloading the driver doesn't help.
But in your case it did. This is odd.

It looks like the same thing is happening with the eepro NIC as
well. But the eepro diagnostics aren't as informative when this
happens.

This can normally be prevented by booting with the `noapic' option
on the LILO command line.

So I suggest you stick with the 3c905C and boot with `noapic'. Try
the eepro as well - it may well work fine with the APIC disabled.

You may also get some benefit from running:

# echo "512 1024 1536" > /proc/sys/vm/freepages

after booting. The default values are a little too low for
applications which are very network intensive.
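
The three numbers in that echo are page counts. A minimal Python sketch of
what they amount to in kilobytes, assuming the 4 KB i386 page size and
assuming the three fields are the min, low and high free-page thresholds
as on 2.2 kernels (the helper name freepages_to_kb is made up for this
example):

```python
PAGE_SIZE_KB = 4  # i386 page size; an assumption for other architectures


def freepages_to_kb(setting):
    """Convert a 'min low high' page-count string into kilobytes."""
    return [int(pages) * PAGE_SIZE_KB for pages in setting.split()]


thresholds_kb = freepages_to_kb("512 1024 1536")

# Assumed field order: min, low, high.
for name, kilobytes in zip(("min", "low", "high"), thresholds_kb):
    print(f"{name}: {kilobytes} kB ({kilobytes / 1024:.0f} MB)")
```

So the suggested setting keeps a few megabytes free on a 1GB machine.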

2000-11-26 13:01:39

by Ingo Oeser

Subject: readonly /proc/sys/vm/freepages (was: Re: PROBLEM: crashing kernels)

On Sun, Nov 26, 2000 at 10:49:50AM +1100, Andrew Morton wrote:
> You may also get some benefit from running:
>
> # echo "512 1024 1536" > /proc/sys/vm/freepages
>
> after booting.

... which is a NOOP on recent 2.4.0-testX-kernels. So please
complain at Rik for this (CC'ed him) ;-)

It's simply not that easy to set in the new VM since we count the
inactive_clean and/or inactive_dirty pages like free pages in
some cases.

> The default values are a little too low for
> applications which are very network intensive.

Especially for low memory machines, which are dedicated only for
this purpose. Many people use (embedded) Linux and a (embedded)
PC to cheaply fill functionality gaps in industrial environments.

Regards

Ingo Oeser
--
To the systems programmer, users and applications
serve only to provide a test load.
<esc>:x

2000-11-26 16:37:38

by Mr. Big

Subject: Re: PROBLEM: crashing kernels


Yesterday I followed Alan's hint and built a 2.2.14 kernel with
the DAC960 driver taken from 2.2.17. For this I copied
these files:
drivers/block/DAC960.c
drivers/block/DAC960.h
drivers/pci/oldproc.c
include/linux/pci.h
from the 2.2.17 source tree into 2.2.14. Ok, it compiled.
We also borrowed another mainboard, CPUs and RAM for testing. We swapped
them all, but the borrowed RAM was just 512MB, so we had to cut back hard
on the services to fit into it. I think I didn't mention yet that
we're using an Intel L440GX+ mainboard. For the CPUs see my original
mail. The new CPUs were the same speed (550MHz), but Katmais instead of
Coppermines.
We also removed the Eepros, leaving just a 3Com and a D-Link card. The
latter isn't the best, but with its own driver it has been working in
another computer with the 2.2.14 kernel for months without any trouble.

The kernel booted. Only the 3Com driver threw some errors and warned
that the checksum wasn't good, so I decided to take that driver from
2.2.17 as well, since it is much newer. For this I copied
drivers/net/3c59x.c

For the boot I also followed Andrew's hint and booted with the noapic
option.
This way 2.2.14 worked well the whole
night.

In the afternoon we decided to put back the original mainboard+RAM+CPUs.
We booted the kernel described above.
Only then did I notice that all the interrupts went to CPU1, while
CPU0 didn't get any. Is this because of the noapic option?

The worst is that after two hours came the same black crash again (no ping,
no console, no keyboard LEDs).

I believe we face some kind of hardware problem, or some hardware-specific
software problem. Any idea which of the hardware components could be bad,
or which are badly supported by some drivers?

> This is caused by the APIC(s) forgetting how to deliver interrupts
> for a particular IRQ. Normally, reloading the driver doesn't help.
> But in your case it did. This is odd.

How could an APIC 'forget' how to deliver the interrupts? Could this mean
a problem with the mainboard, or with the CPU?

Thanks for Your help.

+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+



2000-11-26 19:07:36

by Rik van Riel

Subject: Re: readonly /proc/sys/vm/freepages (was: Re: PROBLEM: crashing kernels)

On Sun, 26 Nov 2000, Ingo Oeser wrote:
> On Sun, Nov 26, 2000 at 10:49:50AM +1100, Andrew Morton wrote:
> > You may also get some benefit from running:
> >
> > # echo "512 1024 1536" > /proc/sys/vm/freepages
> >
> > after booting.
>
> ... which is a NOOP on recent 2.4.0-testX-kernels. So please
> complain at Rik for this (CC'ed him) ;-)

I wasn't aware I studied at tu-chemnitz ;)

> It's simply not that easy to set in the new VM since we count
> the inactive_clean and/or inactive_dirty pages like free pages
> in some cases.

And also, because HIGHMEM pages are not at all usable for kernel
things, so simply reserving 20MB for network bursts isn't going
to help you when it's all in highmem pages ...

> > The default values are a little too low for
> > applications which are very network intensive.
>
> Especially for low memory machines, which are dedicated only for
> this purpose. Many people use (embedded) Linux and a (embedded)
> PC to cheaply fill functionality gaps in industrial
> environments.

Indeed, I agree that we want this tunable back...

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-11-27 17:22:10

by Mr. Big

Subject: Re: PROBLEM: crashing kernels


> > In the afternoon we decided to put back the original mainboard+ram+cpu.
> > We booted the kernel described above.
>
> With `noapic', I assume?
Yes, of course

>
> It could be hardware or a driver or whatever. Suggest you
> go to a more recent kernel and if the problems persist,
> swap hardware out. Power supply, memory, CPUs, etc.
>
...
>
> We don't know. It doesn't correlate with any particular chipset.
> Could be a hardware bug, a Linux bug or a chip errata which we don't
> know about.
>

Another crash, with this error message:
Kernel Panic: skput:over: a00f8d9b: 1526 put: 66 dev: eth1
In interrupt handler - not syncing

Because we have SysRq compiled in, we tried the Alt + SysRq + u
combination, to at least unmount the partitions. After a big dump of hex
numbers we got this:
Aiee, killing interrupt handler
Unable to handle kernel NULL pointer dereference

eth1 is a D-Link card; we use a driver from the card's manufacturer. We
have used this type of card in another computer for months, with the same
(2.2.14) kernel, and haven't experienced any problems so far. Of course I
compiled the module on the same computer where it runs, and where
the kernel was also compiled and runs.

The two cards in the two computers also have the same load (because they
are connected with a crossover cable ;)

So I suppose it's not the fault of the network driver this time. I still
believe it's somewhere around the APIC.

I hope I could give you some more information


+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+


2000-11-27 18:03:26

by Maciej W. Rozycki

Subject: Re: PROBLEM: crashing kernels

On Sun, 26 Nov 2000, Mr. Big wrote:

> How could an APIC 'forget' how to deliver the interrupts? Could this mean
> a problem with the mainboard, or with the CPU?

Do you have a USB host controller in your system? If so, could you
please send me an output of `/sbin/lspci' and the contents of
/proc/interrupts? I wonder if this might be the reason...

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2000-11-27 18:08:46

by Mr. Big

Subject: Re: PROBLEM: crashing kernels


> > How could an APIC 'forget' how to deliver the interrupts? Could this mean
> > a problem with the mainboard, or with the CPU?
>
> Do you have an USB host controller in your system? If so, could you
> please send me an output of `/sbin/lspci' and the contents of
> /proc/interrupts? I wonder if this might be the reason...

Nope, we also disabled all unneeded peripherals, like the serial and
parallel ports.

But maybe /proc/interrupts could be useful:
CPU0 CPU1
0: 413111 0 XT-PIC timer
1: 687 0 XT-PIC keyboard
2: 0 0 XT-PIC cascade
7: 751660 0 XT-PIC eth1
10: 7377626 0 XT-PIC eth0
11: 238981 0 XT-PIC Mylex AcceleRAID 352, aic7xxx, aic7xxx
13: 1 0 XT-PIC fpu
14: 10 0 XT-PIC ide0
NMI: 0
ERR: 0

This is how it looks now that we have the APIC disabled. I tried to
keep the Mylex and the Adaptec (aic7xxx) from getting the same IRQ, but the
BIOS was too lame for such things :(
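
For illustration, the shared IRQ can be picked out of the listing above with
a small Python sketch. It assumes the two-CPU column layout shown here
(two counters, then the controller name, then a comma-separated device
list); the helper name shared_irqs is made up for this example.

```python
# The /proc/interrupts listing above, as pasted (two CPU columns).
INTERRUPTS = """\
CPU0 CPU1
0: 413111 0 XT-PIC timer
1: 687 0 XT-PIC keyboard
2: 0 0 XT-PIC cascade
7: 751660 0 XT-PIC eth1
10: 7377626 0 XT-PIC eth0
11: 238981 0 XT-PIC Mylex AcceleRAID 352, aic7xxx, aic7xxx
13: 1 0 XT-PIC fpu
14: 10 0 XT-PIC ide0
NMI: 0
ERR: 0
"""


def shared_irqs(text):
    """Return {irq: [devices]} for IRQ lines with more than one device."""
    shared = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # skip the CPU header and the NMI/ERR summary lines
        # Assumed layout: CPU0 count, CPU1 count, controller, device list.
        parts = rest.split(None, 3)
        if len(parts) < 4:
            continue
        devices = [device.strip() for device in parts[3].split(",")]
        if len(devices) > 1:
            shared[int(head)] = devices
    return shared


print(shared_irqs(INTERRUPTS))
```

Here only IRQ 11 is shared, between the Mylex controller and the two
aic7xxx channels; eth0 and eth1 each have a line to themselves.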

+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+

2000-11-27 18:45:27

by Roger Larsson

Subject: Re: readonly /proc/sys/vm/freepages (was: Re: PROBLEM: crashing kernels)

On Sunday 26 November 2000 19:36, Rik van Riel wrote:
> On Sun, 26 Nov 2000, Ingo Oeser wrote:
> > On Sun, Nov 26, 2000 at 10:49:50AM +1100, Andrew Morton wrote:
> > > You may also get some benefit from running:
> > >
> > > # echo "512 1024 1536" > /proc/sys/vm/freepages
> > >
> > > after booting.
> >
> > ... which is a NOOP on recent 2.4.0-testX-kernels. So please
> > complain at Rik for this (CC'ed him) ;-)
>
> I wasn't aware I studied at tu-chemnitz ;)
>
> > It's simply not that easy to set in the new VM since we count
> > the inactive_clean and/or inactive_dirty pages like free pages
> > in some cases.
>
> And also, because HIGHMEM pages are not at all usable for kernel
> things, so simply reserving 20MB for network bursts isn't going
> to help you when it's all in highmem pages ...

Should the
echo "512 1024 1536" > /proc/sys/vm/freepages
apply only to DMA pages?
(It would work correctly with <16 M machines, and probably ok with others)

Sidenote:
Can you build a clean x86 computer that does not especially care about
DMA-able pages? (no ISA cards, no memory-limited PCI cards, etc...)
Would it then be nice to be able to remove that zone completely?
(like we can with the HIGHMEM today)

>
> > > The default values are a little too low for
> > > applications which are very network intensive.
> >
> > Especially for low memory machines, which are dedicated only for
> > this purpose. Many people use (embedded) Linux and a (embedded)
> > PC to cheaply fill functionality gaps in industrial
> > environments.
>
> Indeed, I agree that we want this tunable back...
>

/RogerL

--
Home page:
http://www.norran.net/nra02596/

2000-11-27 19:35:37

by Maciej W. Rozycki

Subject: Re: PROBLEM: crashing kernels

On Mon, 27 Nov 2000, Mr. Big wrote:

> But maybe the /proc/interrupts could be usefull:
> CPU0 CPU1
> 0: 413111 0 XT-PIC timer
> 1: 687 0 XT-PIC keyboard
> 2: 0 0 XT-PIC cascade
> 7: 751660 0 XT-PIC eth1
> 10: 7377626 0 XT-PIC eth0
> 11: 238981 0 XT-PIC Mylex AcceleRAID 352, aic7xxx, aic7xxx
> 13: 1 0 XT-PIC fpu
> 14: 10 0 XT-PIC ide0
> NMI: 0
> ERR: 0

Hmm, nothing special. Getting this in the APIC mode would possibly be
more useful. Isn't there anything that's sharing the IRQ with eth0 that's
unhandled? An unhandled onboard device? What IRQs are reported by
`/sbin/lspci -v'?

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2000-11-27 20:05:27

by Mr. Big

Subject: Re: PROBLEM: crashing kernels

On Mon, 27 Nov 2000, Maciej W. Rozycki wrote:

> On Mon, 27 Nov 2000, Mr. Big wrote:
>
> > But maybe the /proc/interrupts could be usefull:
> > CPU0 CPU1
> > 0: 413111 0 XT-PIC timer
> > 1: 687 0 XT-PIC keyboard
> > 2: 0 0 XT-PIC cascade
> > 7: 751660 0 XT-PIC eth1
> > 10: 7377626 0 XT-PIC eth0
> > 11: 238981 0 XT-PIC Mylex AcceleRAID 352, aic7xxx, aic7xxx
> > 13: 1 0 XT-PIC fpu
> > 14: 10 0 XT-PIC ide0
> > NMI: 0
> > ERR: 0
>
> Hmm, nothing special. Getting this in the APIC mode would possibly be
> more useful. Isn't there anything that's sharing the IRQ with eth0 that's
> unhandled? An unhandled onboard device? What IRQs are reported by
> `/sbin/lspci -v'?

We've disabled the APIC because there was a hint that there may be some
bug in the hardware or software related to it. I believe it would be
better to use the APIC.

The output of lspci -v:

00:00.0 Host bridge: Intel Corporation 440GX - 82443GX Host bridge
Flags: bus master, medium devsel, latency 64
Memory at f8000000 (32-bit, prefetchable)
Capabilities: [a0] AGP version 1.0

00:01.0 PCI bridge: Intel Corporation 440GX - 82443GX AGP bridge (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, medium devsel, latency 64
Bus: primary=00, secondary=01, subordinate=02, sec-latency=64

00:0b.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
Subsystem: 3Com Corporation 3C905C-TX Fast Etherlink for PC Management NIC
Flags: bus master, medium devsel, latency 64, IRQ 10
I/O ports at 2080
Memory at f4101000 (32-bit, non-prefetchable)
Capabilities: [dc] Power Management version 2

00:0c.0 SCSI storage controller: Adaptec 7896
Subsystem: Adaptec: Unknown device 0053
Flags: bus master, medium devsel, latency 64, IRQ 11
BIST result: 00
I/O ports at 2400
Memory at f4102000 (64-bit, non-prefetchable)
Capabilities: [dc] Power Management version 1

00:0c.1 SCSI storage controller: Adaptec 7896
Subsystem: Adaptec: Unknown device 0053
Flags: bus master, medium devsel, latency 64, IRQ 11
BIST result: 00
I/O ports at 2800
Memory at f4103000 (64-bit, non-prefetchable)
Capabilities: [dc] Power Management version 1

00:0d.0 PCI bridge: Intel Corporation: Unknown device 0964 (rev 02) (prog-if 00 [Normal decode])
Flags: bus master, medium devsel, latency 64
Bus: primary=00, secondary=03, subordinate=03, sec-latency=64

00:0d.1 RAID bus controller: Mylex Corporation: Unknown device 0050 (rev 02)
Subsystem: Mylex Corporation: Unknown device 0050
Flags: bus master, medium devsel, latency 64, IRQ 11
Memory at f4106000 (32-bit, prefetchable)
Capabilities: [80] Power Management version 2

00:0e.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
Subsystem: Intel Corporation 82559 Fast Ethernet LAN on Motherboard
Flags: bus master, medium devsel, latency 64, IRQ 5
Memory at f4100000 (32-bit, non-prefetchable)
I/O ports at 2000
Memory at f4000000 (32-bit, non-prefetchable)
Capabilities: [dc] Power Management version 2

00:10.0 Ethernet controller: D-Link System Inc: Unknown device 1002
Subsystem: D-Link System Inc: Unknown device 1002
Flags: bus master, medium devsel, latency 64, IRQ 7
I/O ports at 2c00
Memory at f4101400 (32-bit, non-prefetchable)
Capabilities: [50] Power Management version 1

00:12.0 ISA bridge: Intel Corporation 82371AB PIIX4 ISA (rev 02)
Flags: bus master, medium devsel, latency 0

00:12.1 IDE interface: Intel Corporation 82371AB PIIX4 IDE (rev 01) (prog-if 80 [Master])
Flags: bus master, medium devsel, latency 64
I/O ports at 2040

00:12.2 USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01) (prog-if 00 [UHCI])
Flags: bus master, medium devsel, latency 64, IRQ 5
I/O ports at 2060

00:12.3 Bridge: Intel Corporation 82371AB PIIX4 ACPI (rev 02)
Flags: medium devsel

00:14.0 VGA compatible controller: Cirrus Logic GD 5480 (rev 23) (prog-if 00 [VGA])
Subsystem: Cirrus Logic CL-GD5480
Flags: bus master, medium devsel, latency 64
Memory at f6000000 (32-bit, prefetchable)
Memory at f4104000 (32-bit, non-prefetchable)

01:0f.0 PCI bridge: Digital Equipment Corporation DECchip 21150 (rev 06) (prog-if 00 [Normal decode])
Flags: bus master, fast Back2Back, 66Mhz, medium devsel, latency 240
Bus: primary=01, secondary=02, subordinate=02, sec-latency=68
Capabilities: [dc] Power Management version 1



+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+

2000-11-28 09:21:49

by Maciej W. Rozycki

Subject: Re: PROBLEM: crashing kernels

On Mon, 27 Nov 2000, Mr. Big wrote:

> We've disabled the apic, because there was a hint, that maybe there's some
> bug with the hardware or software on it. I belive that it's could be
> better to use the apic.
>
> The output of lspci -v:
[...]
> 00:0e.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
> Subsystem: Intel Corporation 82559 Fast Ethernet LAN on Motherboard
> Flags: bus master, medium devsel, latency 64, IRQ 5

Hmm, this is the device you initially reported a problem with, isn't
it? If it is, then...

> 00:12.2 USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01) (prog-if 00 [UHCI])
> Flags: bus master, medium devsel, latency 64, IRQ 5

... it shares its IRQ with a USB host adapter as I suspected. And you
don't have an USB driver installed. Does the following patch help? (Hmm,
since you tested 2.4.0-test* as well -- it might not as it's just a
backport... Then again -- you might hit a different problem with
2.4.0-test*.)
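
The check behind this observation can be sketched in user space. This is
only a minimal sketch, assuming the IRQ numbers are copied by hand from the
`lspci -v' output quoted above. The table, the short device labels and both
helper names (count_on_irq, report_shared_irqs) are made up for
illustration and are not part of any kernel or pciutils interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One PCI function and the IRQ that `lspci -v' reports for it. */
struct pci_irq {
    const char *slot;   /* bus:device.function, as printed by lspci */
    const char *name;   /* short label, chosen here for readability */
    int irq;
};

/* Values copied from the lspci -v output quoted in the thread. */
const struct pci_irq devices[] = {
    { "00:0b.0", "3Com 3c905C-TX",  10 },
    { "00:0c.0", "Adaptec 7896",    11 },
    { "00:0c.1", "Adaptec 7896",    11 },
    { "00:0d.1", "Mylex RAID",      11 },
    { "00:0e.0", "Intel 82557",      5 },
    { "00:10.0", "D-Link Ethernet",  7 },
    { "00:12.2", "PIIX4 USB",        5 },
};

#define NDEVICES (sizeof(devices) / sizeof(devices[0]))

/* Count how many entries in the table use the given IRQ. */
size_t count_on_irq(const struct pci_irq *tab, size_t n, int irq)
{
    size_t i, hits = 0;

    assert(tab != NULL);
    for (i = 0; i < n; i++)
        if (tab[i].irq == irq)
            hits++;
    return hits;
}

/*
 * Print every IRQ that has more than one device on it and
 * return the number of such IRQs.
 */
int report_shared_irqs(const struct pci_irq *tab, size_t n, FILE *out)
{
    int irq, shared = 0;
    size_t i;

    for (irq = 0; irq < 16; irq++) {    /* XT-PIC mode: IRQ 0..15 */
        if (count_on_irq(tab, n, irq) < 2)
            continue;
        shared++;
        fprintf(out, "IRQ %d shared by:\n", irq);
        for (i = 0; i < n; i++)
            if (tab[i].irq == irq)
                fprintf(out, "  %s %s\n", tab[i].slot, tab[i].name);
    }
    return shared;
}
```

Calling report_shared_irqs(devices, NDEVICES, stdout) on this table lists
IRQ 5 (Intel 82557 and PIIX4 USB) and IRQ 11 (both Adaptec channels and the
Mylex). It says nothing about which of those devices has a driver loaded;
that still has to be checked separately.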

It's not impossible for an I/O APIC to lose an EOI message if there are
severe errors during the transmission -- since you already tried
2.4.0-test*: have you seen any APIC errors in the syslog?

Maciej

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

diff -up --recursive --new-file linux-2.2.17.macro/drivers/pci/quirks.c linux-2.2.17/drivers/pci/quirks.c
--- linux-2.2.17.macro/drivers/pci/quirks.c Wed Oct 27 00:53:40 1999
+++ linux-2.2.17/drivers/pci/quirks.c Fri Oct 20 10:33:01 2000
@@ -144,6 +144,26 @@ __initfunc(static void quirk_isa_dma_han
}
}

+/*
+ * PIIX3 USB: We have to disable USB interrupts that are
+ * hardwired to PIRQD# and may be shared with an
+ * external device.
+ *
+ * Legacy Support Register (LEGSUP):
+ * bit13: USB PIRQ Enable (USBPIRQDEN),
+ * bit4: Trap/SMI On IRQ Enable (USBSMIEN).
+ *
+ * We mask out all r/wc bits, too.
+ */
+__initfunc(static void quirk_piix3_usb(struct pci_dev *dev, int arg))
+{
+	u16 legsup;
+
+	pci_read_config_word(dev, 0xc0, &legsup);
+	legsup &= 0x50ef;
+	pci_write_config_word(dev, 0xc0, legsup);
+}
+

typedef void (*quirk_handler)(struct pci_dev *, int);

@@ -202,6 +222,8 @@ static struct quirk_info quirk_list[] __
*/
{ PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C586_0, quirk_isa_dma_hangs, 0x00 },
{ PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C596_0, quirk_isa_dma_hangs, 0x00 },
+ { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371SB_2, quirk_piix3_usb, 0x00 },
+ { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371AB_2, quirk_piix3_usb, 0x00 },
};

__initfunc(void pci_quirks_init(void))
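
To see what the 0x50ef mask in the patch above actually clears, the
arithmetic can be tried outside the kernel. This is a minimal user-space
sketch; the macro and function names are invented for illustration, and
only the mask value and the two bit positions come from the patch comment.

```c
#include <assert.h>
#include <stdint.h>

#define LEGSUP_USBPIRQDEN (1u << 13)   /* USB PIRQ Enable, per the patch comment */
#define LEGSUP_USBSMIEN   (1u << 4)    /* Trap/SMI On IRQ Enable, per the patch comment */
#define LEGSUP_KEEP_MASK  0x50efu      /* mask value used by the patch */

/* Apply the same mask as quirk_piix3_usb() to a LEGSUP value. */
uint16_t legsup_quiesce(uint16_t legsup)
{
    return (uint16_t)(legsup & LEGSUP_KEEP_MASK);
}

/* Return the bits the mask clears (its complement within 16 bits). */
uint16_t legsup_cleared_bits(void)
{
    return (uint16_t)~LEGSUP_KEEP_MASK;
}

/* Self-check: both enable bits are off after masking an all-ones value. */
void legsup_selfcheck(void)
{
    uint16_t v = legsup_quiesce(0xffff);

    assert((v & LEGSUP_USBPIRQDEN) == 0);
    assert((v & LEGSUP_USBSMIEN) == 0);
}
```

The complement of the mask is 0xaf10, i.e. bits 15, 13, 11..8 and 4. Bits
13 and 4 are the two enables named in the comment; the remaining ones are
presumably the r/wc bits the comment says are masked out as well.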

2000-11-28 19:12:20

by Mr. Big

Subject: Re: PROBLEM: crashing kernels


> > The output of lspci -v:
> [...]
> > 00:0e.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
> > Subsystem: Intel Corporation 82559 Fast Ethernet LAN on Motherboard
> > Flags: bus master, medium devsel, latency 64, IRQ 5
>
> Hmm, this is the device you reported you have a problem initially, isn't
> it? If it is, then...
>
> > 00:12.2 USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01) (prog-if 00 [UHCI])
> > Flags: bus master, medium devsel, latency 64, IRQ 5
>
> ... it shares its IRQ with an USB host adapter as I suspected. And you
> don't have an USB driver installed. Does the following patch help? (Hmm,
> since you tested 2.4.0-test* as well -- it might not as it's just a
> backport... Then again -- you might hit a different problem with
> 2.4.0-test*.)

Yes, you are right. This EtherExpress is integrated on the motherboard, so
we couldn't remove it entirely :( But there isn't any cable connected to
it, and we don't have its driver in the kernel either. The same goes for
the USB. Do you think they could cause some kind of conflict on their
own, without actually being used?

Currently, after changing the processors to Katmais and disabling the
APIC and some of the other unused peripherals, the system seems to be
more stable. If the errors come back, I'll try compiling in the USB
driver as well.


> It's not impossible for an I/O APIC to lose an EOI message if there are
> severe errors during the transmission -- since you already tried
> 2.4.0-test*: have you seen any APIC errors in the syslog?

Nope, we didn't get any errors from the APIC itself. But both network
cards in use (i.e. not the EtherExpress) reported errors related to
interrupts.
3Com:
kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
kernel: Flags; bus-master 1, full 0; dirty 112(0) current 112(0).
kernel: Transmit list 00000000 vs. f20ac200.
kernel: 0: @f20ac200 length 80000036 status
...
kernel: 15: @f20ac2f0 length 80000042 status
kernel: eth0: Resetting the Tx ring pointer.


DLink:
Kernel Panic: skput:over: a00f8d9b: 1526 put: 66 dev: eth1
In interrupt handler - not syncing

after SysRQ+ALT+u:
Aiee, killing interrupt handler
Unable to handle kernel NULL pointer dereference

Thanx for the USB patch, I'll try it also.


+--------------------------------------------+
| Nagy Attila |
| mailto:[email protected] |
+--------------------------------------------+

2000-11-29 12:24:48

by Maciej W. Rozycki

Subject: Re: PROBLEM: crashing kernels

On Tue, 28 Nov 2000, Mr. Big wrote:

> Yes You are right. This Ether Express is integrated on the motherboard, so
> we couldn't get it out totally :( But there isn't any cable connected to
> it, we also doesn't have the driver in the kernel. This is the same with
> the USB too. Do You mind, that they could have some kind of conflict just
> on they own, without being really used?

An unhandled device may keep its IRQ line asserted. It shouldn't, but
experience shows that PIIX* USB host adapters tend to do so. If the line
is shared with another device and the device's driver enables the IRQ in
the interrupt controller, the line will stay asserted all the time due to
the other device. For APIC systems it will result in a heavy load of the
inter-APIC bus -- interrupt and end of interrupt (EOI) messages will be
sent over and over again. The former ones are long (i.e. last for many
cycles) and incur a priority arbitration which involves all APICs in the
system. Such a load leads to weird behaviour.

> Currently after changing the processors to Katmais, and disabling the
> apic, and some of the other unused peripheries, it seems, that the system
> is more stable. If the errors come again, I'll try to compile the USB
> driver also.

You need not -- the patch I gave you disables the USB interrupt within
the host adapter -- it is no longer asserted and thus it does not affect
the rest of the system.

It is possible there are APIC errata we hit. It would be interesting to
see a dump of APIC registers when the lockup conditions appear. I suppose
you cannot debug your system as it's used in production, right? Otherwise
I have a debugging patch that may be used to dump all APIC registers --
that would let me know if the interrupt is stuck in a local APIC or if
there is a synchronization problem, i.e. the local APIC considers the
interrupt finished while the originating I/O APIC does not (it must
receive an EOI message for the interrupt to be completely processed).

> Nope, specially we didn't get errors from the APIC himself. But both

Good -- it means your inter-APIC bus is clean, so there is little
chance for an undetected error to slip through. All APIC messages are
checksummed and simple errors get detected. Severe corruption might
not be, and that does happen on noisy buses.

> network cards (except the EtherExpress) were saying errors considering to
> interrupts.
> 3Com:
> kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another
> device?
> kernel: Flags; bus-master 1, full 0; dirty 112(0) current 112(0).
> kernel: Transmit list 00000000 vs. f20ac200.
> kernel: 0: @f20ac200 length 80000036 status
> ...
> kernel: 15: @f20ac2f0 length 80000042 status
> kernel: eth0: Resetting the Tx ring pointer.

I would need a dump of APIC registers. Following is a patch (credit goes
to Andrew Morton for the idea and the original implementation) that adds a
SysRq feature, which allows you to dump APIC and PIC registers upon
hitting SysRq+A. The patch requires a recent 2.4.0-test* (it should work
from about -test2 up -- you don't have to use -test11 exactly) kernel to
work. If you were able to grab such a dump under the above conditions it
might help tracking the problem down. Please use `dmesg -s 32768' to grab
the log (most of the messages are logged at the debug priority level and
are likely not to appear in the syslog).

> Thanx for the USB patch, I'll try it also.

If you do not use the Intel card (i.e. you do not share an unhandled IRQ
with an active device) you should not be affected by the problem it
addresses. But you should check whether you really have no other device
with its IRQ enabled but unhandled.

You appear to have the PIIX4 ISA bridge -- it contains an embedded ACPI
controller. It has its IRQ hardwired to 11. Try to move device IRQs
away from it (by setting IRQ 11 to "ISA/Legacy" in BIOS). While I've not
heard of ACPI IRQ problems so far, it does not mean they never happen.

You might be able to check /proc/stat, possibly with the `procinfo'
program, to see whether any counters are incrementing for IRQs that are
unhandled. /proc/stat, as opposed to /proc/interrupts, shows all
IRQs -- the latter displays only the ones with a driver.
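
The /proc/stat check can also be done with a few lines of C instead of
`procinfo'. This is a minimal sketch, assuming the usual layout of the
line, "intr <total> <irq0> <irq1> ...". The function names and buffer
sizes are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse an "intr" line from /proc/stat: "intr <total> <irq0> <irq1> ...".
 * Stores up to max per-IRQ counters in counts[] and returns how many were
 * stored, or -1 if the line is not a well-formed "intr" line.
 */
int parse_intr_line(const char *line, unsigned long *counts, int max)
{
    const char *p;
    char *end;
    int n = 0;

    assert(line != NULL && counts != NULL);
    if (strncmp(line, "intr ", 5) != 0)
        return -1;

    p = line + 5;
    (void)strtoul(p, &end, 10);         /* skip the grand total */
    if (end == p)
        return -1;
    p = end;

    while (n < max) {
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)                   /* no more numbers on the line */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/*
 * Read the file at path (normally /proc/stat) and print every IRQ whose
 * counter is non-zero. Returns the number of counters parsed, or -1 if
 * the file could not be opened or has no "intr" line.
 */
int print_active_irqs(const char *path, FILE *out)
{
    char line[4096];                    /* longer lines get truncated */
    unsigned long counts[256];
    FILE *f = fopen(path, "r");
    int i, n = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        n = parse_intr_line(line, counts, 256);
        if (n >= 0)
            break;
    }
    fclose(f);

    for (i = 0; i < n; i++)
        if (counts[i] != 0)
            fprintf(out, "IRQ %d: %lu\n", i, counts[i]);
    return n;
}
```

Running print_active_irqs("/proc/stat", stdout) twice a few seconds apart
and comparing the result with /proc/interrupts should point at any IRQ
that keeps counting although no driver has claimed it.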

Good luck,

Maciej

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

diff -up --recursive --new-file linux-2.4.0-test11.macro/arch/i386/kernel/io_apic.c linux-2.4.0-test11/arch/i386/kernel/io_apic.c
--- linux-2.4.0-test11.macro/arch/i386/kernel/io_apic.c Thu Oct 5 21:08:17 2000
+++ linux-2.4.0-test11/arch/i386/kernel/io_apic.c Sun Nov 26 12:39:01 2000
@@ -692,7 +692,7 @@ void __init UNEXPECTED_IO_APIC(void)
printk(KERN_WARNING " to [email protected]\n");
}

-void __init print_IO_APIC(void)
+void /*__init*/ print_IO_APIC(void)
{
int apic, i;
struct IO_APIC_reg_00 reg_00;
diff -up --recursive --new-file linux-2.4.0-test11.macro/drivers/char/sysrq.c linux-2.4.0-test11/drivers/char/sysrq.c
--- linux-2.4.0-test11.macro/drivers/char/sysrq.c Tue Nov 14 10:24:52 2000
+++ linux-2.4.0-test11/drivers/char/sysrq.c Sun Nov 26 12:42:11 2000
@@ -72,6 +72,15 @@ void handle_sysrq(int key, struct pt_reg
console_loglevel = 7;
printk(KERN_INFO "SysRq: ");
switch (key) {
+	case 'a':
+		printk("\n");
+		printk("print_PIC()\n");
+		print_PIC();
+		printk("print_IO_APIC()\n");
+		print_IO_APIC();
+		printk("print_all_local_APICs()\n");
+		print_all_local_APICs();
+		break;
case 'r': /* R -- Reset raw mode */
if (kbd) {
kbd->kbdmode = VC_XLATE;

2000-11-30 11:27:06

by Maciej W. Rozycki

Subject: Re: PROBLEM: crashing kernels

On Wed, 29 Nov 2000, Maciej W. Rozycki wrote:

> You appear to have the PIIX4 ISA bridge -- it contains an embedded ACPI
> controller. It has it's IRQ hardwired to 11. Try to move device IRQs
> away from it (by setting IRQ 11 to "ISA/Legacy" in BIOS). While I've not
> heard of ACPI IRQ problems so far, it does not mean they never happen.

This is of course incorrect -- the ACPI controller of PIIX4 uses IRQ 9.
I have no idea why I thought it uses IRQ 11, yesterday... :-(

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +