2002-08-01 12:21:50

by Lars Schmitt

[permalink] [raw]
Subject: Kernel panic on Dual Athlon MP


Hello,

We are running a number of dual Athlon MP machines with Tyan Tiger MPX,
3ware RAID controllers and 3COM 3C996T GigE. These machines have serious
stability problems and crash in particular when large file copies from
the RAID to the network are initiated.

I have been trying the following RH kernels:

- 2.4.9-31 with the bcm5700 driver
- 2.4.18-3 with the tg3 driver

The result essentially is the same. It crashes with kernel panic -
see the message appended below.

For both kernels the preceding oops points to the network driver's
interrupt routine, although the drivers are different.

> On 2002-03-15 18:54:24 Alan Cox <[email protected]> wrote:
>> Some gige cards don't seem to work with some dual athlon bioses. Other than
>> that it should be fine

Could you specify that a bit more? In particular, am I likely to
have such an unlucky combination? Would a BIOS upgrade help or
should I get a different GigE card?

It seems that the kernel option "mem=nopentium" improves the situation
a bit but does not remove the crashes completely. Somebody suggested
to also use the "noapic" option, which however would reduce the
performance of the system. Could you comment on this.

> On 2002-07-22 15:35:18 Alan Cox <[email protected]> wrote:
>> In your case thats the memory handling stuff that the NVidia module
>> happens to trip. Its actually a kernel related thing not their fault.
>> There is a horrible work around in 2.4.19-rc2, and people are working on
>> the right fixes for this.

Could you explain the changes in 2.4.19-rc2 (or was it rc3?) ?
Will the right fixes already be in 2.4.19-final?

Thanks very much for your help.

Yours,
Lars Schmitt

PS: I don't quite understand, why there are the warnings from ksymoops
not loading certain modules.
-----------
Dr. Lars Schmitt
Lars.Schmitt<@>ph.tum.de Lars.Schmitt<@>cern.ch
TU Muenchen, Physik-Department, E18 CERN, Div. EP
James-Franck-Str. 1, 85747 Garching CH-1211 Geneva 23

----------
ksymoops 2.4.1 on i686 2.4.9-31.1.cernsmp. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.9-31.1.cernsmp/ (default)
-m /boot/System.map-2.4.9-31.1.cernsmp (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Error (expand_objects): cannot stat(/lib/ext3.o) for ext3
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/jbd.o) for jbd
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/3w-xxxx.o) for 3w-xxxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
Warning (compare_maps): ksyms_base symbol GPLONLY_IO_APIC_get_PCI_irq_vector not found in System.map. Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c01c2ca0, System.map says c01601a0. Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol nlmsvc_grace_period , lockd says f8a92a74, /lib/modules/2.4.9-31.1.cernsmp/kernel/fs/lockd/lockd.o says f8a91ed4. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_ops , lockd says f8a92a70, /lib/modules/2.4.9-31.1.cernsmp/kernel/fs/lockd/lockd.o says f8a91ed0. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_timeout , lockd says f8a92a78, /lib/modules/2.4.9-31.1.cernsmp/kernel/fs/lockd/lockd.o says f8a91ed8. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nfs_debug , sunrpc says f8ab8c60, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8940. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nfsd_debug , sunrpc says f8ab8c64, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8944. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nlm_debug , sunrpc says f8ab8c68, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8948. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_debug , sunrpc says f8ab8c5c, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab893c. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_garbage_args , sunrpc says f8ab8c3c, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab891c. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_success , sunrpc says f8ab8c2c, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab890c. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_system_err , sunrpc says f8ab8c40, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8920. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_one , sunrpc says f8ab8c24, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8904. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_two , sunrpc says f8ab8c28, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8908. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_zero , sunrpc says f8ab8c20, /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o says f8ab8900. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/net/sunrpc/sunrpc.o entry
Warning (map_ksym_to_module): cannot match loaded module ext3 to a unique module object. Trace may not be reliable.
Warning (compare_maps): mismatch on symbol tw_device_extension_list , 3w-xxxx says f8825600, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/3w-xxxx.o says f8825480. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/3w-xxxx.o entry
Warning (compare_maps): mismatch on symbol sd , sd_mod says f881cda0, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/sd_mod.o says f881cd00. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/sd_mod.o entry
Warning (compare_maps): mismatch on symbol proc_scsi , scsi_mod says f881827c, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o says f8816ab4. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o entry
Warning (compare_maps): mismatch on symbol scsi_devicelist , scsi_mod says f88182a8, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o says f8816ae0. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o entry
Warning (compare_maps): mismatch on symbol scsi_hostlist , scsi_mod says f88182a4, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o says f8816adc. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o entry
Warning (compare_maps): mismatch on symbol scsi_hosts , scsi_mod says f88182ac, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o says f8816ae4. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o entry
Warning (compare_maps): mismatch on symbol scsi_logging_level , scsi_mod says f8818278, /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o says f8816ab0. Ignoring /lib/modules/2.4.9-31.1.cernsmp/kernel/drivers/scsi/scsi_mod.o entry
Unable to handle kernel NULL pointer dereference at virtual address 00000000
f8a52ebb
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[3c59x:__insmod_3c59x_O/lib/modules/2.4.9-31.1.cernsmp/kernel/driv+-106821/96] Tainted: PF
EIP: 0010:[<f8a52ebb>] Tainted: PF
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000016 ebx: f4c88140 ecx: 00000005 edx: 0000010f
esi: 00000000 edi: f4c8a45c ebp: 0000000b esp: c02f1f2c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c02f1000)
Stack: f4c88140 0000001d f4c88000 f8a4b73f f4c88140 f6ddc140 00000000 04000001
c0108ace 0000000b f4c88000 c02f1f94 c02f1f94 c0331ac0 0000000b 00000000
c0108cd4 0000000b c02f1f94 f6ddc140 f6ddc140 c0105420 c02f0000 c02f0000
Call Trace: [3c59x:__insmod_3c59x_O/lib/modules/2.4.9-31.1.cernsmp/kernel/driv+-137409/96] bcm5700_probe [bcm5700] 0x10df
Call Trace: [<f8a4b73f>] bcm5700_probe [bcm5700] 0x10df
[<c0108ace>] handle_IRQ_event [kernel] 0x5e
[<c0108cd4>] do_IRQ [kernel] 0xa4
[<c0105420>] default_idle [kernel] 0x0
[<c0105420>] default_idle [kernel] 0x0
[<c022cb7c>] call_do_IRQ [kernel] 0x5
[<c0105420>] default_idle [kernel] 0x0
[<c0105420>] default_idle [kernel] 0x0
[<c010544d>] default_idle [kernel] 0x2d
[<c01054d2>] cpu_idle [kernel] 0x52
[<c0105000>] stext [kernel] 0x0
[<c022bc80>] .rodata.str1.32 [kernel] 0x4a0
Code: c7 06 00 00 00 00 8b 93 94 2c 00 00 8d 83 90 2c 00 00 85 d2

>>EIP; f8a52ebb <[bcm5700]LM_ServiceInterrupts+24b/2e0> <=====
Trace; f8a4b73f <[bcm5700]bcm5700_interrupt+bf/220>
Trace; c0108ace <handle_IRQ_event+5e/90>
Trace; c0108cd4 <do_IRQ+a4/f0>
Trace; c0105420 <default_idle+0/40>
Trace; c0105420 <default_idle+0/40>
Trace; c022cb7c <call_do_IRQ+5/d>
Trace; c0105420 <default_idle+0/40>
Trace; c0105420 <default_idle+0/40>
Trace; c010544d <default_idle+2d/40>
Trace; c01054d2 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>
Trace; c022bc80 <llc_oui+870/1748>
Code; f8a52ebb <[bcm5700]LM_ServiceInterrupts+24b/2e0>
00000000 <_EIP>:
Code; f8a52ebb <[bcm5700]LM_ServiceInterrupts+24b/2e0> <=====
0: c7 06 00 00 00 00 movl $0x0,(%esi) <=====
Code; f8a52ec1 <[bcm5700]LM_ServiceInterrupts+251/2e0>
6: 8b 93 94 2c 00 00 mov 0x2c94(%ebx),%edx
Code; f8a52ec7 <[bcm5700]LM_ServiceInterrupts+257/2e0>
c: 8d 83 90 2c 00 00 lea 0x2c90(%ebx),%eax
Code; f8a52ecd <[bcm5700]LM_ServiceInterrupts+25d/2e0>
12: 85 d2 test %edx,%edx

<0>Kernel panic: Aiee, killing interrupt handler!

24 warnings and 5 errors issued. Results may not be reliable.


2002-08-01 13:14:25

by Alan

[permalink] [raw]
Subject: Re: Kernel panic on Dual Athlon MP

On Thu, 2002-08-01 at 13:24, Lars Schmitt wrote:
> > On 2002-03-15 18:54:24 Alan Cox <[email protected]> wrote:
> >> Some gige cards don't seem to work with some dual athlon bioses. Other than
> >> that it should be fine
>
> Could you specify that a bit more? In particular, am I likely to
> have such an unlucky combination? Would a BIOS upgrade help or
> should I get a different GigE card?

My board won't even POST with a tg3 card in it. With a newer BIOS it
passes the POST test and seems to work with the kernel fixups for the
PCI bridges. There are also multiple people who had 3ware problems with
very fast machines, but I gather the latest driver merge fixed that.

So 2.4.9 I'd certainly expect to fail dismally. 2.4.18 ought to be ok as
should 2.4.19rc4. I run my board with an aacraid scsi and with netgear
ethernet and its stable.

Alan

2002-08-01 17:36:10

by Kwijibo

[permalink] [raw]
Subject: Re: Kernel panic on Dual Athlon MP

Alan Cox wrote:

>My board won't even POST with a tg3 card in it. With a newer BIOS it
>
Great, I had plans to put a tg3 card in my dual Athlon box.
Do you know if this is a Tiger board only problem? I have
the Thunder S2462UNG with just the MP chipset. How
new of a bios? I don't plan to use the tg3 card on the box
until 2.4.19 comes out due to the important fixes in it for
the tg3.

>passes the POST test and seems to work with the kernel fixups for the
>PCI bridges. There are also multiple people who had 3ware problems with
>
I have 2 3ware 7850's on a dual Athlon MP 1900 box and I have
no problems yet with it. I am about to increase the network to
gigi speed and I hope it has no problems. Currently this is
with 2.4.18.

>very fast machines, but I gather the latest driver merge fixed that.
>
>So 2.4.9 I'd certainly expect to fail dismally. 2.4.18 ought to be ok as
>should 2.4.19rc4. I run my board with an aacraid scsi and with netgear
>ethernet and its stable.
>
>Alan
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>
>
>
Steve


2002-08-01 17:45:22

by Alan

[permalink] [raw]
Subject: Re: Kernel panic on Dual Athlon MP

On Thu, 2002-08-01 at 18:48, [email protected] wrote:
> >My board won't even POST with a tg3 card in it. With a newer BIOS it
> >
> Great, I had plans to put a tg3 card in my dual Athlon box.
> Do you know if this is a Tiger board only problem? I have
> the Thunder S2462UNG with just the MP chipset. How
> new of a bios? I don't plan to use the tg3 card on the box
> until 2.4.19 comes out due to the important fixes in it for
> the tg3.

The BIOS in mine is pretty old and newer BIOSes fix it. As I understand
it this was caused by PCI misconfiguration in the BIOS. I've not tested
current compatibility myself so I'm going on what I've heard there.

> I have 2 3ware 7850's on a dual Athlon MP 1900 box and I have
> no problems yet with it. I am about to increase the network to
> gigi speed and I hope it has no problems. Currently this is
> with 2.4.18.

Good to hear


2002-08-01 18:57:00

by Roland Kuhn

[permalink] [raw]
Subject: Re: Kernel panic on Dual Athlon MP

Hi!

Because of his knowledge about this NIC I've included Dave Miller.

On Thu, 1 Aug 2002, Lars Schmitt wrote:

> Hello,
>
> We are running a number of dual Athlon MP machines with Tyan Tiger MPX,
> 3ware RAID controllers and 3COM 3C996T GigE. These machines have serious
> stability problems and crash in particular when large file copies from
> the RAID to the network are initiated.
>
> I have been trying the following RH kernels:
>
> - 2.4.9-31 with the bcm5700 driver
> - 2.4.18-3 with the tg3 driver
>
> The result essentially is the same. It crashes with kernel panic -
> see the message appended below.
>
Some deeper investigation shows that both drivers suffer from the same
problem (terminology below is from tg3.c):

hw_idx and sw_idx differ when entering tg3_tx(), but there is no skb in
the tx_buffer. In case of tg3 this triggers the BUG, part of which you
find below (the rest had scrolled off the screen); the bcm5700 driver
does not check this and fails upon pPacket->PacketStatus=0 in
LM_ServiceTxInterrupt(), pPacket being NULL.

Is it possible that this is a hardware problem, possibly the NIC writing
bogus values to tp->hw_status? Or is it a locking problem somewhere else?
I'm not a kernel hacker, so I don't have the full picture. If there's
information missing I would happily supply it, e.g. if you need the full
dump of the hardware status or whatever; you would only have to tell me
how to get this out of the machine.

Now here's the (partial) BUG message (tg3.c, line 1510): {the .LC13 is an
artifact, it's the address of the string "tg3.c" for do_BUG()}

EIP is at tg3_tx [tg3] 0x5e (2.4.18-3smp)
eax: 0000001a ebx: f7832410 ecx: c02eb840 edx: 00005226
esi: 00000182 edi: 00000000 ebp: f2dc6560 esp: c0307f10
ds: 0018 es: 0018 ss: 0018
Stack: f8a531e4 000005e6 c0306000 00000186 c04e5000 f2dc6560 04000001 0000000b
f8a4c533 f2dc6560 f2dc6560 c04e5000 f8a4c5b4 f2dc6560 f7e2b900 00000000
c010a52e 0000000b f2dc6400 c0307f98 c0307f98 c0356ac0 0000000b 00000000
CallTrace: [<f8a531e4>] .LC13 [tg3] 0x0
[<f8a4c533>] tg3_interrupt_main_work [tg3] 0x53
[<f8a4c5b4>] tg3_interrupt [tg3] 0x44
[<c010a52e>] handle_IRQ_event [kernel] 0x5e
[<c010a734>] do_IRQ [kernel] 0xa4
[<c0106e70>] default_idle [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0106e70>] default_idle [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0106e9c>] default_idle [kernel] 0x2c
[<c0106ef4>] cpu_idle [kernel] 0x24
Code: 0f 0b 59 58 c7 03 00 00 00 00 8b 87 90 00 00 00 46 31 db eb
<6>Aieee, killing interrupt handler
In interrupt handler - not syncing

Ciao,
Roland

+---------------------------+-------------------------+
| TU Muenchen | |
| Physik-Department E18 | Raum 3558 |
| James-Franck-Str. | Telefon 089/289-12592 |
| 85747 Garching | |
+---------------------------+-------------------------+