2003-01-22 23:18:45

by Jacek Radajewski

Subject: RE: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS YET AGAIN

is the network card really the problem ? I don't want to be replacing all my network cards if the problem is elsewhere .... if you can understand the oops message please, please, please let me know where the problem is ...


another oops:
---------------------- cut ----------------------------------------------------------------

ksymoops 2.4.4 on i686 2.4.18-19.7.xsmp. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.18-19.7.xsmp/ (default)
-m /boot/System.map-2.4.18-19.7.xsmp (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Error (expand_objects): cannot stat(/lib/ext3.o) for ext3
Error (expand_objects): cannot stat(/lib/jbd.o) for jbd
Error (expand_objects): cannot stat(/lib/aacraid.o) for aacraid
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Warning (map_ksym_to_module): cannot match loaded module ext3 to a unique module object. Trace may not be reliable.
Warning (map_ksym_to_module): cannot match loaded module aacraid to a unique module object. Trace may not be reliable.
Unable to handle kernel NULL pointer dereference at virtual address 00000007
f897f51d
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<f897f51d>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: f6a2dc00 ebx: 00000000 ecx: ffffffff edx: c03fdc04
esi: c03fdc04 edi: 00000000 ebp: dd70b89c esp: c0349ee4
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0349000)
Stack: c01ac741 c03fdc04 dd70b89c 00000000 00000000 c03fdbc0 00000000 c03fdc04
c03fdbc0 dd70b89c c03fdc04 0000000e c01acacc c03fdc04 dd70b89c c3684e80
c03fdc04 00000296 c03fdbc0 c01acf99 c3684e80 0000000e f897f2f0 c36b0d60
Call Trace: [<c01ac741>] start_request [kernel] 0x1a1 (0xc0349ee4))
[<c01acacc>] ide_do_request [kernel] 0x29c (0xc0349f14))
[<c01acf99>] ide_intr [kernel] 0x129 (0xc0349f30))
[<f897f2f0>] cdrom_pc_intr [ide-cd] 0x0 (0xc0349f3c))
[<c010a61e>] handle_IRQ_event [kernel] 0x5e (0xc0349f50))
[<c010a852>] do_IRQ [kernel] 0xc2 (0xc0349f70))
[<c0106e60>] default_idle [kernel] 0x0 (0xc0349f88))
[<c0105000>] stext [kernel] 0x0 (0xc0349f8c))
[<c010d058>] call_do_IRQ [kernel] 0x5 (0xc0349f94))
[<c0106e60>] default_idle [kernel] 0x0 (0xc0349fa4))
[<c0105000>] stext [kernel] 0x0 (0xc0349fa8))
[<c0106e8c>] default_idle [kernel] 0x2c (0xc0349fc0))
[<c0106ef4>] cpu_idle [kernel] 0x24 (0xc0349fcc))
Code: c7 41 08 00 00 00 00 68 b0 f4 97 f8 8b 41 04 50 52 e8 8d f2

>>EIP; f897f51d <.data.end+6f1e/????> <=====
Trace; c01ac741 <start_request+1a1/210>
Trace; c01acacc <ide_do_request+29c/2f0>
Trace; c01acf99 <ide_intr+129/160>
Trace; f897f2f0 <.data.end+6cf1/????>
Trace; c010a61e <handle_IRQ_event+5e/90>
Trace; c010a852 <do_IRQ+c2/110>
Trace; c0106e60 <default_idle+0/40>
Trace; c0105000 <_stext+0/0>
Trace; c010d058 <call_do_IRQ+5/d>
Trace; c0106e60 <default_idle+0/40>
Trace; c0105000 <_stext+0/0>
Trace; c0106e8c <default_idle+2c/40>
Trace; c0106ef4 <cpu_idle+24/30>
Code; f897f51d <.data.end+6f1e/????>
00000000 <_EIP>:
Code; f897f51d <.data.end+6f1e/????> <=====
0: c7 41 08 00 00 00 00 movl $0x0,0x8(%ecx) <=====
Code; f897f524 <.data.end+6f25/????>
7: 68 b0 f4 97 f8 push $0xf897f4b0
Code; f897f529 <.data.end+6f2a/????>
c: 8b 41 04 mov 0x4(%ecx),%eax
Code; f897f52c <.data.end+6f2d/????>
f: 50 push %eax
Code; f897f52d <.data.end+6f2e/????>
10: 52 push %edx
Code; f897f52e <.data.end+6f2f/????>
11: e8 8d f2 00 00 call f2a3 <_EIP+0xf2a3> f898e7c0 <END_OF_CODE+161c1/????>

<0>Kernel panic: Aiee, killing interrupt handler!

3 warnings and 5 errors issued. Results may not be reliable.
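
A note on reading the dump above: the faulting address and the "Oops: 0002" code can be checked by hand. The sketch below is an illustration only. The helper name decode_page_fault_code is hypothetical, and the bit meanings it assumes are the standard i386 page-fault error-code bits (bit 0 = page present, bit 1 = write access, bit 2 = user mode).

```python
# Register value and faulting instruction taken from the oops above:
#   ecx: ffffffff
#   movl $0x0,0x8(%ecx)
ECX = 0xFFFFFFFF
DISPLACEMENT = 0x8

# 32-bit address arithmetic wraps around, so ecx + 8 lands just above zero.
fault_address = (ECX + DISPLACEMENT) & 0xFFFFFFFF


def decode_page_fault_code(code):
    """Split an i386 page-fault error code into its three low bits."""
    return {
        "page_present": bool(code & 0x1),
        "write_access": bool(code & 0x2),
        "user_mode": bool(code & 0x4),
    }


print(f"{fault_address:08x}")
print(decode_page_fault_code(0x0002))
```

The computed address matches the reported "virtual address 00000007", so ecx held -1 rather than a valid pointer when the store ran. The error code describes a kernel-mode write to a page that is not mapped. None of this says which driver put -1 there.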


-----Original Message-----
From: Seth Mos [mailto:[email protected]]
Sent: Wednesday, 22 January 2003 7:50 PM
To: Jacek Radajewski
Cc: [email protected]
Subject: RE: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS AGAIN


At 15:13 22-1-2003 +1000, you wrote:
>Hi all,
>
>I've been running Linux on dell hardware for almost 5 years now and never
>had any problems. The number of crashes I've experienced recently makes
>our boxes unfit for production and therefore worthless.

After a relatively short period (3 months) we replaced the broadcom cards
with Intel e1000 cards. We disabled the onboard cards as well and stuck an
e1000 in there. We have had 0 network-related crashes since then. Only my
testbox still has a broadcom card in it.

I do have 1 e1000 (out of 5) card that seems to have some network errors
which happen during the nightly NFS backup. I suspect a cabling issue.

10:47am up 65 days, 22:15, 38 users
eth0 - Intel(R) PRO/1000 Network Driver - version 4.3.2-k1 NAPI (020618)
RX packets:950232292 errors:56443 dropped:56443 overruns:68 frame:0
TX packets:357551335 errors:0 dropped:0 overruns:0 carrier:0

The thing is connected to a 3com 6 port gigabit switch and is _not_ using
jumbo frames.
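
To put the RX error count above in proportion, the counters can be turned into a rate. This is a quick sketch only: the function name rx_error_rate is made up for illustration, and the regular expression assumes the counter layout shown in the ifconfig output above.

```python
import re

# RX counter line copied from the ifconfig output above.
IFCONFIG_RX = "RX packets:950232292 errors:56443 dropped:56443 overruns:68 frame:0"


def rx_error_rate(line):
    """Return errors / packets for an ifconfig 'RX packets:...' line."""
    match = re.search(r"RX packets:(\d+) errors:(\d+)", line)
    if match is None:
        raise ValueError("not an ifconfig RX counter line")
    packets, errors = int(match.group(1)), int(match.group(2))
    return errors / packets


rate = rx_error_rate(IFCONFIG_RX)
print(f"{rate:.6%} errors relative to received packets")
```

That comes to roughly 0.006%, or about one error per 16,800 packets received over the 65 days of uptime.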

Cheers

--
Seth
It might just be your lucky day, if you only knew.


2003-01-22 23:44:18

by jason andrade

Subject: RE: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS YET AGAIN

On Thu, 23 Jan 2003, Jacek Radajewski wrote:

> is the network card really the problem ? I don't want to be replacing all my network cards if the problem is elsewhere .... if you can understand the oops message please, please, please let me know where the problem is ...
>

Jacek,

To date there are about 20 replies that say they have had some degree of problems
with broadcom chipset based network interfaces and about 2 that say it works without
any problems for them. All of the people having problems say it ranges from interface
issues, to causing the entire machine to panic or worse, to hang until power cycled
or reset.

Based on the information supplied I am inclined to think there are issues with the
broadcom chipset despite the best efforts of people like Jeff Garzik to address
this (but perhaps he can step in and comment as he knows more since he's the one
writing/supporting/fixing the drivers at redhat :-)

To date, a lot of people have said that they disable the onboard broadcom nics and
use intel e1000s instead. We have been using the Intel e100/e1000s (with the
intel supplied drivers dropped in, not the default redhat ones) at planetmirror.com
for 2+ years now without any problems. Those cards have been doing between 70 and
100Mbit/sec (and more) 24 by 7 that whole time.

regards,

-jason

2003-01-23 02:12:00

by Jeff Garzik

Subject: Re: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS YET AGAIN

Jacek Radajewski wrote:
> is the network card really the problem ? I don't want to be replacing all my network cards if the problem is elsewhere .... if you can understand the oops message please, please, please let me know where the problem is ...

> Trace; c01ac741 <start_request+1a1/210>
> Trace; c01acacc <ide_do_request+29c/2f0>
> Trace; c01acf99 <ide_intr+129/160>
> Trace; f897f2f0 <.data.end+6cf1/????>
> Trace; c010a61e <handle_IRQ_event+5e/90>
> Trace; c010a852 <do_IRQ+c2/110>
> Trace; c0106e60 <default_idle+0/40>
> Trace; c0105000 <_stext+0/0>
> Trace; c010d058 <call_do_IRQ+5/d>
> Trace; c0106e60 <default_idle+0/40>
> Trace; c0105000 <_stext+0/0>
> Trace; c0106e8c <default_idle+2c/40>
> Trace; c0106ef4 <cpu_idle+24/30>


nope, that trace has nothing to do with the network stack or net card...

Jeff
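
The point above can be checked mechanically from the raw Call Trace lines in the oops, which carry the owning module in square brackets. A minimal sketch, assuming the trace line layout shown earlier in the thread; parse_call_trace is a hypothetical helper, not part of ksymoops.

```python
import re

# First six frames of the raw Call Trace, copied from the oops in this thread.
CALL_TRACE = """\
Call Trace: [<c01ac741>] start_request [kernel] 0x1a1 (0xc0349ee4))
[<c01acacc>] ide_do_request [kernel] 0x29c (0xc0349f14))
[<c01acf99>] ide_intr [kernel] 0x129 (0xc0349f30))
[<f897f2f0>] cdrom_pc_intr [ide-cd] 0x0 (0xc0349f3c))
[<c010a61e>] handle_IRQ_event [kernel] 0x5e (0xc0349f50))
[<c010a852>] do_IRQ [kernel] 0xc2 (0xc0349f70))
"""

# One frame: [<address>] symbol [module]
FRAME = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[([^\]]+)\]")


def parse_call_trace(text):
    """Return a list of (address, symbol, module) tuples, one per frame."""
    return [(int(addr, 16), sym, mod) for addr, sym, mod in FRAME.findall(text)]


frames = parse_call_trace(CALL_TRACE)
modules = {module for _, _, module in frames}
print(sorted(modules))

# Distance from the cdrom_pc_intr frame address to the faulting EIP.
EIP = 0xF897F51D
cdrom_pc_intr = next(addr for addr, sym, _ in frames if sym == "cdrom_pc_intr")
print(hex(EIP - cdrom_pc_intr))
```

Only kernel and ide-cd show up, with no tg3 frame anywhere. The faulting EIP also sits a few hundred bytes above the cdrom_pc_intr address, which hints at (but does not prove) the IDE CD-ROM path rather than the network driver.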



2003-01-23 12:17:48

by Mikael Pettersson

Subject: RE: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS YET AGAIN

jason andrade writes:
> On Thu, 23 Jan 2003, Jacek Radajewski wrote:
>
> > is the network card really the problem ? I don't want to be replacing all my network cards if the problem is elsewhere .... if you can understand the oops message please, please, please let me know where the problem is ...
> >
>
> Jacek,
>
> To date there are about 20 replies that say they have had some degree of problems
> with broadcom chipset based network interfaces and about 2 that say it works without
> any problems for them. All of the people having problems say it ranges from interface
> issues, to causing the entire machine to panic or worse, to hang until power cycled
> or reset.

For the record, _our_ Dell PE 2650 has been running RH7.3 and RH8.0 since August,
and it's been solid as a rock. Neither the Broadcom NIC nor the tg3 driver has
ever given us any problems.

2003-01-23 12:49:02

by Michael Shuey

Subject: Re: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS YET AGAIN

On Thu, Jan 23, 2003 at 09:27:40AM +1000, Jacek Radajewski wrote:
> is the network card really the problem ? I don't want to be replacing all my network cards if the problem is elsewhere .... if you can understand the oops message please, please, please let me know where the problem is ...

You get oops messages? You're lucky - our PE 2650s would just lock up solid.
No oops message, no crash dumps (if we used a kernel with that patch), no
console messages, nothing. It would happen every 4-6 hours (and much sooner
when we tried a production-level amount of IO to the machine). At the time
we were using 2.4.18-18.7.x from RedHat 7.3.

Not sure if it was the network card (tg3) or the RAID adapter (aacraid). We
switched to 2.4.20, built with the same options (well, all that apply at any
rate) as the RedHat kernel. We haven't had a single problem since. You might
want to give that a try before replacing a pile of gigabit NICs....

--
Mike Shuey

2003-01-24 06:05:10

by GrandMasterLee

Subject: RE: 2650 - tg3 on 2.4.18-19.7.xsmp rh7.3 ... OOPS YET AGAIN

On Thu, 2003-01-23 at 06:26, Mikael Pettersson wrote:
> jason andrade writes:
> > On Thu, 23 Jan 2003, Jacek Radajewski wrote:
> >
> > > is the network card really the problem ? I don't want to be replacing all my network cards if the problem is elsewhere .... if you can understand the oops message please, please, please let me know where the problem is ...
> > >
> >
> > Jacek,
> >
> > To date there are about 20 replies that say they have had some degree of problems
> > with broadcom chipset based network interfaces and about 2 that say it works without
> > any problems for them. All of the people having problems say it ranges from interface
> > issues, to causing the entire machine to panic or worse, to hang until power cycled
> > or reset.
>
> For the record, _our_ Dell PE 2650 has been running RH7.3 and RH8.0 since August,
> and it's been solid as a rock. Neither the Broadcom NIC nor the tg3 driver has
> ever given us any problems.

Given the varying answers on this particular thread, I'd say
that the problem is application dependent. Perhaps we can isolate that
in some way?

--
GrandMasterLee