2003-09-07 22:14:17

by ebiederman

Subject: How do I track TG3 peculiarities?


Running 2.4.21 we are seeing some very weird problems with
the tg3 driver. The test setup is a medium-sized cluster of
256 dual Opteron machines running in 32-bit mode, connected to an Extreme
Networks switch, with all ports running at gigabit speed. The current
problem reproducer is to run Linpack on all of the nodes; I am
looking for something simpler. We have just escalated the problem
from trying a few simple things, like swapping drivers, to serious
debugging.

I will do what tracing I can, but as I am not especially familiar
with the Linux network stack I am asking for suggestions on
reasonable interpretations of events and places to start looking.
This problem is currently a showstopper for us, so we will solve
it one way or another.

Below is one good oops we have captured. I would have to check but
I believe we have updated the tg3 driver in this instance to the
one that comes with 2.4.23-pre3.

The very puzzling part is that in the crashes I don't see the tg3
driver at all, just the network stack. All module addresses according
to /proc/ksyms start with 0xf8, and the tg3 driver was built as
a module.
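
That observation can be checked mechanically. A minimal Python sketch, assuming only what is stated above (module addresses on this machine begin with 0xf8) and using the EIP and call-trace addresses from the oops in this message. The helper name in_module_range is made up for illustration.

```python
# Addresses copied from the EIP and "Call Trace:" lines of the oops in this message.
TRACE = [
    0xC0258575,  # EIP
    0xC0289554, 0xC0258114, 0xC027D3D5, 0xC013E02C, 0xC029DCA2,
    0xC02544C5, 0xC025475E, 0xC0145BE3, 0xC01094AF,
]


def in_module_range(addr, top_byte=0xF8):
    """Return True if the top byte of a 32-bit address equals top_byte.

    Assumption: modules are recognised only by the 0xf8 prefix reported
    from /proc/ksyms on this machine; this is not a general kernel rule.
    """
    return (addr >> 24) == top_byte


module_hits = [hex(a) for a in TRACE if in_module_range(a)]
print(module_hits)  # [] : no address in the trace falls in the module range
```

An empty list only says that none of the recorded addresses lie in module space. It does not rule out a module having corrupted memory earlier.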

I have been having trouble understanding why skb_clone would be called
to transmit a packet. Any ideas?

<1>Unable to handle kernel NULL pointer dereference at virtual address
00000004
printing eip:
c0258575
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0258575>] Not tainted
EFLAGS: 00010016
eax: 00000000 ebx: f7510180 ecx: 00000000 edx: c03f5240
esi: f7616b80 edi: 00000202 ebp: f7bc90e8 esp: f774de44
ds: 0018 es: 0018 ss: 0018
Process init (pid: 15, stackpage=f774d000)
Stack: 00000098 00000286 00000246 f7bc91bc f7616b80 000005a8 c0289554
f7616b80
00000020 f7bc9080 c0258114 00000000 00000001 eb307d90 f7bc91bc
f7bc9080
c027d3d5 f7bc9080 00000001 00000098 00000000 f774dee4 00000000
c0317c34
Call Trace: [<c0289554>] [<c0258114>] [<c027d3d5>] [<c013e02c>]
[<c029dca2>]
[<c02544c5>] [<c025475e>] [<c0145be3>] [<c01094af>]
Code: 89 50 04 89 81 40 52 3f c0 c7 43 04 00 00 00 00 c7 03 00 00
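
For reading the "Oops: 0002" line: on i386 the low three bits of the page-fault error code give the cause, the access type, and the CPU mode. A minimal Python sketch of that decoding follows. The function name decode_i386_oops_code is hypothetical, and the value is parsed as hexadecimal, which is how the kernel prints it.

```python
def decode_i386_oops_code(code):
    """Decode the low three bits of an i386 page-fault error code.

    bit 0: 0 = page not present, 1 = protection fault
    bit 1: 0 = read,             1 = write
    bit 2: 0 = kernel mode,      1 = user mode
    """
    return {
        "cause": "protection fault" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


info = decode_i386_oops_code(int("0002", 16))
print(info)
# {'cause': 'page not present', 'access': 'write', 'mode': 'kernel'}
```

So this oops is a kernel-mode write to an unmapped page, consistent with the reported NULL pointer dereference at address 00000004.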



************************ HERE IS THE ksymoops *************************





ksymoops 2.4.5 on i686 2.4.21-cm26_lnxi2smp. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.21-cm26_lnxi2smp/ (default)
-m /boot/System.map-2.4.21-cm26_lnxi2smp (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Warning (expand_objects): object /tmp/amd76xrom.o for module amd76xrom
has changed since load
ksymoops: No such file or directory
Warning (expand_objects): object /tmp/cfi_cmdset_0002.o for module
cfi_cmdset_0002 has changed since load
Error (expand_objects): cannot stat(/modules/tg3.o) for tg3
Error (expand_objects): cannot stat(/modules/bproc.o) for bproc
ksymoops: No such file or directory
ksymoops: No such file or directory
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/modules/vmadump.o) for vmadump
Error (add_ss_object): cannot stat(/lib/modules/2.4.21-cm26_lnxi2smp/)
Error (regular_file): read_nm_symbols stat
/lib/modules/2.4.21-cm26_lnxi2smp/ failed
ksymoops: No such file or directory
Warning (read_object): no symbols in /lib/modules/2.4.21-cm26_lnxi2smp/
Error (regular_file): read_system_map stat
/boot/System.map-2.4.21-cm26_lnxi2smp failed
ksymoops: No such file or directory
Warning (map_ksym_to_module): cannot match loaded module tg3 to a unique
module object. Trace may not be reliable.
Warning (map_ksym_to_module): cannot match loaded module bproc to a
unique module object. Trace may not be reliable.
Warning (map_ksym_to_module): cannot match loaded module vmadump to a
unique module object. Trace may not be reliable.
Fatal Error (find_fullpath): could not find executable ksymoops
[root@bluesteel conman]# ksymoops /tmp/oops
ksymoops 2.4.5 on i686 2.4.21-cm26_lnxi2smp. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.21-cm26_lnxi2smp/ (default)
-m /boot/System.map-2.4.21-cm26_lnxi2smp (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Error (expand_objects): cannot stat(/lib/ext3.o) for ext3
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/jbd.o) for jbd
ksymoops: No such file or directory
Warning (map_ksym_to_module): cannot match loaded module ext3 to a
unique module object. Trace may not be reliable.
<1>Unable to handle kernel NULL pointer dereference at virtual address
00000004
c0258575
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0258575>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010016
eax: 00000000 ebx: f7510180 ecx: 00000000 edx: c03f5240
esi: f7616b80 edi: 00000202 ebp: f7bc90e8 esp: f774de44
ds: 0018 es: 0018 ss: 0018
Process init (pid: 15, stackpage=f774d000)
Stack: 00000098 00000286 00000246 f7bc91bc f7616b80 000005a8 c0289554
f7616b80
00000020 f7bc9080 c0258114 00000000 00000001 eb307d90 f7bc91bc
f7bc9080
c027d3d5 f7bc9080 00000001 00000098 00000000 f774dee4 00000000
c0317c34
Call Trace: [<c0289554>] [<c0258114>] [<c027d3d5>] [<c013e02c>]
[<c029dca2>]
[<c02544c5>] [<c025475e>] [<c0145be3>] [<c01094af>]
Code: 89 50 04 89 81 40 52 3f c0 c7 43 04 00 00 00 00 c7 03 00 00


>>EIP; c0258575 <skb_clone+45/1d0> <=====

>>ebx; f7510180 <_end+37103a84/38400964>
>>edx; c03f5240 <skb_head_pool+0/1040>
>>esi; f7616b80 <_end+3720a484/38400964>
>>ebp; f7bc90e8 <_end+377bc9ec/38400964>
>>esp; f774de44 <_end+37341748/38400964>

Trace; c0289554 <tcp_write_xmit+184/280>
Trace; c0258114 <alloc_skb+c4/1e0>
Trace; c027d3d5 <tcp_sendmsg+585/10c0>
Trace; c013e02c <__alloc_pages+4c/1a0>
Trace; c029dca2 <inet_sendmsg+42/50>
Trace; c02544c5 <sock_sendmsg+75/c0>
Trace; c025475e <sock_write+ae/c0>
Trace; c0145be3 <sys_write+a3/160>
Trace; c01094af <system_call+33/38>
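
Each ksymoops line above has the form address <symbol+offset/size>, with offset and size in hexadecimal. A minimal Python sketch for pulling those fields apart and recovering the start address of each function. The parser and its name parse_ksymoops_line are illustrative and are not part of ksymoops.

```python
import re

# Matches lines such as "Trace; c0289554 <tcp_write_xmit+184/280>"
# and ">>EIP; c0258575 <skb_clone+45/1d0> <=====".
LINE = re.compile(
    r"^(?:>>)?\w+;\s+([0-9a-f]{8})\s+<([^+>]+)\+([0-9a-f]+)/([0-9a-f]+)>"
)


def parse_ksymoops_line(line):
    """Return symbol, start address, offset and size, or None if no match."""
    m = LINE.match(line)
    if m is None:
        return None
    addr = int(m.group(1), 16)
    offset = int(m.group(3), 16)
    return {
        "symbol": m.group(2),
        "start": addr - offset,  # address the symbol begins at
        "offset": offset,
        "size": int(m.group(4), 16),
    }


eip = parse_ksymoops_line(">>EIP; c0258575 <skb_clone+45/1d0> <=====")
caller = parse_ksymoops_line("Trace; c0289554 <tcp_write_xmit+184/280>")
print(f"{eip['symbol']} starts at {eip['start']:08x}")        # skb_clone starts at c0258530
print(f"{caller['symbol']} starts at {caller['start']:08x}")  # tcp_write_xmit starts at c02893d0
```

Given the warnings ksymoops printed about missing symbol files, the resolved names are only as good as /proc/ksyms on the machine where ksymoops was run.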

Code; c0258575 <skb_clone+45/1d0>
00000000 <_EIP>:
Code; c0258575 <skb_clone+45/1d0> <=====
0: 89 50 04 mov %edx,0x4(%eax) <=====
Code; c0258578 <skb_clone+48/1d0>
3: 89 81 40 52 3f c0 mov %eax,0xc03f5240(%ecx)
Code; c025857e <skb_clone+4e/1d0>
9: c7 43 04 00 00 00 00 movl $0x0,0x4(%ebx)
Code; c0258585 <skb_clone+55/1d0>
10: c7 03 00 00 00 00 movl $0x0,(%ebx)


2 warnings and 2 errors issued. Results may not be reliable.
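
The faulting instruction in the disassembly above is mov %edx,0x4(%eax), a 32-bit store to the address eax + 4. A minimal Python sketch of that arithmetic using the register values from the oops. It only restates the calculation and says nothing about why eax was zero. The helper name store_target is made up.

```python
MASK32 = 0xFFFFFFFF

# Register values copied from the oops register dump.
regs = {"eax": 0x00000000, "edx": 0xC03F5240}


def store_target(base, displacement):
    """Effective address of a store of the form disp(%base) on 32-bit x86."""
    return (base + displacement) & MASK32


fault_addr = store_target(regs["eax"], 0x4)
print(f"{fault_addr:08x}")  # 00000004
```

The result matches the "virtual address 00000004" reported at the top of the oops. ksymoops resolves edx to skb_head_pool, so the value being stored is the address of that pool.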


2003-09-08 04:10:48

by Andreas Dilger

Subject: Re: How do I track TG3 peculiarities?

On Sep 07, 2003 16:21 -0600, Eric W. Biederman wrote:
> Below is one good oops we have captured. I would have to check but
> I believe we have updated the tg3 driver in this instance to the
> one that comes with 2.4.23-pre3.
>
> The very puzzling part is that in the crashes I don't see the tg3
> driver at all just the network stack. All module addresses according
> to /proc/ksyms started with at 0xf8, and the tg3 driver was built as
> a module.
>
> I have been having trouble understanding why skb_clone would be called
> to transmit a packet. Any ideas?

Do you have the stack overflow checking enabled? That has been a source
of problems for us. It was especially difficult to reproduce, because it
only happened during a double interrupt.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-09-12 17:46:09

by Eric W. Biederman

Subject: Re: How do I track TG3 peculiarities?

Andreas Dilger writes:

> On Sep 07, 2003 16:21 -0600, Eric W. Biederman wrote:
> > Below is one good oops we have captured. I would have to check but
> > I believe we have updated the tg3 driver in this instance to the
> > one that comes with 2.4.23-pre3.
> >
> > The very puzzling part is that in the crashes I don't see the tg3
> > driver at all just the network stack. All module addresses according
> > to /proc/ksyms started with at 0xf8, and the tg3 driver was built as
> > a module.
> >
> > I have been having trouble understanding why skb_clone would be called
> > to transmit a packet. Any ideas?
>
> Do you have the stack overflow checking enabled?

No.

> That has been a source
> of problems for us. It was especially difficult to reproduce, because it
> only happened during a double interrupt.


So a quick update on what I am seeing.

I have now tried a myriad of driver and kernel versions, watching very
carefully for a pattern.

What I have observed is memory corruption in what looks like it may
be a confined area of memory. The ECC SDRAM is being closely
monitored and I am not getting so much as a correctable error, much
less an uncorrectable error, which would show up, so bad memory is ruled
out. The corruption shows up when I am connected to a particular
Extreme Networks gigabit switch. (The switch has some problems, and it
is hypothesized that the switch is injecting problematic packets into
the network.)

Bad packets should not be a problem, but it looks like they are
triggering the problem, whatever it is.

Eric

2003-09-18 09:36:42

by Eric W. Biederman

Subject: Re: How do I track TG3 peculiarities?

And just to wrap this up so no one is left with a weird feeling.

It turns out there was an Opteron rev B3 erratum that did not have a
workaround in LinuxBIOS. I have talked with AMD and gotten the
information I need to work around the erratum.

And to everyone who gave me hints: thank you.

Eric