2003-02-14 20:30:23

by James Bourne

Subject: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

Hi,
Since sometime in December two systems we have on site using P4 HT (one
Dell 2650 and one Dell 4600, both dual CPU, both ht/mce capable) have been
locking up without any kernel output and without sysrq keys working (the
keyboard is locked solid). I've dropped the 4600 back to 2.4.19 but the
2650, not yet in production, is still running 2.4.20 to troubleshoot the
problem...

Using nmi_watchdog I've managed to get a stack trace and ran ksymoops over
it (attached). Also attached is the .config file used to build the kernel.
The lockup is reproducible, although this is the first time I've managed
to get any feedback from the kernel on the problem. 2.4.19 with the
same patches, but without tg3, does not lock up...

Thanks in advance for any help that can be given.

Here's more information about the system the oops was captured on:

(kernel compiler)
bash# gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)

(Additional patches)
(at http://www.hardrock.org/kernel/2.4.20)
linux-2.4.20-mrc-base.patch: big UID quotas
linux-2.4.20-VFS-lock patch: VFS lock patch for ext3 and lvm
linux-2.4.20-ext3.patch: Andrew Morton's ext3 patches for 2.4.20
irqbalance-2.4.20-MRC.patch: IRQ load balancing patch for the P4 ServerWorks
(Ingo Molnar <[email protected]>) brought forward from 2.4.17

(lspci output)
00:00.0 Host bridge: ServerWorks: Unknown device 0012 (rev 13)
00:00.1 Host bridge: ServerWorks: Unknown device 0012
00:00.2 Host bridge: ServerWorks: Unknown device 0000
00:04.0 Class ff00: Dell Computer Corporation Embedded Systems Management Device 4
00:04.1 Class ff00: Dell Computer Corporation PowerEdge Expandable RAID Controller 3/Di
00:04.2 Class 0c07: Dell Computer Corporation: Unknown device 000d
00:0e.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 Host bridge: ServerWorks CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 USB Controller (rev 05)
00:0f.3 ISA bridge: ServerWorks GCHE CSB5 South Bridge
00:10.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:10.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
03:06.0 Ethernet controller: BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)
03:08.0 Ethernet controller: BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)
04:08.0 PCI bridge: Intel Corp. 80960RP [i960 RP Microprocessor/Bridge] (rev 01)
04:08.1 RAID bus controller: Dell Computer Corporation PowerEdge Expandable RAID Controller 3/Di (rev 01)
05:06.0 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
05:06.1 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)


(contents of /proc/cpuinfo)
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) XEON(TM) CPU 1.80GHz
stepping : 4
cpu MHz : 1794.248
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3578.26

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) XEON(TM) CPU 1.80GHz
stepping : 4
cpu MHz : 1794.248
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3578.26

processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) XEON(TM) CPU 1.80GHz
stepping : 4
cpu MHz : 1794.248
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3578.26

processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) XEON(TM) CPU 1.80GHz
stepping : 4
cpu MHz : 1794.248
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 3578.26


Regards
James Bourne



--
James Bourne, Supervisor Data Centre Operations
Mount Royal College, Calgary, AB, CA
http://www.mtroyal.ab.ca

******************************************************************************
This communication is intended for the use of the recipient to which it is
addressed, and may contain confidential, personal, and or privileged
information. Please contact the sender immediately if you are not the
intended recipient of this communication, and do not copy, distribute, or
take action relying on it. Any communication received in error, or
subsequent reply, should be deleted or destroyed.
******************************************************************************


"There are only 10 types of people in this world: those who
understand binary and those who don't."


Attachments:
stackdump.txt (958.00 B)
stack dump
output-ksymoops.txt (2.69 kB)
ksymoops output
config-dell2650-4600-2.4.20-2 (19.50 kB)
.config file

2003-02-14 20:54:00

by Pete Zaitcev

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

> Since sometime in December two systems we have on site using P4 HT (one
> Dell 2650 and one Dell 4600, both dual CPU, both ht/mce capable) have been
> locking up without any kernel output and without sysrq keys working (the
> keyboard is locked solid).
>[...]
> Using nmi_watchdog I've managed to get a stack trace and ran ksymoops over
> it (attached).

Good report. To tell the truth, I know that this lockup exists,
there's an RH issue-tracker item against me on this.
It is different from the old "porkchop" lockup, which DaveM and
Jeff Garzik fixed.

The stumbling block is that NMI oopser catches a thread which
gets stuck because of the lock, but this does not explain
how the lock was taken.

I think the best resolution would be an instrumentation patch
which records lock takers, and prints them when the thing is
forcefully oopsed. I should come up with it eventually, if someone
does not beat me to it (I wish they would, actually :-)
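
As a rough illustration of the kind of instrumentation described above, here is a minimal user-space sketch in C. It is not a kernel patch, and the names traced_lock, traced_trylock, traced_unlock and traced_lock_report are hypothetical. The only point is that each successful acquisition records who took the lock, so that a forced oops could print the taker.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space model of a lock that remembers its taker.
 * None of these names exist in the kernel; they are for illustration. */
struct traced_lock {
    atomic_flag held;     /* the lock itself */
    const char *taker;    /* function that took the lock, or NULL */
    int taker_cpu;        /* CPU number of the taker, or -1 */
};

void traced_lock_init(struct traced_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);
    l->taker = NULL;
    l->taker_cpu = -1;
}

/* Try once to take the lock; on success, record who took it. */
bool traced_trylock(struct traced_lock *l, const char *who, int cpu)
{
    assert(l != NULL && who != NULL);
    if (atomic_flag_test_and_set(&l->held))
        return false;             /* already held by someone else */
    l->taker = who;
    l->taker_cpu = cpu;
    return true;
}

/* Forget the taker, then release the lock. */
void traced_unlock(struct traced_lock *l)
{
    assert(l != NULL);
    l->taker = NULL;
    l->taker_cpu = -1;
    atomic_flag_clear(&l->held);
}

/* Format the current taker into buf, as a forced-oops path might print it.
 * Reading the fields without holding the lock is racy by design: the
 * reporter runs when the system is already stuck. Returns snprintf's
 * result. */
int traced_lock_report(const struct traced_lock *l, char *buf, size_t len)
{
    assert(l != NULL && buf != NULL);
    if (l->taker == NULL)
        return snprintf(buf, len, "lock is free");
    return snprintf(buf, len, "lock held by %s on CPU %d",
                    l->taker, l->taker_cpu);
}
```

For example, once traced_trylock(&l, "tg3_poll", 2) has succeeded, a later traced_trylock(&l, "tg3_timer", 6) fails, and traced_lock_report names tg3_poll on CPU 2 as the holder. That is exactly the piece of information the NMI backtrace alone does not give.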

-- Pete

2003-02-18 02:54:05

by David Miller

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

From: James Bourne <[email protected]>
Date: Fri, 14 Feb 2003 13:39:24 -0700 (MST)

This communication is intended for the use of the recipient to which it is
addressed, and may contain confidential, personal, and or privileged
information. Please contact the sender immediately if you are not the
intended recipient of this communication, and do not copy, distribute, or
take action relying on it. Any communication received in error, or
subsequent reply, should be deleted or destroyed.

I do not read emails which have disclaimers attached to them
like this one.

2003-02-18 04:11:58

by James Bourne

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

On Mon, 17 Feb 2003, David S. Miller wrote:

> From: James Bourne <[email protected]>
> Date: Fri, 14 Feb 2003 13:39:24 -0700 (MST)
...
> take action relying on it. Any communication received in error, or
> subsequent reply, should be deleted or destroyed.
>
> I do not read emails which have disclaimers attached to them
> like this one.
>

Dave,
It's a disclaimer that I don't have a choice in having attached to my
email. If you actually read it, it only says that if you don't think it's
intended for you then contact the sender and delete the message. The PC
police at the institution I work for don't want to have a lawsuit because
of someone's mistake (lawyers).

Regards,
James Bourne

--
James Bourne, Supervisor Data Centre Operations
Mount Royal College, Calgary, AB, CA
http://www.mtroyal.ab.ca


2003-02-18 05:17:39

by David Miller

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

From: James Bourne <[email protected]>
Date: Mon, 17 Feb 2003 21:21:53 -0700 (MST)

It's a disclaimer that I don't have a choice in having attached to
my email.

Just because you have to do it doesn't mean anything to me.

That disclaimer is a liability for me, because it means that
I could get into trouble if I put your bug report into a database (for
example) and then forward that posting to other people privately in
order to work on the bug.

2003-02-18 05:54:26

by James Bourne

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

On Mon, 17 Feb 2003, David S. Miller wrote:

> Just because you have to do it doesn't mean anything to me.
>
> That disclaimer is a liability for me, because it means that
> I could get into trouble if I put your bug report into a database (for
> example) and then forward that posting to other people privately in
> order to work on the bug.

This I understand, thanks for the explanation. I will also pass it on to
others at our institution, as it becomes a problem if I want to send a
bug report to you... Of course, I could always just use my home email, I
guess...

Regards
James

--
James Bourne, Supervisor Data Centre Operations
Mount Royal College, Calgary, AB, CA
http://www.mtroyal.ab.ca


2003-02-26 06:49:59

by Rhodes, Tom

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

>> Since sometime in December two systems we have on site using P4 HT (one
>> Dell 2650 and one Dell 4600, both dual CPU, both ht/mce capable) have been
>> locking up without any kernel output and without sysrq keys working (the
>> keyboard is locked solid).
>>[...]
>> Using nmi_watchdog I've managed to get a stack trace and ran ksymoops over
>> it (attached).


> Good report. To tell the truth, I know that this lockup exists,
> there's an RH issue-tracker item against me on this.

Several of us at HP have been chasing this problem as well. Here is why
there is a deadlock: deliver_to_old_ones() attempts to stop all timers
from running and then blocks until all the timers are no longer running.
This code is called from netif_receive_skb, which is called from tg3_poll
while it is holding a lock in the tg3 driver. On another CPU, the
tg3_timer routine is run but is blocked by the lock held in the tg3_poll
routine. The tg3_timer routine never finishes because it can't acquire
the lock being held by tg3_poll on another CPU. That in turn prevents
deliver_to_old_ones from making progress, because a timer routine is
still executing.

Here is the call stack of the deadlocked CPUs on a RH8.0 system with a
2.4.18-24.8.0 smp kernel:
CPU 2:
deliver_to_old_ones+45
netif_receive_skb
tg3_rx+27b
tg3_poll+81
net_rx_action
do_softirq
do_IRQ
call_do_IRQ

CPU 6:
tg3_timer (tg3+9fc4)
run_timer_list+0x112
bh_action+55
tasklet_hi_action+67
do_softirq+d9
do_IRQ
call_do_IRQ+5
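
The circular wait in the two stacks above can be written down as a small wait-for graph. The following C sketch is a hypothetical user-space model, not kernel code: the type wait_graph, the two resource names, and the helper functions are assumptions made for illustration. It records only which CPU holds which resource and which resource each CPU is spinning on, then checks whether a CPU sits on a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two call stacks above; not kernel code. */
enum { NO_CPU = -1, NO_RES = -1, MAX_CPUS = 8 };

enum resource {
    RES_TG3_LOCK,   /* the driver lock held in tg3_poll */
    RES_TIMER_BH,   /* "a timer routine is currently running" */
    NUM_RES
};

struct wait_graph {
    int owner[NUM_RES];       /* CPU holding each resource, or NO_CPU */
    int waits_for[MAX_CPUS];  /* resource each CPU spins on, or NO_RES */
};

void wait_graph_init(struct wait_graph *g)
{
    for (int r = 0; r < NUM_RES; r++)
        g->owner[r] = NO_CPU;
    for (int c = 0; c < MAX_CPUS; c++)
        g->waits_for[c] = NO_RES;
}

/* Follow cpu -> resource it waits for -> owner of that resource.
 * Returns true only if the chain leads back to the starting CPU. */
bool cpu_in_wait_cycle(const struct wait_graph *g, int cpu)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    int cur = cpu;
    for (int step = 0; step < MAX_CPUS; step++) {
        int res = g->waits_for[cur];
        if (res == NO_RES)
            return false;         /* cur is not blocked */
        int next = g->owner[res];
        if (next == NO_CPU)
            return false;         /* resource is free */
        if (next == cpu)
            return true;          /* back at the start: a cycle */
        cur = next;
    }
    return false;                 /* chain never returned to cpu */
}

/* The situation described in the message, using its CPU numbers. */
struct wait_graph tg3_deadlock_scenario(void)
{
    struct wait_graph g;
    wait_graph_init(&g);
    g.owner[RES_TG3_LOCK] = 2;      /* tg3_poll on CPU 2 holds the lock */
    g.waits_for[2] = RES_TIMER_BH;  /* deliver_to_old_ones waits for timers */
    g.owner[RES_TIMER_BH] = 6;      /* tg3_timer is running on CPU 6 */
    g.waits_for[6] = RES_TG3_LOCK;  /* ...and spins on the driver lock */
    return g;
}
```

In this model both CPU 2 and CPU 6 are reported as being on a cycle, and clearing either edge (for instance, tg3_poll not holding the driver lock while it delivers the packet) breaks it. Whether that is how the real fix works is not stated in this message.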

Thanks
Tom Rhodes
[email protected]

2003-02-26 12:45:34

by Denis Vlasenko

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

On 26 February 2003 09:00, Rhodes, Tom wrote:
> >> Since sometime in December two systems we have on site using P4 HT (one
> >> Dell 2650 and one Dell 4600, both dual CPU, both ht/mce capable) have been
> >> locking up without any kernel output and without sysrq keys working (the
> >> keyboard is locked solid).
> >>[...]
> >> Using nmi_watchdog I've managed to get a stack trace and ran ksymoops over
> >> it (attached).
> >
> > Good report. To tell the truth, I know that this lockup exists,
> > there's an RH issue-tracker item against me on this.
>
> Several of us at HP have been chasing this problem as well. Here is
> why there is a deadlock: deliver_to_old_ones() attempts to stop all
> timers from running and then blocks until all the timers are no
> longer running.

Here, linux/interrupt.h:

static inline void tasklet_unlock_wait(struct tasklet_struct *t)
{
        while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); }
}

Yes, this can run forever.
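
To make the "runs forever" point concrete, here is a hedged user-space sketch of the same polling loop with an upper bound on the number of spins. Everything in it is hypothetical: fake_tasklet, FAKE_STATE_RUN and fake_unlock_wait_bounded are stand-ins, not kernel names, and the bound exists only to show that the loop has no exit of its own while the RUN bit stays set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the TASKLET_STATE_RUN bit. */
#define FAKE_STATE_RUN 0x2u

struct fake_tasklet {
    atomic_uint state;
};

/* Same shape as tasklet_unlock_wait(), but gives up after max_spins
 * polls. Returns true if the RUN bit was observed clear, false if the
 * bit was still set when the bound was reached. */
bool fake_unlock_wait_bounded(struct fake_tasklet *t, unsigned long max_spins)
{
    assert(t != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        if (!(atomic_load(&t->state) & FAKE_STATE_RUN))
            return true;
    }
    return false;   /* the unbounded original would still be spinning */
}
```

If the routine that owns the RUN bit can never finish, as with tg3_timer waiting for the driver lock, the bounded version returns false, where the original simply never returns.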

> This code is called from netif_receive_skb which is
> called from tg3_poll while it is holding a lock in the tg3 driver. On
> another CPU, the tg3_timer routine is run but is blocked by the lock
> held in the tg3_poll routine. The tg3_timer routine never finishes
> because it can't acquire the lock being held by tg3_poll on another
> CPU. That prevents deliver_to_old_ones from executing because there
> is still a timer routine executing.
>
> Here is the call stack of the deadlocked CPUs on a RH8.0 system with
> a 2.4.18-24.8.0 smp kernel:
> CPU 2:
> deliver_to_old_ones+45
> netif_receive_skb
> tg3_rx+27b
> tg3_poll+81
> net_rx_action
> do_softirq
> do_IRQ
> call_do_IRQ
>
> CPU 6:
> tg3_timer (tg3+9fc4)
> run_timer_list+0x112
> bh_action+55
> tasklet_hi_action+67
> do_softirq+d9
> do_IRQ
> call_do_IRQ+5
--
vda

2003-02-26 15:20:22

by Rhodes, Tom

Subject: RE: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

> >> Since sometime in December two systems we have on site using P4 HT (one
> >> Dell 2650 and one Dell 4600, both dual CPU, both ht/mce capable) have been
> >> locking up without any kernel output and without sysrq keys working (the
> >> keyboard is locked solid).
> >>[...]
> >> Using nmi_watchdog I've managed to get a stack trace and ran ksymoops over
> >> it (attached).
>
>
> > Good report. To tell the truth, I know that this lockup exists,
> > there's an RH issue-tracker item against me on this.
>
> Several of us at HP have been chasing this problem as well. Here is why
> there is a deadlock: deliver_to_old_ones() attempts to stop all timers
> from running and then blocks until all the timers are no longer running.
> This code is called from netif_receive_skb which is called from tg3_poll
> while it is holding a lock in the tg3 driver. On another CPU, the
> tg3_timer routine is run but is blocked by the lock held in the tg3_poll
> routine. The tg3_timer routine never finishes because it can't acquire
> the lock being held by tg3_poll on another CPU. That prevents
> deliver_to_old_ones from executing because there is still a timer
> routine executing.
>
> Here is the call stack of the deadlocked CPUs on a RH8.0 system with a
> 2.4.18-24.8.0 smp kernel:
> CPU 2:
> deliver_to_old_ones+45
> netif_receive_skb
> tg3_rx+27b
> tg3_poll+81
> net_rx_action
> do_softirq
> do_IRQ
> call_do_IRQ
>
> CPU 6:
> tg3_timer (tg3+9fc4)
> run_timer_list+0x112
> bh_action+55
> tasklet_hi_action+67
> do_softirq+d9
> do_IRQ
> call_do_IRQ+5
>
Until this is fixed, there are a couple of workarounds:
- Put the system behind a NAT firewall. This problem is triggered by
another protocol on the wire, probably IPX (NetWare).
- Switch to a non-NAPI driver. Since you're using the Broadcom NIC, try
the BCM5700 driver or switch to a different NIC.

While troubleshooting this problem here at HP, I have a couple of
systems that have run for over 2 weeks by using the above work-arounds.

Tom Rhodes
[email protected]

2003-02-26 15:53:38

by Jeff Garzik

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

Rhodes, Tom wrote:
>>Here is the call stack of the deadlocked CPUs on a RH8.0 system with a
>>2.4.18-24.8.0 smp kernel:
>>CPU 2:
>>deliver_to_old_ones+45
>>netif_receive_skb
>>tg3_rx+27b
>>tg3_poll+81
>>net_rx_action
>>do_softirq
>>do_IRQ
>>call_do_IRQ
>>
>>CPU 6:
>>tg3_timer (tg3+9fc4)
>>run_timer_list+0x112
>>bh_action+55
>>tasklet_hi_action+67
>>do_softirq+d9
>>do_IRQ
>>call_do_IRQ+5
>>
>
> Until this is fixed, there are a couple work-arounds:

It's fixed in tg3 version 1.4c, which is attached. James Bourne also
put up tg3 2.4.20 patches containing the needed fixes, at
http://www.hardrock.org/kernel/2.4.20

Jeff



Attachments:
tg3.tar.bz2 (62.00 kB)

2003-02-26 19:46:20

by Pete Zaitcev

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

> Date: Wed, 26 Feb 2003 09:29:56 -0600
> From: "Rhodes, Tom" <[email protected]>

> > there is a deadlock: deliver_to_old_ones()attempts to stop all timers

> Until this is fixed, there are a couple work-arounds:
> - Put the system behind a NAT firewall. [...]
> - Switch to a non-NAPI driver. [...]

I think you are a little late with these workarounds; tg3 1.4c
has been available for a week already.
http://people.redhat.com/jgarzik/tg3/tg3-1.4c/

People using ancient non-NAPI kernels had better upgrade to 2.4.20.
If they really, really, really cannot upgrade (and this only
happens when they are stuck with binary module crap), they
should say "I will quit this job tomorrow" three times,
then use the RH AS2.1 driver, only make sure to get tg3 1.2e3
from drivers/addon/tg3 in 2.4.9-e.12.

-- Pete

2003-02-26 20:44:13

by Mikael Pettersson

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

Pete Zaitcev writes:
> > Date: Wed, 26 Feb 2003 09:29:56 -0600
> > From: "Rhodes, Tom" <[email protected]>
>
> > > there is a deadlock: deliver_to_old_ones()attempts to stop all timers
>
> > Until this is fixed, there are a couple work-arounds:
> > - Put the system behind a NAT firewall. [...]
> > - Switch to a non-NAPI driver. [...]
>
> I think you are a little late with these workarounds, tg3 1.4c
> is available for a week already.
> http://people.redhat.com/jgarzik/tg3/tg3-1.4c/
>
> People using ancient non-NAPI kernels better upgrade to 2.4.20.
> If they really, really, really cannot upgrade (and this only
> happens when they are stuck with binary module crap), they
> should say "I will quit this job tomorrow" three times,
> then use RH AS2.1 driver, only make sure to get tg3 1.2e3
> from drivers/addon/tg3 in 2.4.9-e.12.

Which non-AS RH8.0 kernel should one use to not get hangs from tg3?
I had to downgrade our Dell PE2650 from 2.4.18-24 to 2.4.18-17 since
2.4.18-24 and -19 caused bi-weekly hangs, while -17 and earlier never
had any problems. -18 seemed Ok, but the box didn't run it very long.

This is a compile server with very light network load: a couple of
ssh sessions, NFS-mounted home dirs, ntpd, nothing else.

/Mikael

2003-03-07 23:17:36

by Magnus Naeslund(f)

Subject: Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)

Jeff Garzik <[email protected]> wrote:
> Rhodes, Tom wrote:
[snip]
>>
>> Until this is fixed, there are a couple work-arounds:
>
> It's fixed in tg3 version 1.4c, which is attached. James Bourne also
> put up tg3 2.4.20 patches containing the needed fixes, at
> http://www.hardrock.org/kernel/2.4.20
>
> Jeff

I got a hang/oops with the tg3 driver as well.
I just wanted to make sure that 1.4c+ fixes this bug too:

ftp://ftp.fbab.net/pub/mag/tg3_oops_small.jpg

I've still got it in KDB if you want to find anything out from that
(tell me what to do) ...
Before this hang I had an e1000 card that also stopped working after a
while, so I switched to tg3, and of course I got a hang there too :)

Regards

Magnus