2003-09-01 19:32:40

by Norbert Preining

Subject: 2.4, b44 transmit timeout

Hi!

I have the following problem since I switched from bcm4400 to the
`in-kernel' driver b44:
Sep 1 17:37:11 gandalf vmunix: b44: eth0: Link is up at 10 Mbps, half duplex.
Sep 1 17:37:11 gandalf vmunix: b44: eth0: Flow control is off for TX and off for RX.
Sep 1 17:37:16 gandalf vmunix: NETDEV WATCHDOG: eth0: transmit timed out
Sep 1 17:37:16 gandalf vmunix: b44: eth0: transmit timed out, resetting
Sep 1 17:37:17 gandalf vmunix: b44: eth0: Link is down.
Sep 1 17:37:20 gandalf vmunix: b44: eth0: Link is up at 10 Mbps, half duplex.
Sep 1 17:37:20 gandalf vmunix: b44: eth0: Flow control is off for TX and off for RX.

and so on. This didn't (and still does not) happen with the bcm4400
(from debian sid bcm4400-source).

I compiled the kernel myself on debian/sid, it is a laptop (acer tm654).

If you need more information, I will provide whatever I can.

Best wishes

Norbert

-------------------------------------------------------------------------------
Norbert Preining <preining AT logic DOT at> Technische Universität Wien
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
TIMBLE (vb.)
(Of small nasty children.) To fall over very gently, look around to
see who's about, and then yell blue murder.
--- Douglas Adams, The Meaning of Liff


2003-09-01 20:00:36

by Pekka Pietikäinen

Subject: Re: 2.4, b44 transmit timeout

On Mon, Sep 01, 2003 at 09:32:35PM +0200, Norbert Preining wrote:
> Hi!
>
> I have the following problem since I switched from bcm4400 to the
> `in-kernel' driver b44:
> Sep 1 17:37:11 gandalf vmunix: b44: eth0: Link is up at 10 Mbps, half duplex.
> Sep 1 17:37:11 gandalf vmunix: b44: eth0: Flow control is off for TX and off for RX.
> Sep 1 17:37:16 gandalf vmunix: NETDEV WATCHDOG: eth0: transmit timed out
> Sep 1 17:37:16 gandalf vmunix: b44: eth0: transmit timed out, resetting
> Sep 1 17:37:17 gandalf vmunix: b44: eth0: Link is down.
> Sep 1 17:37:20 gandalf vmunix: b44: eth0: Link is up at 10 Mbps, half duplex.
> Sep 1 17:37:20 gandalf vmunix: b44: eth0: Flow control is off for TX and off for RX.
>
> and so on. This didn't (and still does not) happen with the bcm4400
> (from debian sid bcm4400-source).
>
> I compiled the kernel myself on debian/sid, it is a laptop (acer tm654).
>
> If you need more information, I will provide whatever I can.
Argh, I had hoped the driver was BugFree (tm)!

Could you try adding some debug printk's to b44_tx and b44_start_xmit?

Say, something like:

in b44_tx()
    for (cons = bp->tx_cons; cons != cur; cons = NEXT_TX(cons)) {
        struct ring_info *rp = &bp->tx_buffers[cons];
        struct sk_buff *skb = rp->skb;
+       printk(KERN_DEBUG "b44_tx cons: %d cur: %d skb: %p\n",cons,cur,skb);


and in b44_start_xmit something like:

    bp->tx_ring[entry].addr = cpu_to_le32((u32) mapping+bp->dma_offset);
+   printk(KERN_DEBUG "b44_start_xmit ctrl: %x addr: %p entry: %d skb: %p",bp->tx_ring[entry].ctrl,bp->tx_ring[entry].addr,entry,skb);

(that should hopefully be enough to see what's happening)

Also, setting B44_FLAG_BUGGY_TXPTR and B44_FLAG_REORDER_BUG in
b44_get_invariants() might be worth a shot. I'm not sure where those
flags came from; my A7V8X works fine without them, and bcm4400 doesn't have
that stuff either.
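
For readers following the two printk's, here is a minimal user-space sketch of the producer/consumer ring that such a loop walks. It is not the b44 code: the ring size, the struct, the function names and the "keep one slot free" rule are assumptions made for illustration, and only the NEXT_TX-style wrap-around and the shape of the reclaim loop mirror the snippet above.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed ring size for illustration only; it must be a power of two. */
#define TX_RING_SIZE 8u
/* Advance a ring index with wrap-around, in the style of NEXT_TX(). */
#define NEXT_TX(n) (((n) + 1u) & (TX_RING_SIZE - 1u))

/* Hypothetical stand-in for the driver's TX bookkeeping. */
struct tx_ring {
    unsigned prod;  /* next slot the transmit path will fill */
    unsigned cons;  /* oldest slot that has not been reclaimed yet */
};

/* Queue one packet. Returns the slot used, or -1 if the ring is full.
 * One slot is kept free so that "full" and "empty" can be told apart. */
static int ring_start_xmit(struct tx_ring *r)
{
    unsigned entry = r->prod;

    if (NEXT_TX(entry) == r->cons)
        return -1;
    r->prod = NEXT_TX(entry);
    return (int)entry;
}

/* Reclaim every slot from cons up to, but not including, cur, where cur
 * is the position the hardware reports. Returns the number reclaimed. */
static unsigned ring_tx_complete(struct tx_ring *r, unsigned cur)
{
    unsigned cons, n = 0;

    assert(cur < TX_RING_SIZE);
    for (cons = r->cons; cons != cur; cons = NEXT_TX(cons)) {
        printf("tx cons: %u cur: %u\n", cons, cur);
        n++;
    }
    r->cons = cur;
    return n;
}
```

In this model a healthy interface shows each queued entry followed by a matching reclaim; if the completion side never runs, prod keeps advancing until the ring fills and nothing more can be sent.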

2003-09-02 07:34:36

by Norbert Preining

Subject: Re: 2.4, b44 transmit timeout

Hi!

On Mon, 01 Sep 2003, Pekka Pietikainen wrote:
> Argh, I had hoped the driver was BugFree (tm)!

grin -- nothing is bug free.

> + printk(KERN_DEBUG "b44_tx cons: %d cur: %d skb: %p\n",cons,cur,skb);
>
> + printk(KERN_DEBUG "b44_start_xmit ctrl: %x addr: %p entry: %d skb: %p",bp->tx_ring[entry].ctrl,bp->tx_ring[entry].addr,entry,skb);
>
> Also setting B44_FLAG_BUGGY_TXPTR and B44_FLAG_REORDER_BUG in

I have done all of this and here is a syslog excerpt with comments:

* Ok, loading b44 modules
Sep 2 09:20:08 gandalf vmunix: b44.c:v0.9 (Jul 14, 2003)
Sep 2 09:20:08 gandalf vmunix: eth0: Broadcom 4400 10/100BaseT Ethernet 00:c0:9f:1f:59:38

* Configuring by hand with ifconfig eth0 192.168.1.13
Sep 2 09:20:16 gandalf vmunix: b44: eth0: Link is down.
Sep 2 09:20:18 gandalf vmunix: b44: eth0: Link is up at 100 Mbps, full duplex.
Sep 2 09:20:18 gandalf vmunix: b44: eth0: Flow control is on for TX and on for RX.

* pinging the gateway (.1.1), working
Sep 2 09:20:22 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5d82 entry: 0 skb: d935b480
Sep 2 09:20:22 gandalf vmunix: b44_tx cons: 0 cur: 1 skb: d935b480
Sep 2 09:20:22 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 1 skb: d935b680
Sep 2 09:20:22 gandalf vmunix: b44_tx cons: 1 cur: 2 skb: d935b680
Sep 2 09:20:23 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 2 skb: d935b480
Sep 2 09:20:23 gandalf vmunix: b44_tx cons: 2 cur: 3 skb: d935b480
Sep 2 09:20:24 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 3 skb: d935b480
Sep 2 09:20:24 gandalf vmunix: b44_tx cons: 3 cur: 4 skb: d935b480
Sep 2 09:20:25 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 4 skb: d935b480
Sep 2 09:20:25 gandalf vmunix: b44_tx cons: 4 cur: 5 skb: d935b480
Sep 2 09:20:26 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 5 skb: d935b480
Sep 2 09:20:26 gandalf vmunix: b44_tx cons: 5 cur: 6 skb: d935b480
Sep 2 09:20:27 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5082 entry: 6 skb: d935ba80
Sep 2 09:20:27 gandalf vmunix: b44_tx cons: 6 cur: 7 skb: d935ba80
Sep 2 09:20:27 gandalf kernel: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 7 skb: d935ba80
Sep 2 09:20:27 gandalf kernel: b44_tx cons: 7 cur: 8 skb: d935ba80
Sep 2 09:20:28 gandalf kernel: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 8 skb: d935b480
Sep 2 09:20:28 gandalf kernel: b44_tx cons: 8 cur: 9 skb: d935b480
Sep 2 09:20:29 gandalf kernel: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 9 skb: d935b480
Sep 2 09:20:29 gandalf kernel: b44_tx cons: 9 cur: 10 skb: d935b480

* stopping ping, ifconfig eth0 down, starting pump for getting dhcp
* address, no success!
Sep 2 09:20:41 gandalf pumpd[7584]: starting at (uptime 0 days, 2:04:19) Tue Sep 2 09:20:41 2003
Sep 2 09:20:41 gandalf pumpd[7584]: PUMP: sending discover
Sep 2 09:20:44 gandalf vmunix: b44: eth0: Link is up at 100 Mbps, full duplex.
Sep 2 09:20:44 gandalf vmunix: b44: eth0: Flow control is on for TX and on for RX.
Sep 2 09:20:55 gandalf pumpd[7584]: PUMP: sending discover
Sep 2 09:20:58 gandalf vmunix: b44: eth0: Link is up at 100 Mbps, full duplex.
Sep 2 09:20:58 gandalf vmunix: b44: eth0: Flow control is on for TX and on for RX.

* killing pump, ifconfig manually again
Sep 2 09:21:43 gandalf vmunix: b44: eth0: Link is up at 100 Mbps, full duplex.
Sep 2 09:21:43 gandalf vmunix: b44: eth0: Flow control is on for TX and on for RX.

* pinging the gateway, not working this time
Sep 2 09:21:48 gandalf vmunix: NETDEV WATCHDOG: eth0: transmit timed out
Sep 2 09:21:48 gandalf vmunix: b44: eth0: transmit timed out, resetting
Sep 2 09:21:48 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5682 entry: 0 skb: d935c680
Sep 2 09:21:48 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5082 entry: 1 skb: d8de1b80
Sep 2 09:21:49 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5782 entry: 2 skb: d935b480
Sep 2 09:21:49 gandalf vmunix: b44: eth0: Link is down.
Sep 2 09:21:50 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5d82 entry: 3 skb: d935c480
Sep 2 09:21:51 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5882 entry: 4 skb: d8de1e80
Sep 2 09:21:51 gandalf vmunix: b44: eth0: Link is up at 100 Mbps, full duplex.
Sep 2 09:21:51 gandalf vmunix: b44: eth0: Flow control is on for TX and on for RX.
Sep 2 09:21:52 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5982 entry: 5 skb: d935cb80
Sep 2 09:21:53 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5482 entry: 6 skb: d8de1680
Sep 2 09:21:54 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5b82 entry: 7 skb: d935c580
Sep 2 09:21:55 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5e82 entry: 8 skb: d8de1980
Sep 2 09:21:56 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5f982982 entry: 9 skb: d8de1480
Sep 2 09:21:57 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 4863ac82 entry: 10 skb: d935ce80
Sep 2 09:21:58 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 4863ae82 entry: 11 skb: d8de1880
Sep 2 09:21:59 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5a82 entry: 12 skb: d8de1280
Sep 2 09:22:00 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 4863ab82 entry: 13 skb: d8de1780

I can reproduce this.

Anything I can do more? Hope this helps.

Best wishes

Norbert

-------------------------------------------------------------------------------
Norbert Preining <preining AT logic DOT at> Technische Universität Wien
gpg DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
-------------------------------------------------------------------------------
GLOADBY MARWOOD (n.)
Someone who stops John Cleese on the street and demands that he does a
funny walk.
--- Douglas Adams, The Meaning of Liff

2003-09-02 08:31:18

by Pekka Pietikäinen

Subject: Re: 2.4, b44 transmit timeout

On Tue, Sep 02, 2003 at 09:33:53AM +0200, Norbert Preining wrote:
> * pinging the gateway (.1.1), working
> Sep 2 09:20:22 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5d82 entry: 0 skb: d935b480
> Sep 2 09:20:22 gandalf vmunix: b44_tx cons: 0 cur: 1 skb: d935b480
> Sep 2 09:20:22 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 1 skb: d935b680
> Sep 2 09:20:22 gandalf vmunix: b44_tx cons: 1 cur: 2 skb: d935b680
> Sep 2 09:20:23 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 2 skb: d935b480
> Sep 2 09:20:23 gandalf vmunix: b44_tx cons: 2 cur: 3 skb: d935b480
> Sep 2 09:20:24 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 3 skb: d935b480
> Sep 2 09:20:24 gandalf vmunix: b44_tx cons: 3 cur: 4 skb: d935b480
> Sep 2 09:20:25 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 4 skb: d935b480
> Sep 2 09:20:25 gandalf vmunix: b44_tx cons: 4 cur: 5 skb: d935b480
> Sep 2 09:20:26 gandalf vmunix: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 5 skb: d935b480
> Sep 2 09:20:26 gandalf vmunix: b44_tx cons: 5 cur: 6 skb: d935b480
> Sep 2 09:20:27 gandalf vmunix: b44_start_xmit ctrl: e000002a addr: 5cea5082 entry: 6 skb: d935ba80
> Sep 2 09:20:27 gandalf vmunix: b44_tx cons: 6 cur: 7 skb: d935ba80
> Sep 2 09:20:27 gandalf kernel: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 7 skb: d935ba80
> Sep 2 09:20:27 gandalf kernel: b44_tx cons: 7 cur: 8 skb: d935ba80
> Sep 2 09:20:28 gandalf kernel: b44_start_xmit ctrl: e0000062 addr: 5f982682 entry: 8 skb: d935b480
> Sep 2 09:20:28 gandalf kernel: b44_tx cons: 8 cur: 9 skb: d935b480
> Sep 2 09:20:29 gandalf kernel: b44_start_xmit ctrl: e0000062 addr: 5f982982 entry: 9 skb: d935b480
> Sep 2 09:20:29 gandalf kernel: b44_tx cons: 9 cur: 10 skb: d935b480
> * stopping ping, ifconfig eth0 down, starting pump for getting dhcp
> * address, no success!
> Sep 2 09:20:41 gandalf pumpd[7584]: starting at (uptime 0 days, 2:04:19) Tue Sep 2 09:20:41 2003
> Sep 2 09:20:41 gandalf pumpd[7584]: PUMP: sending discover
> Sep 2 09:20:44 gandalf vmunix: b44: eth0: Link is up at 100 Mbps, full duplex.
> Sep 2 09:20:44 gandalf vmunix: b44: eth0: Flow control is on for TX and on for RX.
A-ha! Thanks, I think this should be enough to figure out what the problem
is... Looks like the driver doesn't even get the packets pump tries to send,
pump is a bit special in the way it bounces the interface up and down when
doing its work, that probably triggers a race in b44..
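
A syslog excerpt like this can also be checked mechanically. The sketch below counts how many packets were queued (b44_start_xmit lines) against how many were reclaimed (b44_tx lines) in a set of log lines. The function and struct names are hypothetical, and the matched strings are simply the debug printk prefixes used earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of one log window. */
struct tx_log_stats {
    unsigned queued;     /* lines from the b44_start_xmit debug printk */
    unsigned reclaimed;  /* lines from the b44_tx debug printk */
};

/* Count queued and reclaimed packets in n syslog lines. */
static struct tx_log_stats scan_tx_log(const char *const *lines, size_t n)
{
    struct tx_log_stats s = {0, 0};
    size_t i;

    assert(lines != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (strstr(lines[i], "b44_start_xmit ctrl:"))
            s.queued++;
        else if (strstr(lines[i], "b44_tx cons:"))
            s.reclaimed++;
    }
    return s;
}
```

A window with queued == 0 while a client is sending would suggest the packets never reached the driver; queued > 0 with reclaimed == 0 would suggest the completion path is not running.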

2003-09-08 09:34:51

by Jan De Luyck

Subject: Re: 2.4, b44 transmit timeout

On Tue, Sep 02, 2003 at 09:33:53AM +0200, Norbert Preining wrote:

> A-ha! Thanks, I think this should be enough to figure out what the problem
> is... Looks like the driver doesn't even get the packets pump tries to send,
> pump is a bit special in the way it bounces the interface up and down when
> doing its work, that probably triggers a race in b44..

I can also easily recreate this by calling dhcpcd e.g. when the cable isn't in
the socket yet. If I attach the cable then I see the interface coming up,
going down, and then the NETDEV watchdog message.

Unfortunately this usually means that dhcpcd goes hanging. ifconfig hangs too
if I try to use it, and rebooting must be forced with sysrq and an oops a
alt-sysrq-o for poweroff...

Jan

2003-09-09 22:14:15

by Pekka Pietikäinen

Subject: Re: 2.4, b44 transmit timeout

On Mon, Sep 08, 2003 at 11:34:17AM +0200, Jan De Luyck wrote:
> On Tue, Sep 02, 2003 at 09:33:53AM +0200, Norbert Preining wrote:
>
> > A-ha! Thanks, I think this should be enough to figure out what the problem
> > is... Looks like the driver doesn't even get the packets pump tries to send,
> > pump is a bit special in the way it bounces the interface up and down when
> > doing its work, that probably triggers a race in b44..
>
> I can also easily recreate this by calling dhcpcd e.g. when the cable isn't in
> the socket yet. If I attach the cable then I see the interface coming up,
> going down, and then the NETDEV watchdog message.
>
> Unfortunately this usually means that dhcpcd goes hanging. ifconfig hangs too
> if I try to use it, and rebooting must be forced with sysrq and an oops a
> alt-sysrq-o for poweroff...
Hi

(I was quite far away from civilization for a few days, which is why
I've been quiet on this topic for a while :-) )

Could you try the patch Jeff Garzik posted as a fix for tg3
ifconfig down/up problems? I believe it fixes the same problem, since
it's a generic thing with NAPI.

What I believe should happen after applying this patch is that the
b44 driver shouldn't hang as a result of doing tricks like the ones
mentioned. pump might not work even with it (if I understood correctly,
pump & tg3 don't mix well either, due to pump having some wrong assumptions
about when it can send packets). dhcpcd definitely should work, though.



-- SNIP --
Note that people seeing "ifconfig down ... ifconfig up" problems need to
apply this patch. (to 2.4.23-pre, too)

Jeff



[-- Attachment #2: patch --]
[-- Type: text/plain, Encoding: 7bit, Size: 0.5K --]

diff -Nru a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c Mon Sep 8 18:14:36 2003
+++ b/net/core/dev.c Mon Sep 8 18:14:36 2003
@@ -851,7 +851,11 @@
      * engine, but this requires more changes in devices. */

     smp_mb__after_clear_bit(); /* Commit netif_running(). */
-    netif_poll_disable(dev);
+    while (test_bit(__LINK_STATE_RX_SCHED, &dev->state)) {
+        /* No hurry. */
+        current->state = TASK_INTERRUPTIBLE;
+        schedule_timeout(1);
+    }

-- SNIP --


2003-09-28 16:48:10

by Pekka Pietikäinen

Subject: Re: 2.4, b44 transmit timeout

On Mon, Sep 08, 2003 at 11:34:17AM +0200, Jan De Luyck wrote:
> On Tue, Sep 02, 2003 at 09:33:53AM +0200, Norbert Preining wrote:
>
> > A-ha! Thanks, I think this should be enough to figure out what the problem
> > is... Looks like the driver doesn't even get the packets pump tries to send,
> > pump is a bit special in the way it bounces the interface up and down when
> > doing its work, that probably triggers a race in b44..
>
> I can also easily recreate this by calling dhcpcd e.g. when the cable isn't in
> the socket yet. If I attach the cable then I see the interface coming up,
> going down, and then the NETDEV watchdog message.
> Unfortunately this usually means that dhcpcd goes hanging. ifconfig hangs too
> if I try to use it, and rebooting must be forced with sysrq and an oops a
> alt-sysrq-o for poweroff...
This could help (tm) :-)

--- b44.c.orig 2003-09-28 19:36:48.000000000 +0300
+++ b44.c 2003-09-28 19:37:07.000000000 +0300
@@ -870,6 +870,7 @@

     spin_unlock_irq(&bp->lock);

+    b44_enable_ints(bp);
     netif_wake_queue(dev);
 }

at least I could do some pretty dirty things to the poor chip and it still
seems to recover after this patch.
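
To illustrate why a one-line change like this could matter, here is a toy user-space model, not driver code. It assumes (this is inferred from the patch, not stated in the thread) that the reset done in the timeout path leaves the device's interrupts masked, so that TX completions are never processed unless interrupts are turned back on. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified NIC state. */
struct toy_nic {
    bool ints_enabled;   /* whether the interrupt handler can run */
    unsigned queued;     /* packets handed to the device */
    unsigned reclaimed;  /* completions processed by the handler */
};

/* Model of a transmit-timeout handler: reset the device, then
 * optionally re-enable interrupts (the step the patch adds). */
static void toy_tx_timeout(struct toy_nic *nic, bool reenable_ints)
{
    assert(nic != NULL);
    nic->ints_enabled = false;  /* assumption: reset leaves interrupts masked */
    nic->queued = 0;
    nic->reclaimed = 0;
    if (reenable_ints)
        nic->ints_enabled = true;
}

/* Hand one packet to the device. */
static void toy_xmit(struct toy_nic *nic)
{
    nic->queued++;
}

/* The hardware has sent everything; the completion handler only
 * gets to run if interrupts are enabled. */
static void toy_hw_done(struct toy_nic *nic)
{
    if (nic->ints_enabled)
        nic->reclaimed = nic->queued;
}

/* True if packets are outstanding that nobody will ever reclaim. */
static bool toy_stalled(const struct toy_nic *nic)
{
    return nic->reclaimed < nic->queued;
}
```

Under that assumption, calling toy_tx_timeout(&nic, false) followed by a transmit leaves the model stalled, which would trip the watchdog again, while toy_tx_timeout(&nic, true) lets the completion be processed.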

rmmod still has some problems, occasionally I get a "usage count 2" thing
(currently running the rawhide 2.4.22-1.2061.nptl, I've seen similar with
2.6.0-test5ish too, I suspect ipv6 might be involved, but sometimes I can rmmod
it even with ipv6 loaded so it's a bit random).