2004-01-23 21:43:12

by Petr Sebor

Subject: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out

Hello,

since we upgraded the cabling on our network and transfer speeds
increased a little, we very often hit situations where the Intel
PRO/1000 NICs just stop responding and the network dies for a while.
The local console still works, and there are no error messages other
than these (logged when eth0 comes back to life):

NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex

Then it works OK again, until the next watchdog message.

It happened even before the cabling upgrade, just very rarely.

I have tried kernels 2.6.0, 2.6.1 and 2.6.2-bk1, and it happens all the
time. It just _seems_ to happen more often with the 2.6.2-bk

The machine in question is an Opteron 244 based server (though the
kernel is compiled for 32-bit/Athlon). An SMP kernel makes no
difference; it will eventually happen as well. The server is not
heavily loaded; only a few users can trigger the issue. The board is
MSI KT800 based.

I have tried switching NICs, but it makes no difference. The onboard
integrated TG3 gigabit network controller suffers from the '100% CPU
usage' issue when utilized, so this is unfortunately not an option at
the moment.

Does anyone have a clue what might be wrong here?

Have a nice weekend,
Petr


2004-01-23 22:29:35

by Feldman, Scott

Subject: RE: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out

> since we have upgraded cabling on our network and transfer
> speeds increased a little bit, we are experiencing very often
> situations where the Intel PRO/1000 nics just stop responding
> and network dies for a while. Local console works, there are
> no more error messages other than (when the eth0 comes to a
> life again):
>
> NETDEV WATCHDOG: eth0: transmit timed out
> e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex

Petr, I need you to try something. Get ethtool 1.8
(sf.net/projects/gkernel) and turn off TSO:

# ethtool -K eth0 tso off

If you no longer see NETDEV WATCHDOG's, I have a next step. More on
that later.

-scott

2004-01-24 19:36:15

by Sergey S. Kostyliov

Subject: Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out

Hello Scott, Petr,

On Saturday 24 January 2004 01:28, Feldman, Scott wrote:
> > since we have upgraded cabling on our network and transfer
> > speeds increased a little bit, we are experiencing very often
> > situations where the Intel PRO/1000 nics just stop responding
> > and network dies for a while. Local console works, there are
> > no more error messages other than (when the eth0 comes to a
> > life again):
> >
> > NETDEV WATCHDOG: eth0: transmit timed out
> > e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
>
> Petr, I need you to try something. Get ethtool 1.8
> (sf.net/projects/gkernel) and turn off TSO:
>
> # ethtool -K eth0 tso off
>
> If you no longer see NETDEV WATCHDOG's, I have a next step. More on
> that later.

I have had exactly the same problem with 2.6.{0,1} kernels:
"NETDEV WATCHDOG: eth0: transmit timed out"
where eth0 is:
"03:07.0 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) (rev 01)".
The only difference is that my eth0 is at 100 Mbps Full Duplex.
And yes, in my case this problem was solved by `ethtool -K eth0 tso off`.

>
> -scott

--
Best regards,
Sergey S. Kostyliov
Public PGP key: http://sysadminday.org.ru/rathamahata.asc

2004-01-26 10:34:04

by Petr Sebor

Subject: Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out

Feldman, Scott wrote:

>>since we have upgraded cabling on our network and transfer
>>speeds increased a little bit, we are experiencing very often
>>situations where the Intel PRO/1000 nics just stop responding
>>and network dies for a while. Local console works, there are
>>no more error messages other than (when the eth0 comes to a
>>life again):
>>
>>NETDEV WATCHDOG: eth0: transmit timed out
>>e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
>>
>>
>
>Petr, I need you to try something. Get ethtool 1.8
>(sf.net/projects/gkernel) and turn off TSO:
>
> # ethtool -K eth0 tso off
>
>If you no longer see NETDEV WATCHDOG's, I have a next step. More on
>that later.
>
>-scott
>
>
Scott,

after a weekend and half a working day (with extra torturing of the
network card), the NETDEV WATCHDOGs are no longer barking with TSO
disabled.

Do you want me to do more testing, or will you tell me what _the_ next
step is? :-)

Regards,
Petr

2004-01-27 00:02:15

by Feldman, Scott

Subject: Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out

On Mon, 26 Jan 2004, Petr Sebor wrote:

> after a weekend and half of working day (with extra torturing of the
> network card)
> the NETDEV WATCHDOG's are not barking anymore with the tso's disabled.
>
> Do you want me to do more testing or will you tell me what _the_ next
> step is ? :-)

Petr, sorry for the suspense. Here's a patch against 2.6.2-rc2 that fixes
a race in the Tx path of e1000 that you may be exposing with TSO on. The
race is:

    Tx queue                    Tx clean (interrupt context)

    ...
    if(h/w Q full)          |   clean h/w Q
        ...             <---|   if(s/w Q stopped)
        stop s/w Q          |       wake s/w Q


So let's try this patch with TSO back on.

--- linux-2.6.2-rc2/drivers/net/e1000/e1000.h.orig	2004-01-26 15:38:40.000000000 -0800
+++ linux-2.6.2-rc2/drivers/net/e1000/e1000.h	2004-01-26 15:35:37.000000000 -0800
@@ -192,6 +192,7 @@
 
 	/* TX */
 	struct e1000_desc_ring tx_ring;
+	spinlock_t tx_lock;
 	uint32_t txd_cmd;
 	uint32_t tx_int_delay;
 	uint32_t tx_abs_int_delay;
--- linux-2.6.2-rc2/drivers/net/e1000/e1000_main.c.orig	2004-01-26 15:38:33.000000000 -0800
+++ linux-2.6.2-rc2/drivers/net/e1000/e1000_main.c	2004-01-26 15:33:25.000000000 -0800
@@ -669,6 +669,7 @@
 
 	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->stats_lock);
+	spin_lock_init(&adapter->tx_lock);
 
 	return 0;
 }
@@ -1783,6 +1784,7 @@
 	struct e1000_adapter *adapter = netdev->priv;
 	unsigned int first;
 	unsigned int tx_flags = 0;
+	unsigned long flags;
 	int count;
 
 	if(skb->len <= 0) {
@@ -1790,10 +1792,13 @@
 		return 0;
 	}
 
+	spin_lock_irqsave(&adapter->tx_lock, flags);
+
 	if(adapter->hw.mac_type == e1000_82547) {
 		if(e1000_82547_fifo_workaround(adapter, skb)) {
 			netif_stop_queue(netdev);
 			mod_timer(&adapter->tx_fifo_stall_timer, jiffies);
+			spin_unlock_irqrestore(&adapter->tx_lock, flags);
 			return 1;
 		}
 	}
@@ -1814,11 +1819,14 @@
 		e1000_tx_queue(adapter, count, tx_flags);
 	else {
 		netif_stop_queue(netdev);
+		spin_unlock_irqrestore(&adapter->tx_lock, flags);
 		return 1;
 	}
 
 	netdev->trans_start = jiffies;
 
+	spin_unlock_irqrestore(&adapter->tx_lock, flags);
+
 	return 0;
 }
 
@@ -2171,6 +2179,8 @@
 	unsigned int i, eop;
 	boolean_t cleaned = FALSE;
 
+	spin_lock(&adapter->tx_lock);
+
 	i = tx_ring->next_to_clean;
 	eop = tx_ring->buffer_info[i].next_to_watch;
 	eop_desc = E1000_TX_DESC(*tx_ring, eop);
@@ -2215,6 +2225,8 @@
 	if(cleaned && netif_queue_stopped(netdev) && netif_carrier_ok(netdev))
 		netif_wake_queue(netdev);
 
+	spin_unlock(&adapter->tx_lock);
+
 	return cleaned;
 }
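
To make the race described in this message easier to follow, here is a small deterministic model of it. It is a single-threaded sketch, not driver code: `struct tx_model` and all function names are hypothetical, the ring size of 4 is arbitrary, and the interrupt is simulated by calling `tx_clean()` at a chosen point. The working assumption is that a stopped queue with an empty ring gets no further Tx interrupt to wake it, which would fit the watchdog timeout reported in this thread.

```c
#include <stdbool.h>

/* Toy model of the Tx state; every name here is hypothetical. */
struct tx_model {
	int  ring_used;      /* descriptors currently owned by the hardware */
	int  ring_size;      /* total descriptors in the ring */
	bool queue_stopped;  /* software queue state (stop/wake) */
};

static bool ring_full(const struct tx_model *m)
{
	return m->ring_used >= m->ring_size;
}

/* "Tx clean": reclaim all descriptors, wake the queue if it was stopped. */
static void tx_clean(struct tx_model *m)
{
	m->ring_used = 0;
	if (m->queue_stopped)
		m->queue_stopped = false;
}

/* "Tx queue" as one indivisible step, as it would be under tx_lock. */
static void tx_xmit_atomic(struct tx_model *m)
{
	if (ring_full(m))
		m->queue_stopped = true;
	else
		m->ring_used++;
}

/* Stalled: queue stopped while the ring is empty, so nothing wakes it. */
static bool stalled(const struct tx_model *m)
{
	return m->queue_stopped && m->ring_used == 0;
}

/* Without the lock: the clean runs between the full check and the stop. */
bool unlocked_interleaving_stalls(void)
{
	struct tx_model m = { .ring_used = 4, .ring_size = 4, .queue_stopped = false };
	bool saw_full = ring_full(&m);  /* xmit: sees a full ring */

	tx_clean(&m);                   /* interrupt: cleans, queue not stopped yet */
	if (saw_full)
		m.queue_stopped = true; /* xmit: stops the queue on a stale check */

	return stalled(&m);
}

/* With the lock only two orderings are possible; neither should stall. */
bool locked_orderings_stall(void)
{
	struct tx_model a = { .ring_used = 4, .ring_size = 4, .queue_stopped = false };
	struct tx_model b = { .ring_used = 4, .ring_size = 4, .queue_stopped = false };

	tx_xmit_atomic(&a);  /* xmit first: queue stopped ... */
	tx_clean(&a);        /* ... then the clean wakes it */

	tx_clean(&b);        /* clean first: ring emptied ... */
	tx_xmit_atomic(&b);  /* ... then xmit finds room and queues */

	return stalled(&a) || stalled(&b);
}
```

In this model `unlocked_interleaving_stalls()` returns true and `locked_orderings_stall()` returns false, which is the effect the patch aims for by serializing the transmit path and the Tx clean on `tx_lock`.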


2004-01-27 14:00:19

by Petr Sebor

Subject: Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out

Feldman, Scott wrote:

>Petr, sorry for the suspense. Here's a patch against 2.6.2-rc2 that fixes
>a race in the Tx path of e1000 that you may be exposing with TSO on. The
>race is:
>
>    Tx queue                    Tx clean (interrupt context)
>
>    ...
>    if(h/w Q full)          |   clean h/w Q
>        ...             <---|   if(s/w Q stopped)
>        stop s/w Q          |       wake s/w Q
>
>
>So let's try this patch with TSO back on.
>
>
Scott,

thanks for the patch. Again, 3/4 of a working day with moderate server
load resulted in no WATCHDOG barking with the patched kernel and TSO
turned on. I dare say that this is it! :-)
(A little more testing here won't hurt, though.)

If nothing else, the stability of the e1000 has vastly improved.

Thanks a lot!

Regards,
Petr