2008-08-19 18:10:46

by Marc Haber

Subject: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

Hi,

I have one HP DL 140 G1 running with Debian stable and a
locally-built, vanilla kernel. With 2.6.25.11, everything is fine.

But, after updating to 2.6.26.2, I noticed that pinging the host on
the local ethernet sometimes (e.g. several times a minute, but not
always) results in a round-trip time of more than two seconds. The
packets are not lost, though; they are only severely delayed. For an ssh
session to the host, this feels like somebody rocking a bad network
connector. The same behavior is visible with 2.6.26 and 2.6.26.1.

Going back to 2.6.25.11 immediately fixes the issue for me.

Syslog doesn't say anything conspicuous, unfortunately. As I don't
have local access to the box, I cannot say whether it's only the
network that freezes or whether it's the entire box.

Does it make sense to test any later 2.6.25.x kernel, or is there any
post-2.6.26 patch available that may fix the issue for me?

Strangely, another DL 140 from the same batch runs just fine with
2.6.26.2.

Here is the output of lspci -vvv for the network interface:

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at febc0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febb0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
                Address: 1502da2a94a95200 Data: 3ca4

If there is any information that can help, please say so.

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 3221 2323190


2008-08-19 20:20:55

by David Miller

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

From: Marc Haber <[email protected]>
Date: Tue, 19 Aug 2008 19:20:19 +0200

[ netdev added to CC: ]


2008-08-19 21:30:38

by Michael Chan

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network


On Tue, 2008-08-19 at 13:20 -0700, David Miller wrote:
> From: Marc Haber <[email protected]>
> Date: Tue, 19 Aug 2008 19:20:19 +0200
>
> [ netdev added to CC: ]
>
> > I have one HP DL 140 G1 running with Debian stable and a
> > locally-built, vanilla kernel. With 2.6.25.11, everything is fine.
> >
> > But, after updating to 2.6.26.2, I noticed that pinging the host on
> > the local ethernet sometimes (e.g. several times a minute, but not
> > always) results in a round-trip time of more than two seconds. The
> > packets are not lost, though; they are only severely delayed. For an ssh
> > session to the host, this feels like somebody rocking a bad network
> > connector. The same behavior is visible with 2.6.26 and 2.6.26.1.
> >
> > Going back to 2.6.25.11 immediately fixes the issue for me.
> >
> > Syslog doesn't say anything conspicuous, unfortunately. As I don't
> > have local access to the box, I cannot say whether it's only the
> > network that freezes or whether it's the entire box.
> >
> > Does it make sense to test any later 2.6.25.x kernel, or is there any
> > post-2.6.26 patch available that may fix the issue for me?
> >
> > Strangely, another DL 140 from the same batch runs just fine with
> > 2.6.26.2.
> >

This may be caused by the 2.5 second polling that was mistakenly added
to the code instead of the intended 2.5 msec.
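
To illustrate how a unit mix-up of this kind can turn into multi-second
stalls, here is a minimal userspace sketch in C. It is not the tg3 driver
code: the names wait_for_event_ack, FW_EVENT_TIMEOUT_USEC_BUGGY,
FW_EVENT_TIMEOUT_USEC_FIXED, POLL_STEP_USEC and never_ack are hypothetical,
and the sketch assumes that the driver polls for a firmware acknowledgement
until a timeout given in microseconds expires.

```c
#include <stddef.h>

/* Hypothetical constants for illustration only, not taken from tg3. */
#define FW_EVENT_TIMEOUT_USEC_BUGGY 2500000u /* 2.5 s, the mistaken value */
#define FW_EVENT_TIMEOUT_USEC_FIXED 2500u    /* 2.5 ms, the intended value */
#define POLL_STEP_USEC 10u                   /* assumed delay per poll */

/* Stand-in for "has the firmware serviced the previous event?". */
typedef int (*ack_fn)(void *ctx);

/*
 * Poll until ack() returns nonzero or timeout_usec has elapsed.
 * Time is simulated: the return value is the number of microseconds
 * a real busy-wait loop would have spent waiting.
 */
static unsigned int wait_for_event_ack(ack_fn ack, void *ctx,
                                       unsigned int timeout_usec)
{
    unsigned int waited = 0;

    while (waited < timeout_usec) {
        if (ack(ctx))
            break;
        waited += POLL_STEP_USEC; /* a driver would delay here */
    }
    return waited;
}

/* Worst case: firmware that never acknowledges the event. */
static int never_ack(void *ctx)
{
    (void)ctx;
    return 0;
}
```

Calling wait_for_event_ack(never_ack, NULL, FW_EVENT_TIMEOUT_USEC_BUGGY)
reports a full 2.5 seconds of waiting, which would be consistent with the
ping times of more than 2000 ms reported above. With
FW_EVENT_TIMEOUT_USEC_FIXED the same call gives up after 2.5 ms.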

It was fixed a few days ago in the net-2.6 tree:

tg3: Fix firmware event timeouts

and it should be in Linus' tree very soon. This reminds us to send the
same patch to -stable.

Thanks.

2008-08-19 22:14:54

by Marc Haber

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

On Tue, Aug 19, 2008 at 10:29:22AM -0700, Michael Chan wrote:
> It was fixed a few days ago in the net-2.6 tree:
>
> tg3: Fix firmware event timeouts
>
> and it should be in Linus' tree very soon. This reminds us to send the
> same patch to -stable.

I don't know much about git; can you send me the patch so that I can
try it locally?

I am not even sure that I have a network issue here.

Greetings
Marc


2008-08-19 22:30:39

by Matt Carlson

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

Can you try the attached patch? The patch reduces the delay back to
what it should have been. If this helps, then it means you are being
bitten by the same bug the upstream patch fixed.

On Tue, Aug 19, 2008 at 03:14:39PM -0700, Marc Haber wrote:
> On Tue, Aug 19, 2008 at 10:29:22AM -0700, Michael Chan wrote:
> > It was fixed a few days ago in the net-2.6 tree:
> >
> > tg3: Fix firmware event timeouts
> >
> > and it should be in Linus' tree very soon. This reminds us to send the
> > same patch to -stable.
>
> I don't know much about git; can you send me the patch so that I can
> try it locally?
>
> I am not even sure that I have a network issue here.


Attachments:
timeout.patch (434.00 B)

2008-08-20 17:48:15

by Marc Haber

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

On Tue, Aug 19, 2008 at 03:30:25PM -0700, Matt Carlson wrote:
> Can you try the attached patch? The patch reduces the delay back to
> what it should have been. If this helps, then it means you are being
> bitten by the same bug the upstream patch fixed.

It looks like the issue is fixed now. Thanks for your help.

Will this fix be in 2.6.26.3 and/or 2.6.27?

Now I need to understand why the other, nearly[1] identical box didn't
need the patch to function properly.

Greetings
Marc

[1] only difference is that the working box has two e1000 interfaces
on a PCI card in addition to the two tg3 interfaces on board (all four
of them being in use)


2008-08-20 18:13:06

by Michael Chan

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network


On Wed, 2008-08-20 at 10:47 -0700, Marc Haber wrote:
> On Tue, Aug 19, 2008 at 03:30:25PM -0700, Matt Carlson wrote:
> > Can you try the attached patch? The patch reduces the delay back to
> > what it should have been. If this helps, then it means you are being
> > bitten by the same bug the upstream patch fixed.
>
> It looks like the issue is fixed now. Thanks for your help.
>
> Will this fix be in 2.6.26.3 and/or 2.6.27?

It was just submitted to -stable, so it should appear in 2.6.26.3 or .4.
It will definitely be fixed in 2.6.27.

>
> Now I need to understand why the other, nearly[1] identical box didn't
> need the patch to function properly.

It depends on whether you have ASF enabled or not and what version of
ASF you have. ASF is management firmware running inside the NIC. When
the tg3 driver loads, it will show ASF[1] in dmesg if ASF is enabled.
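
A minimal C sketch of checking a saved dmesg line for that flag follows.
The helper name asf_state is hypothetical, and the assumption that a
disabled ASF shows up as ASF[0] is not stated in this thread; only the
ASF[1] marker is.

```c
#include <string.h>

/*
 * Inspect one log line for the ASF flag.
 * Returns  1 if the line contains "ASF[1]" (ASF enabled),
 *          0 if it contains "ASF[0]" (assumed to mean disabled),
 *         -1 if no ASF flag is present in the line.
 */
static int asf_state(const char *line)
{
    if (strstr(line, "ASF[1]") != NULL)
        return 1;
    if (strstr(line, "ASF[0]") != NULL)
        return 0;
    return -1;
}
```

Feeding each line of the saved dmesg output to asf_state() and looking for
a return value of 1 is enough to tell whether the driver reported ASF as
enabled on a given box.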

>
> Greetings
> Marc
>
> [1] only difference is that the working box has two e1000 interfaces
> on a PCI card in addition to the two tg3 interfaces on board (all four
> of them being in use)

2008-08-21 16:09:45

by Marc Haber

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

On Wed, Aug 20, 2008 at 07:11:50AM -0700, Michael Chan wrote:
> On Wed, 2008-08-20 at 10:47 -0700, Marc Haber wrote:
> > Now I need to understand why the other, nearly[1] identical box didn't
> > need the patch to function properly.
>
> It depends on whether you have ASF enabled or not and what version of
> ASF you have. ASF is management firmware running inside the NIC. When
> the tg3 driver loads, it will show ASF[1] in dmesg if ASF is enabled.

Both machines have that string in their dmesg.

Greetings
Marc


2008-08-21 16:21:47

by Michael Chan

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

Marc Haber wrote:

> On Wed, Aug 20, 2008 at 07:11:50AM -0700, Michael Chan wrote:
> > On Wed, 2008-08-20 at 10:47 -0700, Marc Haber wrote:
> > > Now I need to understand why the other, nearly[1] identical box didn't
> > > need the patch to function properly.
> >
> > It depends on whether you have ASF enabled or not and what version of
> > ASF you have. ASF is management firmware running inside the NIC. When
> > the tg3 driver loads, it will show ASF[1] in dmesg if ASF is enabled.
>
> Both machines have that string in their dmesg.
>

It may be different versions of ASF. Try ethtool -i eth0. It
may tell us the firmware version.

2008-08-22 11:33:45

by Marc Haber

Subject: Re: 2.6.26/tg3 ping roundtrip times > 2000 ms on local network

On Thu, Aug 21, 2008 at 09:21:28AM -0700, Michael Chan wrote:
> Marc Haber wrote:
> > On Wed, Aug 20, 2008 at 07:11:50AM -0700, Michael Chan wrote:
> > > On Wed, 2008-08-20 at 10:47 -0700, Marc Haber wrote:
> > > > Now I need to understand why the other, nearly[1] identical box didn't
> > > > need the patch to function properly.
> > >
> > > It depends on whether you have ASF enabled or not and what version of
> > > ASF you have. ASF is management firmware running inside the NIC. When
> > > the tg3 driver loads, it will show ASF[1] in dmesg if ASF is enabled.
> >
> > Both machines have that string in their dmesg.
> >
>
> It may be different versions of ASF. Try ethtool -i eth0. It
> may tell us the firmware version.

$ sudo ethtool -i eth0
driver: tg3
version: 3.92.1
firmware-version: 5704-v3.26
bus-info: 0000:02:00.0

The other box shows the same output, except that bus-info ends in .1
and the .0 device is unused.
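
For comparing the two boxes without reading the output by eye, a small C
sketch that extracts one field from saved `ethtool -i` output could look
like this. The function name ethtool_field is hypothetical; the code only
assumes the "key: value" line layout shown above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Find the line "key: value" in multi-line text and copy the value
 * (without leading spaces) into out, truncating to outlen - 1 bytes.
 * The key must start at the beginning of a line.
 * Returns 0 on success, -1 if the key is missing or outlen is 0.
 */
static int ethtool_field(const char *text, const char *key,
                         char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *p = text;

    if (outlen == 0)
        return -1;

    while (p != NULL && *p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);

        if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == ':') {
            const char *v = p + klen + 1;
            size_t vlen = len - klen - 1;

            while (vlen > 0 && *v == ' ') { /* skip padding after ':' */
                v++;
                vlen--;
            }
            if (vlen >= outlen)
                vlen = outlen - 1;
            memcpy(out, v, vlen);
            out[vlen] = '\0';
            return 0;
        }
        p = eol ? eol + 1 : NULL;
    }
    return -1;
}
```

Running it over the output from both machines with the key
"firmware-version" and comparing the results with strcmp() shows directly
whether the firmware strings differ.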

Greetings
Marc
