2002-02-12 00:33:31

by Jonathan Woithe

Subject: Tulip oddity under 2.2.19 - possible bug?

Hi all

[ Please CC me any replies since I read the list through a web archive which
sometimes misses messages ]

In recent times I have encountered an oddity while using a tulip-based NIC
under 2.2.19. I am wondering whether it's associated with a known issue or
whether it's indicating a hardware fault of some description.

The kernel is the stock Linus 2.2.19 (as supplied in Slackware 8.0).

The arrangement: there are two machines connected with a crossed UTP cable.
One is the Linux box and the other is an NT4 box. Both have identical NICs
based on the Tulip chip (DEC 21041).

The symptoms: if both PCs are powered up simultaneously everything comes up
fine - communication on the network proceeds normally. If the NT box is for
some reason reset (without the Linux box going down), the network connection
is not re-established successfully once NT comes back up again. Trying to
ping the NT box causes the "tx/rx" LED on the back of the Linux NIC to
blink, but nothing is sent back from the NT box. Interestingly enough,
ifconfig reports some carrier errors in the fault condition. Rebooting the
NT4 box doesn't fix the problem; however, doing
ifconfig eth0 down
ifconfig eth0 up
on the Linux box does fix the problem.
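
As an aside, the carrier error count that ifconfig reports can also be read
straight from /proc/net/dev, which makes it easier to watch for the fault
from a script. The following is only a minimal sketch: it assumes the usual
/proc/net/dev layout, in which the transmit carrier counter is the 15th
number after the interface name, and the function name carrier_errors is
made up for illustration.

```shell
#!/bin/sh
# Print the TX carrier error count for one interface.
# Reads /proc/net/dev-style text on stdin so it can be tried on saved output.
# Assumption: the carrier counter is the 15th number after "iface:".
carrier_errors() {
    # Some kernels print "eth0:1234" with no space after the colon,
    # so turn the colon into a space before splitting into fields.
    sed 's/:/ /' | awk -v ifc="$1" '$1 == ifc { print $16 }'
}

# Usage: carrier_errors eth0 < /proc/net/dev
```

A count that keeps climbing while pings go unanswered would at least give a
script something to key on, given that nothing reaches the logs.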

If the network isn't connected when the Linux box is booted, a similar
problem occurs - once the network cable is connected the Linux box needs the
above ifconfig cycle to be run before the network connection is
successfully established.

In addition to the above situation, I have also experienced similar hangups
in another box. In this case the Linux box has a Lite-On card (LNE100TX)
connected to a 100Mbps repeater; it is running 2.2.19 as per the other box
I've mentioned. On boot everything is fine, but if the link goes down (the
repeater is reset for example), the Linux box loses network connectivity
until the ifconfig cycle is carried out.

During or after the fault condition nothing is written to any log files to
indicate a problem.

Is this a known issue with a workaround available, or is it indicative of a
hardware error?

If any further info is required please drop me an email and I'll try to
obtain the requested details.

Regards
jonathan
--
* Jonathan Woithe [email protected] *
* http://www.physics.adelaide.edu.au/~jwoithe *
***-----------------------------------------------------------------------***
** "Time is an illusion; lunchtime doubly so" **
* "...you wouldn't recognize a subtle plan if it painted itself purple and *
* danced naked on a harpsichord singing 'subtle plans are here again'" *


2002-02-12 15:42:50

by Chris Friesen

Subject: Re: Tulip oddity under 2.2.19 - possible bug?

Jonathan Woithe wrote:
>
> Hi all
>
> [ Please CC me any replies since I read the list through a web archive which
> sometimes misses messages ]
>
> In recent times I have encountered an oddity while using a tulip-based NIC
> under 2.2.19. I am wondering whether it's associated with a known issue or
> whether it's indicating a hardware fault of some description.

Have you tried the latest tulip drivers from http://www.scyld.com/network? I've found
them to be best for the 2.2 series.

That said, there is still a race condition in the current driver that can cause
the tx queue to hang, but it's highly unlikely that you'll hit it at home (you'd
have to be pushing significant traffic in both directions to make it likely).

Chris

--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2002-02-12 21:00:48

by Neale Banks

Subject: Re: Tulip oddity under 2.2.19 - possible bug?

On Tue, 12 Feb 2002, Jonathan Woithe wrote:

> In recent times I have encountered an oddity while using a tulip-based NIC
> under 2.2.19. I am wondering whether it's associated with a known issue or
> whether it's indicating a hardware fault of some description.

Perchance, would this be a NetGear card?

> The kernel is the stock Linus 2.2.19 (as supplied in Slackware 8.0).
>
> The arrangement: there are two machines connected with a crossed UTP cable.
> One is the Linux box and the other is an NT4 box. Both have identical NICs
> based on the Tulip chip (DEC 21041).
>
> The symptoms: if both PCs are powered up simultaneously everything comes up
> fine - communication on the network proceeds normally. If the NT box is for
> some reason reset (without the Linux box going down), the network connection
> is not re-established successfully once NT comes back up again. Trying to
> ping the NT box causes the "tx/rx" LED on the back of the Linux NIC to
> blink, but nothing is sent back from the NT box. Interestingly enough,
> ifconfig reports some carrier errors in the fault condition. Rebooting the
> NT4 box doesn't fix the problem; however, doing
> ifconfig eth0 down
> ifconfig eth0 up
> on the Linux box does fix the problem.

Sounds similar to an old "known" problem. In my case, arp wasn't
succeeding - debugging by Keith Owens found that under some circumstances
sending an arp query resulted in a copy of the outgoing packet appearing
in the Rx ring/buffer (and not the reply, which was sniffed on the wire).
After much poking and sniffing, the conclusion of "bug on the card" was
looking highly likely.

If it's this bug, you should be able to work around it by any of:

(a) as above, downing and upping the interface
(b) switching the interface in and out of promisc mode
(c) pre-setting a static arp entry (on the Linux box).

and possibly:

(d) connecting the Tulip to a switch (instead of "shared" media, like a hub).
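
Workarounds (a) to (c) can be sketched with the standard net-tools commands.
This is only an illustration: the interface name, peer IP and peer MAC below
are placeholders rather than values from this thread, and the run/workaround_*
helper names are made up. By default the script only prints what it would do;
set DRY_RUN=0 (as root) to actually run the commands.

```shell
#!/bin/sh
# Sketch of workarounds (a)-(c). All three values below are hypothetical
# placeholders; substitute the real interface and the NT box's IP and MAC.
IFACE="${IFACE:-eth0}"
PEER_IP="${PEER_IP:-192.168.0.2}"
PEER_MAC="${PEER_MAC:-00:00:00:00:00:01}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# (a) take the interface down and bring it back up
workaround_a() { run ifconfig "$IFACE" down; run ifconfig "$IFACE" up; }

# (b) switch the interface into and back out of promiscuous mode
workaround_b() { run ifconfig "$IFACE" promisc; run ifconfig "$IFACE" -promisc; }

# (c) pre-set a static arp entry for the peer on the Linux box
workaround_c() { run arp -s "$PEER_IP" "$PEER_MAC"; }
```

Calling workaround_c from a boot script after the interface is configured
would be one way to try (c) without manual intervention.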

> If the network isn't connected when the Linux box is booted, a similar
> problem occurs - once the network cable is connected the Linux box needs the
> above ifconfig cycle to be run before the network connection is
> successfully established.

It may simply come down to a race as to which box arps first. E.g. IIRC
'doze configured with a static IP address will (as a sanity check) arp for
its own address on bootup - this may be enough to prime the Linux arp
cache and avoid the problem.

[...]
> During or after the fault condition nothing is written to any log files to
> indicate a problem.

"Undetected errors are handled as if no error had occurred" :-(

Does "arp -n" show an "incomplete" entry for the relevant IP address?
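
That check can be scripted. A minimal sketch, assuming the net-tools arp
prints the literal text "(incomplete)" in place of the hardware address for
an unresolved entry; the function name arp_incomplete and the example
address are made up for illustration.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given IP has an unresolved arp entry.
# Reads "arp -n"-style text on stdin so it can be tried on saved output.
# Assumption: unresolved entries contain the literal text "(incomplete)".
arp_incomplete() {
    awk -v ip="$1" '
        $1 == ip && /\(incomplete\)/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (192.168.0.2 is a placeholder for the peer address):
#   arp -n | arp_incomplete 192.168.0.2 && echo "arp is not resolving"
```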

> Is this a known issue with a workaround available, or is it indicative of a
> hardware error?

Possibly both ;-) I'd be really interested if you come up with any
"clean" fix/workaround.

HTH,
Neale.