2001-02-26 14:34:35

by Manfred Spraul

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

I think I found the bug:

Someone (Jeff?) removed the line

tp->advertising[phy_idx++] = reg4;

from tulip/tulip_core.c

pnic_check_duplex uses that variable :-(

There are 2 workarounds:

* change pnic_check_duplex:
s/tp->advertising[0]/tp->mii_advertise/g

* remove the new mii_advertise variable and replace it with
'tp->advertising[i]'.

Jeff, is it really a good idea to have one global mii_advertise
variable? If someone builds a card with multiple transceivers, then
they'll probably support different media.

--
Manfred


2001-02-26 20:11:25

by Jeff Garzik

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

Manfred Spraul wrote:
>
> I think I found the bug:
>
> Someone (Jeff?) removed the line
>
> tp->advertising[phy_idx++] = reg4;
>
> from tulip/tulip_core.c
>
> pnic_check_duplex uses that variable :-(
>
> There are 2 workarounds:
>
> * change pnic_check_duplex:
> s/tp->advertising[0]/tp->mii_advertise/g
>
> * remove the new mii_advertise variable and replace it with
> 'tp->advertising[i]'.

mii_advertise is what MII is currently advertising on the current
media. tp->advertising is per-phy, on the other hand.

Pat, Manfred, in pnic_check_duplex, make this change:
> - negotiated = mii_reg5 & tp->advertising[0];
> + negotiated = mii_reg5 & tulip_mdio_read(dev, tp->phys[0], 4);

and let me know how it goes. I'm tempted to just remove
tp->advertising[] altogether.

Jeff


--
Jeff Garzik | "You see, in this world there's two kinds of
Building 1024 | people, my friend: Those with loaded guns
MandrakeSoft | and those who dig. You dig." --Blondie
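
For readers following the patch above, here is a minimal sketch of what the "negotiated" value represents, assuming the standard MII layout in which register 4 holds the local advertisement and register 5 the link partner's abilities. This is not code from the tulip driver; the helper names and the MII_ABILITY_* macros are hypothetical.

```c
#include <stdint.h>

/* Standard MII ability bits, shared by register 4 (local advertisement)
 * and register 5 (link partner ability). Macro names are hypothetical. */
#define MII_ABILITY_10HALF   0x0020
#define MII_ABILITY_10FULL   0x0040
#define MII_ABILITY_100HALF  0x0080
#define MII_ABILITY_100FULL  0x0100
#define MII_ABILITY_100BASE4 0x0200

/* Hypothetical helper: the modes both link ends have in common. */
static uint16_t negotiated_modes(uint16_t reg4_advertise, uint16_t reg5_partner)
{
    return reg4_advertise & reg5_partner;
}

/* Hypothetical helper: returns 1 if the highest-priority common mode is
 * full duplex. Priority order assumed here:
 * 100baseTx-FD > 100baseT4 > 100baseTx-HD > 10baseT-FD > 10baseT-HD. */
static int negotiated_full_duplex(uint16_t negotiated)
{
    if (negotiated & MII_ABILITY_100FULL)
        return 1;
    if (negotiated & (MII_ABILITY_100BASE4 | MII_ABILITY_100HALF))
        return 0;
    return (negotiated & MII_ABILITY_10FULL) != 0;
}
```

If the cached advertisement is never filled in (assumed here to read as zero), negotiated_modes() returns 0 whatever the partner reports, which is consistent with the failure described above. Reading register 4 back from the PHY avoids depending on the cached copy.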

2001-02-26 20:58:48

by Manfred Spraul

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

Jeff Garzik wrote:
> Pat, Manfred, in pnic_check_duplex, make this change:
> > - negotiated = mii_reg5 & tp->advertising[0];
> > + negotiated = mii_reg5 & tulip_mdio_read(dev, tp->phys[0], 4);
>
The change fixed the problem.

>
> Manfred Spraul wrote:
> >
> > I think I found the bug:
> >
> > Someone (Jeff?) removed the line
> >
> > tp->advertising[phy_idx++] = reg4;
> >
> > from tulip/tulip_core.c
> >
> > pnic_check_duplex uses that variable :-(
> >
> > There are 2 workarounds:
> >
> > * change pnic_check_duplex:
> > s/tp->advertising[0]/tp->mii_advertise/g
> >
> > * remove the new mii_advertise variable and replace it with
> > 'tp->advertising[i]'.
>
> mii_advertise is what MII is currently advertising on the current
> media. tp->advertising is per-phy, on the other hand.
>

Could you double check the code in tulip_core.c, around line 1450?
IMHO it's bogus.

1) if the network card contains multiple mii's, then the advertised
value of all mii's is changed to the advertised value of the first mii.

2) the new driver starts with the current advertised value, the previous
driver recalculated the value from mii_status

[ mii_status = tulip_mdio_read(dev,phy,1); ]

- reg4 = ((mii_status>>6)& tp->to_advertise) | 1;

That could trigger 2 problems:
* I tested with 'options=11', and the new driver announces '100baseT4'
support, but the PHY doesn't support 100baseT4.
* If the mii is incorrectly initialized, then a wrong advertised value
is not corrected.

--
Manfred
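
As a minimal sketch of the old calculation quoted above, assuming the standard MII layout: the capability bits 11 to 15 of status register 1, shifted right by 6, line up with the ability bits 5 to 9 of advertisement register 4, and the trailing "| 1" sets the IEEE 802.3 selector field. The function name and the to_advertise mask used in the usage note are hypothetical; only the expression itself comes from the message.

```c
#include <stdint.h>

/* Hypothetical wrapper around the expression from the old driver:
 *     reg4 = ((mii_status >> 6) & tp->to_advertise) | 1;
 * Only modes the PHY reports in its status register survive the mask,
 * so a mode such as 100baseT4 (bit 0x0200 in register 4) is advertised
 * only when status bit 0x8000 is set. */
static uint16_t advertise_from_status(uint16_t mii_status, uint16_t to_advertise)
{
    return (uint16_t)(((mii_status >> 6) & to_advertise) | 1);
}
```

For example, a status value of 0x7809 (no 100baseT4 capability) with a mask of 0x03E1 yields 0x01E1, so the T4 bit stays clear. Starting from whatever register 4 currently holds skips this filtering, which matches the second problem listed above.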

2001-02-27 06:16:57

by Pat Verner

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

Good morning all.

First thing this morning I applied Jeff's patch, as below. Started off
well, ran for about 20 minutes (and 40 MBytes) before hanging.

Reversed out Jeff's change and applied Manfred's patch to the same lines in
pnic.c. Ran for about 15 minutes (28 Mbytes) before hanging. It is still
early, and the network is still quiet, so the volume of data received is
still low, but the hanging problem is unfortunately still there.

=Pat

At 09:58 PM 26/02/2001 +0100, Manfred Spraul wrote:
>Jeff Garzik wrote:
> > Pat, Manfred, in pnic_check_duplex, make this change:
> > > - negotiated = mii_reg5 & tp->advertising[0];
> > > + negotiated = mii_reg5 & tulip_mdio_read(dev, tp->phys[0], 4);
> >
>The change fixed the problem.
>
> >
> > Manfred Spraul wrote:
> > >
> > > I think I found the bug:
> > >
> > > Someone (Jeff?) removed the line
> > >
> > > tp->advertising[phy_idx++] = reg4;
> > >
> > > from tulip/tulip_core.c
> > >
> > > pnic_check_duplex uses that variable :-(
> > >
> > > There are 2 workarounds:
> > >
> > > * change pnic_check_duplex:
> > > s/tp->advertising[0]/tp->mii_advertise/g
> > >
> > > * remove the new mii_advertise variable and replace it with
> > > 'tp->advertising[i]'.
> >
> > mii_advertise is what MII is currently advertising on the current
> > media. tp->advertising is per-phy, on the other hand.
> >
>
>Could you double check the code in tulip_core.c, around line 1450?
>IMHO it's bogus.
>
>1) if the network card contains multiple mii's, then the advertised
>value of all mii's is changed to the advertised value of the first mii.
>
>2) the new driver starts with the current advertised value, the previous
>driver recalculated the value from mii_status
>
>[ mii_status = tulip_mdio_read(dev,phy,1); ]
>
>- reg4 = ((mii_status>>6)& tp->to_advertise) | 1;
>
>That could trigger 2 problems:
>* I tested with 'options=11', and the new driver announces '100baseT4'
>support, but the PHY doesn't support 100baseT4.
>* If the mii is incorrectly initialized, then a wrong advertised value
>is not corrected.
>
>--
> Manfred

--
Pat Verner E-Mail: [email protected]
Isis Information Systems (Pty) Ltd
PO Box 281, Irene, 0062, South Africa
Phone: +27-12-667-1411 Fax: +27-12-667-3800

2001-03-02 21:15:39

by Jeff Garzik

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

Manfred Spraul wrote:
> Could you double check the code in tulip_core.c, around line 1450?
> IMHO it's bogus.
>
> 1) if the network card contains multiple mii's, then the advertised
> value of all mii's is changed to the advertised value of the first mii.

I'm really curious about this one myself.

Since I haven't digested all of the tulip media stuff in my brain yet,
and since I'm not familiar with all the corner cases, I'm loath to
change the tulip media stuff without fully understanding what's going
on.

If you have a single controller with multiple MII phys... how does one
select the phy of choice (for tulip, in the absence of SROM media
table...)? And once phy A has been selected out of N available as the
active phy, should you care about the others at all?

Jeff


--
Jeff Garzik | "You see, in this world there's two kinds of
Building 1024 | people, my friend: Those with loaded guns
MandrakeSoft | and those who dig. You dig." --Blondie

2001-03-02 21:53:48

by Collectively Unconscious

Subject: I/O problem with sustained writes

We are having a problem with writes.
They start at 14 M/s for the first hour and then drop to 2.5 M/s and stay
that way. Reads do not seem affected and we've noticed this on the 2.2.16,
2.2.17, 2.2.18 and now the 2.2.19pre11 kernels.

These are SMP P-IIIs from 450 to 800 MHz. Redhat 6.2

Jay

2001-03-02 22:20:25

by Manfred Spraul

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

Jeff Garzik wrote:
>
> Manfred Spraul wrote:
> > Could you double check the code in tulip_core.c, around line 1450?
> > IMHO it's bogus.
> >
> > 1) if the network card contains multiple mii's, then the advertised
> > value of all mii's is changed to the advertised value of the first mii.
>
> I'm really curious about this one myself.
>
> Since I haven't digested all of the tulip media stuff in my brain yet,
> and since I'm not familiar with all the corner cases, I'm loath to
> change the tulip media stuff without fully understanding what's going
> on.
>
> If you have a single controller with multiple MII phys... how does one
> select the phy of choice (for tulip, in the absence of SROM media
> table...)?

I'd choose the first one with a link partner.

> And once phy A has been selected out of N available as the
> active phy, should you care about the others at all?
>

Not until the link beat disappears.
Then scan all existing phy's and select the phy with a link beat as the
new active phy.

At least that's what the sis900.c driver does. Are there other linux
drivers that support multiple phy's?

--
Manfred
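
The selection policy proposed above (take the first phy with a link, rescan only after the link is lost) could be sketched roughly as follows. This is an illustration under stated assumptions, not code from tulip or sis900.c: the callback type, select_active_phy(), and the fake bus used for demonstration are all hypothetical. It relies on the standard MII status register (register 1), whose link bit 0x0004 latches low, so the register is read twice.

```c
#include <stdint.h>

#define MII_REG_STATUS  1       /* standard MII status register */
#define MII_STATUS_LINK 0x0004  /* link status bit, latches low */

/* Hypothetical MDIO accessor supplied by the caller. */
typedef uint16_t (*mdio_read_fn)(void *ctx, int phy_addr, int reg);

/* Returns the index into phys[] of the first phy reporting link,
 * or -1 if none does. "First" simply means lowest index in phys[]. */
static int select_active_phy(mdio_read_fn mdio_read, void *ctx,
                             const int *phys, int phy_count)
{
    for (int i = 0; i < phy_count; i++) {
        /* First read clears a stale latched-low link bit. */
        (void)mdio_read(ctx, phys[i], MII_REG_STATUS);
        uint16_t status = mdio_read(ctx, phys[i], MII_REG_STATUS);

        if (status == 0xFFFF)   /* nothing answering at this address */
            continue;
        if (status & MII_STATUS_LINK)
            return i;
    }
    return -1;
}

/* Fake bus for demonstration only: one canned status word per address. */
struct fake_bus { uint16_t status[32]; };

static uint16_t fake_mdio_read(void *ctx, int phy_addr, int reg)
{
    const struct fake_bus *bus = ctx;
    return reg == MII_REG_STATUS ? bus->status[phy_addr & 31] : 0;
}
```

A driver would call select_active_phy() at open time and again once the active phy stops reporting link. The sketch leaves open how link loss is detected, and it polls every transceiver in turn, so it says nothing about hardware where only one transceiver is visible at a time.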

2001-03-02 22:47:50

by Donald Becker

Subject: Re: PROBLEM: Network hanging - Tulip driver with Netgear (Lite-On)

On Fri, 2 Mar 2001, Manfred Spraul wrote:
> Jeff Garzik wrote:
> > Manfred Spraul wrote:
> > > Could you double check the code in tulip_core.c, around line 1450?
> > > IMHO it's bogus.
> > >
> > > 1) if the network card contains multiple mii's, then the advertised
> > > value of all mii's is changed to the advertised value of the first mii.
...
> > If you have a single controller with multiple MII phys... how does one
> > select the phy of choice (for tulip, in the absence of SROM media
> > table...)?
>
> I'd choose the first one with a link partner.

Well, yes, but what is "first"?

Are there any Tulip cards (besides the Comet-2 w/HPNA) that have multiple
MII transceivers?

The Comet2 is a special case, since only one transceiver is powered and
visible at a time. Polling the other transceiver switches off the
first.

> > And once phy A has been selected out of N available as the
> > active phy, should you care about the others at all?
>
> Not until the link beat disappears.

Uhmm, but you don't always know when you have lost link beat. In some
cases the driver does basic polling to check for duplex changes, but
the semantics are not as clean as you would expect.


Donald Becker [email protected]
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993

2001-03-02 23:57:25

by Andrew Morton

Subject: Re: I/O problem with sustained writes

Collectively Unconscious wrote:
>
> We are having a problem with writes.
> They start at 14 M/s for the first hour and then drop to 2.5 M/s and stay
> that way. Reads do not seem affected and we've noticed this on the 2.2.16,
> 2.2.17, 2.2.18 and now the 2.2.19pre11 kernels.
>
> These are SMP P-IIIs from 450 to 800 MHz. Redhat 6.2

I've seen something similar on Seagate ST313021A IDE drives.
After a few minutes their read throughput falls from 20
megs/sec to about 5. Issuing *any* drive-setting command
brings the throughput back. Even a command which the disk
doesn't support.

So I have a cron job which runs `hdparm -A1' once per minute.
