2009-03-01 06:21:57

by Jesper Krogh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected

Jay Vosburgh wrote:
> Jesper Krogh <[email protected]> wrote:
>
>> Jay Vosburgh wrote:
>>> Jesper Krogh <[email protected]> wrote:
>>> [...]
>>>> The offending commit seems to be:
>>>>
>>>> A test with a fresh 2.6.29-rc6 revealed that the problem has been fixed
>>>> subsequently.. but still exists in 2.6.27-newest. (havent tested
>>>> 2.6.28-newest yet).
>>>>
>>>> Any ideas of what the "fixing" commit is .. or should that also be
>>>> bisected?
>>> I went back and looked at your earlier mail. Since you're using
>>> 802.3ad mode, my first guess would be this commit:
>>>
>>> commit fd989c83325cb34795bc4d4aa6b13c06f90eac99
>
> I'll compile 2.6.28.7 here and see if it works for me.

I appreciate that you spend time on it, but my feeling is that it
definately isn't reproducible in all environments (otherwise we would
probably have seen a large cry by now).

I'm trying to bisect the "fix" down and hope that'll tell us something
more.

If you do the test, remember, that it is not like "bonding isn't
working". It just fails to initialize correctly at bootup and doesnt get
the link state by itself. Subsequently doing a /etc/init.d/networking
restart brigs it correct up.

--
Jesper


2009-03-01 13:19:35

by Jesper Krogh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

Jesper Krogh wrote:
> I appreciate that you spend time on it, but my feeling is that it
> definately isn't reproducible in all environments (otherwise we would
> probably have seen a large cry by now).
>
> I'm trying to bisect the "fix" down and hope that'll tell us something
> more.

Ok, The fixing commit seems to be: cb52deba12f27af90a46d2f8667a64888118a888

Applying it to 2.6.28.7 and 2.6.27.19 makes them both work.

It also explains why my e1000 based bonds didnt break, allthough the
commitmessage doesnt mention anything about how it should effect
bonding. Wouldn't it make sense to propose this patch for 2.6.27 and
2.6.28 stable kernels?


commit cb52deba12f27af90a46d2f8667a64888118a888
Author: Ed Swierk <[email protected]>
Date: Mon Dec 1 12:24:43 2008 +0000

forcedeth: power down phy when interface is down

Bring the physical link down when the interface is down by placing
the PHY
in power-down state, unless WOL is enabled. This mirrors the
behavior of
other drivers including e1000 and tg3.

Without the patch, ifconfig down leaves the physical link up, which
confuses
datacenter users who expect the link lights both on the NIC and the
switch to
go out when they bring an interface down.

Furthermore, even though the phy is powered on, autonegotiation
stops working,
so a normally gigabit link might suddenly become 100 Mbit
half-duplex when the
interface goes down, and become gigabit when it comes up again.

Ayaz said:

I would not include this patch until further testing is
performed. NVIDIA
MCP chips use 3rd party PHY vendors. By powering down the phy,
it could
have adverse affects on certain phys.

Arthur Jones said:

I just ran across this patch. Tested on a Marvell 88E1121R (GigE
PHY)
and works great. This is a very important feature for me.

Signed-off-by: Ed Swierk <[email protected]>
Tested-by: Arthur Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 0d7e575..12384df 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -1446,9 +1446,9 @@ static int phy_init(struct net_device *dev)
/* some phys clear out pause advertisment on reset, set it back */
mii_rw(dev, np->phyaddr, MII_ADVERTISE, reg);

- /* restart auto negotiation */
+ /* restart auto negotiation, power down phy */
mii_control = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ);
- mii_control |= (BMCR_ANRESTART | BMCR_ANENABLE);
+ mii_control |= (BMCR_ANRESTART | BMCR_ANENABLE | BMCR_PDOWN);
if (mii_rw(dev, np->phyaddr, MII_BMCR, mii_control)) {
return PHY_ERROR;
}
@@ -5208,6 +5208,10 @@ static int nv_open(struct net_device *dev)

dprintk(KERN_DEBUG "nv_open: begin\n");

+ /* power up phy */
+ mii_rw(dev, np->phyaddr, MII_BMCR,
+ mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ) & ~BMCR_PDOWN);
+
/* erase previous misconfiguration */
if (np->driver_data & DEV_HAS_POWER_CNTRL)
nv_mac_reset(dev);
@@ -5401,6 +5405,10 @@ static int nv_close(struct net_device *dev)
if (np->wolenabled) {
writel(NVREG_PFF_ALWAYS|NVREG_PFF_MYADDR, base +
NvRegPacketFilterFlags);
nv_start_rx(dev);
+ } else {
+ /* power down phy */
+ mii_rw(dev, np->phyaddr, MII_BMCR,
+ mii_rw(dev, np->phyaddr, MII_BMCR,
MII_READ)|BMCR_PDOWN);
}

/* FIXME: power down nic */

2009-03-05 18:51:22

by Jay Vosburgh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

Jesper Krogh <[email protected]> wrote:

>Jesper Krogh wrote:
>> I appreciate that you spend time on it, but my feeling is that it
>> definately isn't reproducible in all environments (otherwise we would
>> probably have seen a large cry by now).
>>
>> I'm trying to bisect the "fix" down and hope that'll tell us something
>> more.
>
>Ok, The fixing commit seems to be: cb52deba12f27af90a46d2f8667a64888118a888
>
>Applying it to 2.6.28.7 and 2.6.27.19 makes them both work.
>
>It also explains why my e1000 based bonds didnt break, allthough the
>commitmessage doesnt mention anything about how it should effect
>bonding. Wouldn't it make sense to propose this patch for 2.6.27 and
>2.6.28 stable kernels?

Perhaps.

I don't have a forcedeth to test with, and as you surmised, I
was unable to reproduce the problem with other chipsets (tg3 or e1000).

However, I did find another bug I introduced during the "mii
refactor" patch that you mentioned as being the original source of the
problem. That bug will cause 802.3ad to not notice speed changes.

Could you test the patch below on your 2.6.68.7 and/or 2.6.27.19
and see if it resolves your problem (without the forcedeth patch)?

-J

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 2c96b93..ad81474 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3545,11 +3545,27 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave
}
break;
case NETDEV_CHANGE:
- /*
- * TODO: is this what we get if somebody
- * sets up a hierarchical bond, then rmmod's
- * one of the slave bonding devices?
- */
+ if (bond->params.mode == BOND_MODE_8023AD ||
+ bond_is_lb(bond)) {
+ struct slave *slave;
+
+ slave = bond_get_slave_by_dev(bond, slave_dev);
+ if (slave) {
+ u16 old_speed = slave->speed;
+ u16 old_duplex = slave->duplex;
+
+ bond_update_speed_duplex(slave);
+
+ if (bond_is_lb(bond))
+ break;
+
+ if (old_speed != slave->speed)
+ bond_3ad_adapter_speed_changed(slave);
+ if (old_duplex != slave->duplex)
+ bond_3ad_adapter_duplex_changed(slave);
+ }
+ }
+
break;
case NETDEV_DOWN:
/*


---
-Jay Vosburgh, IBM Linux Technology Center, [email protected]

2009-03-09 20:54:04

by Jesper Krogh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

Jay Vosburgh wrote:
> However, I did find another bug I introduced during the "mii
> refactor" patch that you mentioned as being the original source of the
> problem. That bug will cause 802.3ad to not notice speed changes.
>
> Could you test the patch below on your 2.6.68.7 and/or 2.6.27.19
> and see if it resolves your problem (without the forcedeth patch)?

There was something missing from the header to make it compile.. I found
that in a later version. Patch below fixed the problem (without the
forcedeth patch).

diff --git a/drivers/net/bonding/bond_main.c
b/drivers/net/bonding/bond_main.c
index 1b9c4dc..fd61dfb 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3516,11 +3516,27 @@ static int bond_slave_netdev_event(unsigned long
event, struct net_device *slave
}
break;
case NETDEV_CHANGE:
- /*
- * TODO: is this what we get if somebody
- * sets up a hierarchical bond, then rmmod's
- * one of the slave bonding devices?
- */
+ if (bond->params.mode == BOND_MODE_8023AD ||
+ bond_is_lb(bond)) {
+ struct slave *slave;
+
+ slave = bond_get_slave_by_dev(bond, slave_dev);
+ if (slave) {
+ u16 old_speed = slave->speed;
+ u16 old_duplex = slave->duplex;
+
+ bond_update_speed_duplex(slave);
+
+ if (bond_is_lb(bond))
+ break;
+
+ if (old_speed != slave->speed)
+
bond_3ad_adapter_speed_changed(slave);
+ if (old_duplex != slave->duplex)
+
bond_3ad_adapter_duplex_changed(slave);
+ }
+ }
+
break;
case NETDEV_DOWN:
/*
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index fb730ec..b1315e4 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -248,6 +248,14 @@ static inline struct bonding
*bond_get_bond_by_slave(struct slave *slave)
return (struct bonding *)slave->dev->master->priv;
}

+static inline bool bond_is_lb(const struct bonding *bond)
+{
+ return bond->params.mode == BOND_MODE_TLB
+ || bond->params.mode == BOND_MODE_ALB;
+}
+
+
+
#define BOND_FOM_NONE 0
#define BOND_FOM_ACTIVE 1
#define BOND_FOM_FOLLOW 2

2009-03-13 23:12:43

by David Miller

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

From: Jesper Krogh <[email protected]>
Date: Mon, 09 Mar 2009 21:53:39 +0100

> Jay Vosburgh wrote:
> > However, I did find another bug I introduced during the "mii
> > refactor" patch that you mentioned as being the original source of the
> > problem. That bug will cause 802.3ad to not notice speed changes.
> > Could you test the patch below on your 2.6.68.7 and/or 2.6.27.19
> > and see if it resolves your problem (without the forcedeth patch)?
>
> There was something missing from the header to make it compile.. I found that in a later version. Patch below fixed the problem (without the forcedeth patch).

Jay please resend this with proper signoffs etc. if you want
me to apply it.

2009-03-13 23:27:25

by Jay Vosburgh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

David Miller <[email protected]> wrote:

>From: Jesper Krogh <[email protected]>
>Date: Mon, 09 Mar 2009 21:53:39 +0100
>
>> Jay Vosburgh wrote:
>> > However, I did find another bug I introduced during the "mii
>> > refactor" patch that you mentioned as being the original source of the
>> > problem. That bug will cause 802.3ad to not notice speed changes.
>> > Could you test the patch below on your 2.6.68.7 and/or 2.6.27.19
>> > and see if it resolves your problem (without the forcedeth patch)?
>>
>> There was something missing from the header to make it compile.. I found that in a later version. Patch below fixed the problem (without the forcedeth patch).
>
>Jay please resend this with proper signoffs etc. if you want
>me to apply it.

I posted it again with the usual stuff a day or two after I
posted the test patch; I'll append it to the end of this email. Note
that the below patch has a minor cosmetic change from the test patch.

I believe this fix should go to -stable for 2.6.26 and 2.6.27,
but it'll need the change Jesper added to pick up a macro that was added
to mainline:

diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index fb730ec..b1315e4 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -248,6 +248,14 @@ static inline struct bonding
*bond_get_bond_by_slave(struct slave *slave)
return (struct bonding *)slave->dev->master->priv;
}

+static inline bool bond_is_lb(const struct bonding *bond)
+{
+ return bond->params.mode == BOND_MODE_TLB
+ || bond->params.mode == BOND_MODE_ALB;
+}
+
+
+
#define BOND_FOM_NONE 0
#define BOND_FOM_ACTIVE 1
#define BOND_FOM_FOLLOW 2

The above fragment isn't needed for mainline, only for -stable.

-J

From: Jay Vosburgh <[email protected]>
To: [email protected]
Cc: "David S. Miller" <[email protected]>, [email protected]
Subject: [PATCH net-next-2.6] bonding: Fix updating of speed/duplex changes
Date: Fri, 06 Mar 2009 15:27:33 -0800


This patch corrects an omission from the following commit:

commit f0c76d61779b153dbfb955db3f144c62d02173c2
Author: Jay Vosburgh <[email protected]>
Date: Wed Jul 2 18:21:58 2008 -0700

bonding: refactor mii monitor

The un-refactored code checked the link speed and duplex of
every slave on every pass; the refactored code did not do so.

The 802.3ad and balance-alb/tlb modes utilize the speed and
duplex information, and require it to be kept up to date. This patch
adds a notifier check to perform the appropriate updating when the slave
device speed changes.

Signed-off-by: Jay Vosburgh <[email protected]>
---
drivers/net/bonding/bond_main.c | 25 ++++++++++++++++++++-----
1 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bfe1ed8..dce3cf9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3545,11 +3545,26 @@ static int bond_slave_netdev_event(unsigned long event, struct net_device *slave
}
break;
case NETDEV_CHANGE:
- /*
- * TODO: is this what we get if somebody
- * sets up a hierarchical bond, then rmmod's
- * one of the slave bonding devices?
- */
+ if (bond->params.mode == BOND_MODE_8023AD || bond_is_lb(bond)) {
+ struct slave *slave;
+
+ slave = bond_get_slave_by_dev(bond, slave_dev);
+ if (slave) {
+ u16 old_speed = slave->speed;
+ u16 old_duplex = slave->duplex;
+
+ bond_update_speed_duplex(slave);
+
+ if (bond_is_lb(bond))
+ break;
+
+ if (old_speed != slave->speed)
+ bond_3ad_adapter_speed_changed(slave);
+ if (old_duplex != slave->duplex)
+ bond_3ad_adapter_duplex_changed(slave);
+ }
+ }
+
break;
case NETDEV_DOWN:
/*
--
1.6.0.2

2009-03-16 20:35:08

by Jesper Krogh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

Jay Vosburgh wrote:
> The above fragment isn't needed for mainline, only for -stable.
>

Did you sent it off to the stable kernel maintainers?

--
Jesper

2009-03-16 20:36:26

by David Miller

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

From: Jesper Krogh <[email protected]>
Date: Mon, 16 Mar 2009 21:34:38 +0100

> Jay Vosburgh wrote:
> > The above fragment isn't needed for mainline, only for -stable.
> >
>
> Did you sent it off to the stable kernel maintainers?

The fix has to go into Linus's tree first.

I haven't integrated Jay's changes yet.

2009-03-17 20:19:01

by Jesper Krogh

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

David Miller wrote:
> From: Jesper Krogh <[email protected]>
> Date: Mon, 16 Mar 2009 21:34:38 +0100
>
>> Jay Vosburgh wrote:
>>> The above fragment isn't needed for mainline, only for -stable.
>>>
>> Did you sent it off to the stable kernel maintainers?
>
> The fix has to go into Linus's tree first.
>
> I haven't integrated Jay's changes yet.

Excellent. I was just trying to make sure that it wasn't lost somewhere
in the process.

--
Jesper

2009-03-19 01:39:57

by David Miller

[permalink] [raw]
Subject: Re: Regression in bonding between 2.6.26.8 and 2.6.27.6 - bisected - twice

From: Jay Vosburgh <[email protected]>
Date: Fri, 13 Mar 2009 16:27:16 -0700

> David Miller <[email protected]> wrote:
>
> >From: Jesper Krogh <[email protected]>
> >Date: Mon, 09 Mar 2009 21:53:39 +0100
> >
> >> Jay Vosburgh wrote:
> >> > However, I did find another bug I introduced during the "mii
> >> > refactor" patch that you mentioned as being the original source of the
> >> > problem. That bug will cause 802.3ad to not notice speed changes.
> >> > Could you test the patch below on your 2.6.68.7 and/or 2.6.27.19
> >> > and see if it resolves your problem (without the forcedeth patch)?
> >>
> >> There was something missing from the header to make it compile.. I found that in a later version. Patch below fixed the problem (without the forcedeth patch).
> >
> >Jay please resend this with proper signoffs etc. if you want
> >me to apply it.
>
> I posted it again with the usual stuff a day or two after I
> posted the test patch; I'll append it to the end of this email. Note
> that the below patch has a minor cosmetic change from the test patch.
>
> I believe this fix should go to -stable for 2.6.26 and 2.6.27,
> but it'll need the change Jesper added to pick up a macro that was added
> to mainline:

Applied and I'll queue it up for -stable too, thanks!