2014-04-02 00:11:30

by Alexander Holler

[permalink] [raw]
Subject: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
changed the initialization of the mv643xx_eth driver to use phy_init_hw()
to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
was broken such, that it used mdelay() instead of really waiting for a
reset to finish.

The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
come up anymore (no carrier).

Fix this by waiting for a reset to finish before proceeding further.

Signed-off-by: Alexander Holler <[email protected]>

Cc: Michal Simek <[email protected]>
Cc: Florian Fainelli <[email protected]>
Cc: <[email protected]>
---
drivers/net/phy/marvell.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index bd37e45..5b84808 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -396,7 +396,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
if (err < 0)
return err;

- mdelay(500);
+ do
+ temp = phy_read(phydev, MII_BMCR);
+ while (temp & BMCR_RESET);

err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0);
if (err < 0)
@@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
if (err < 0)
return err;

- mdelay(500);
+ do
+ temp = phy_read(phydev, MII_BMCR);
+ while (temp & BMCR_RESET);

return 0;
}
--
1.8.3.1


2014-04-02 00:01:17

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)

Please specify the commit message in parenthesis.

> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
> was broken such, that it used mdelay() instead of really waiting for a
> reset to finish.
>
> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
> come up anymore (no carrier).
>
> Fix this by waiting for a reset to finish before proceeding further.
>
> Signed-off-by: Alexander Holler <[email protected]>
>
> Cc: Michal Simek <[email protected]>
> Cc: Florian Fainelli <[email protected]>
> Cc: <[email protected]>
> ---
> drivers/net/phy/marvell.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index bd37e45..5b84808 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -396,7 +396,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
> if (err < 0)
> return err;
>
> - mdelay(500);
> + do
> + temp = phy_read(phydev, MII_BMCR);
> + while (temp & BMCR_RESET);

Please use genphy_soft_reset() which ensures that the bit is cleared,
and will also wait for an appropriate amount of time before returning.
I had a bunch of local patches waiting for net-next which clean up the
Marvell PHY driver to use genphy_soft_reset() consistently instead of
open-coding BMCR_RESET without waiting for the bit to get cleared.

>
> err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0);
> if (err < 0)
> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
> if (err < 0)
> return err;
>
> - mdelay(500);
> + do
> + temp = phy_read(phydev, MII_BMCR);
> + while (temp & BMCR_RESET);

Ditto.

>
> return 0;
> }
> --
> 1.8.3.1
>



--
Florian

2014-04-02 00:58:31

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
> was broken such, that it used mdelay() instead of really waiting for a
> reset to finish.

So the only big difference before
7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
phy_init_hw to reset PHY") is that we waited potentially forever for
the BMCR_RESET bit to get cleared, while now, we only wait for up to
500 milliseconds.

Could you verify the following two things before your patch gets merged:

- how long does it take for your PHY to clear the BMCR_RESET bit, is
it more than the allowed time out in
drivers/net/phy/phy_device.c::phy_poll_reset
- is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
we might be hitting some corner case where toggling BMCR_RESET will
power it on, but at the expense of waiting longer

Thanks

>
> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
> come up anymore (no carrier).
>
> Fix this by waiting for a reset to finish before proceeding further.
>
> Signed-off-by: Alexander Holler <[email protected]>
>
> Cc: Michal Simek <[email protected]>
> Cc: Florian Fainelli <[email protected]>
> Cc: <[email protected]>
> ---
> drivers/net/phy/marvell.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index bd37e45..5b84808 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -396,7 +396,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
> if (err < 0)
> return err;
>
> - mdelay(500);
> + do
> + temp = phy_read(phydev, MII_BMCR);
> + while (temp & BMCR_RESET);
>
> err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0);
> if (err < 0)
> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
> if (err < 0)
> return err;
>
> - mdelay(500);
> + do
> + temp = phy_read(phydev, MII_BMCR);
> + while (temp & BMCR_RESET);
>
> return 0;
> }
> --
> 1.8.3.1
>



--
Florian

2014-04-02 09:09:50

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 02.04.2014 02:57, schrieb Florian Fainelli:
> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>> was broken such, that it used mdelay() instead of really waiting for a
>> reset to finish.
>
> So the only big difference before
> 7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
> phy_init_hw to reset PHY") is that we waited potentially forever for
> the BMCR_RESET bit to get cleared, while now, we only wait for up to
> 500 milliseconds.
>
> Could you verify the following two things before your patch gets merged:
>
> - how long does it take for your PHY to clear the BMCR_RESET bit, is
> it more than the allowed time out in
> drivers/net/phy/phy_device.c::phy_poll_reset
> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
> we might be hitting some corner case where toggling BMCR_RESET will
> power it on, but at the expense of waiting longer

I've done two tests with pr_info before and after the two resets in
m88e1116r_config_init:

with my patch:
-----------------------
dmesg | grep -A1 -B1 AHO
[ 1.090099] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
[ 1.175916] AHO: before first reset
[ 1.179613] AHO: after first reset
[ 1.183743] AHO: before second reset
[ 1.187530] AHO: after second reset
[ 1.191905] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
address xx
--
[ 1.426487] netpoll: netconsole: device eth0 not up yet, forcing it
[ 1.505986] AHO: before first reset
[ 1.509725] AHO: after first reset
[ 1.513899] AHO: before second reset
[ 1.517754] AHO: after second reset
[ 1.521802] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
--
[ 21.372591] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
across:2996116k
[ 28.305989] AHO: before first reset
[ 28.306200] AHO: after first reset
[ 28.306936] AHO: before second reset
[ 28.307153] AHO: after second reset
[ 31.509973] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000
Mb/s, full duplex, flow control disabled
-----------------------


with mdelay (the value after reset is what contains MII_BMCR):

-----------------------
dmesg | grep -A1 -B1 AHO
[ 1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
[ 1.175888] AHO: before first reset
[ 1.678806] AHO: after first reset 0x0
[ 1.683281] AHO: before second reset
[ 2.186288] AHO: after second reset 0x0
[ 2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
address xx
--
[ 2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
[ 2.505917] AHO: before first reset
[ 2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
--
[ 2.829987] hub 1-1:1.0: 4 ports detected
[ 3.044502] AHO: after first reset 0x0
[ 3.049133] AHO: before second reset
[ 3.126110] usb 1-1.1: new high-speed USB device number 3 using
orion-ehci
--
[ 3.526107] usb 1-1.3: device descriptor read/64, error -32
[ 3.614264] AHO: after second reset 0x0
[ 3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
--
[ 21.335730] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
across:2996116k
[ 28.195942] AHO: before first reset
[ 28.696270] AHO: after first reset 0x800
[ 28.696958] AHO: before second reset
[ 29.197354] AHO: after second reset 0x800
[ 111.612263] RPC: Registered named UNIX socket transport module.
-----------------------

So we see, the first reset in the last call of m88e1116r_config_init()
fails to complete in 500ms and the phy seems to be stuck afterwards,
but, for whatever reason, it doesn't need 500ms if mdelay isn't used (if
we can trust the timestamps).

(You can also see, I have netconsole enabled using netconsole=... in the
kernel cmdline).

That behaviour is reproducible. The first reset in the last call to
m88e1116r_config_init() always fails if mdelay is used.

Regards,

Alexander Holler

2014-04-02 10:54:45

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 02.04.2014 11:09, schrieb Alexander Holler:
> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>> changed the initialization of the mv643xx_eth driver to use
>>> phy_init_hw()
>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>> was broken such, that it used mdelay() instead of really waiting for a
>>> reset to finish.
>>
>> So the only big difference before
>> 7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
>> phy_init_hw to reset PHY") is that we waited potentially forever for
>> the BMCR_RESET bit to get cleared, while now, we only wait for up to
>> 500 milliseconds.
>>
>> Could you verify the following two things before your patch gets merged:
>>
>> - how long does it take for your PHY to clear the BMCR_RESET bit, is
>> it more than the allowed time out in
>> drivers/net/phy/phy_device.c::phy_poll_reset
>> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
>> we might be hitting some corner case where toggling BMCR_RESET will
>> power it on, but at the expense of waiting longer
>
> I've done two tests with pr_info before and after the two resets in
> m88e1116r_config_init:
>
> with my patch:
> -----------------------
> dmesg | grep -A1 -B1 AHO
> [ 1.090099] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version
> 1.4
> [ 1.175916] AHO: before first reset
> [ 1.179613] AHO: after first reset
> [ 1.183743] AHO: before second reset
> [ 1.187530] AHO: after second reset
> [ 1.191905] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [ 1.426487] netpoll: netconsole: device eth0 not up yet, forcing it
> [ 1.505986] AHO: before first reset
> [ 1.509725] AHO: after first reset
> [ 1.513899] AHO: before second reset
> [ 1.517754] AHO: after second reset
> [ 1.521802] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [ 21.372591] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
> across:2996116k
> [ 28.305989] AHO: before first reset
> [ 28.306200] AHO: after first reset
> [ 28.306936] AHO: before second reset
> [ 28.307153] AHO: after second reset
> [ 31.509973] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000
> Mb/s, full duplex, flow control disabled
> -----------------------
>
>
> with mdelay (the value after reset is what contains MII_BMCR):
>
> -----------------------
> dmesg | grep -A1 -B1 AHO
> [ 1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version
> 1.4
> [ 1.175888] AHO: before first reset
> [ 1.678806] AHO: after first reset 0x0
> [ 1.683281] AHO: before second reset
> [ 2.186288] AHO: after second reset 0x0
> [ 2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [ 2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
> [ 2.505917] AHO: before first reset
> [ 2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
> --
> [ 2.829987] hub 1-1:1.0: 4 ports detected
> [ 3.044502] AHO: after first reset 0x0
> [ 3.049133] AHO: before second reset
> [ 3.126110] usb 1-1.1: new high-speed USB device number 3 using
> orion-ehci
> --
> [ 3.526107] usb 1-1.3: device descriptor read/64, error -32
> [ 3.614264] AHO: after second reset 0x0
> [ 3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [ 21.335730] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
> across:2996116k
> [ 28.195942] AHO: before first reset
> [ 28.696270] AHO: after first reset 0x800
> [ 28.696958] AHO: before second reset
> [ 29.197354] AHO: after second reset 0x800
> [ 111.612263] RPC: Registered named UNIX socket transport module.
> -----------------------
>
> So we see, the first reset in the last call of m88e1116r_config_init()
> fails to complete in 500ms and the phy seems to be stuck afterwards,
> but, for whatever reason, it doesn't need 500ms if mdelay isn't used (if
> we can trust the timestamps).
>
> (You can also see, I have netconsole enabled using netconsole=... in the
> kernel cmdline).
>
> That behaviour is reproducible. The first reset in the last call to
> m88e1116r_config_init() always fails if mdelay is used.

I've forgotten to add that MII_BMCR is always zero when
m88e1116r_config_init() is called.

So to conclude, I've no idea why using mdelay seems to break the PHY. It
looks like the forced delay of together one second in
m88e1116r_config_init() somehow breaks something, but that's pure
speculation. And as my patch fixes things here, I've no reason to dig
further.

Regards,

Alexander Holler

2014-04-02 11:51:47

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Hello.

On 02-04-2014 3:55, Alexander Holler wrote:

> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
> was broken such, that it used mdelay() instead of really waiting for a
> reset to finish.

> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
> come up anymore (no carrier).

> Fix this by waiting for a reset to finish before proceeding further.

> Signed-off-by: Alexander Holler <[email protected]>

> Cc: Michal Simek <[email protected]>
> Cc: Florian Fainelli <[email protected]>
> Cc: <[email protected]>
> ---
> drivers/net/phy/marvell.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index bd37e45..5b84808 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -396,7 +396,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
> if (err < 0)
> return err;
>
> - mdelay(500);
> + do
> + temp = phy_read(phydev, MII_BMCR);
> + while (temp & BMCR_RESET);

I don't understand what's the point of this BMCR reset if phy_init_hw()
will have already done it before calling phy_init_hw().

> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
> if (err < 0)
> return err;
>
> - mdelay(500);
> + do
> + temp = phy_read(phydev, MII_BMCR);
> + while (temp & BMCR_RESET);

Not clear why it's necessary to reset one more time... Also, tight loop
without a timeout (0.5 sec, specified by IEEE 802.3) doesn't look good. The
comment above phy_poll_reset() suggests that it could be used in the PHY
drivers as well. Florian?

WBR, Sergei

2014-04-02 12:07:13

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 02-04-2014 15:51, Sergei Shtylyov wrote:

>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>> was broken such, that it used mdelay() instead of really waiting for a
>> reset to finish.

>> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
>> come up anymore (no carrier).

>> Fix this by waiting for a reset to finish before proceeding further.

>> Signed-off-by: Alexander Holler <[email protected]>

>> Cc: Michal Simek <[email protected]>
>> Cc: Florian Fainelli <[email protected]>
>> Cc: <[email protected]>
>> ---
>> drivers/net/phy/marvell.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)

>> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
>> index bd37e45..5b84808 100644
>> --- a/drivers/net/phy/marvell.c
>> +++ b/drivers/net/phy/marvell.c
[...]
>> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
>> if (err < 0)
>> return err;
>>
>> - mdelay(500);
>> + do
>> + temp = phy_read(phydev, MII_BMCR);
>> + while (temp & BMCR_RESET);

> Not clear why it's necessary to reset one more time... Also, tight loop
> without a timeout (0.5 sec, specified by IEEE 802.3) doesn't look good. The
> comment above phy_poll_reset() suggests that it could be used in the PHY
> drivers as well. Florian?

Ah, I was looking at Linus' tree, not net-next.git, and hadn't read
Florian's replies before commenting on this...

WBR, Sergei

2014-04-02 14:35:45

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 02.04.2014 14:07, schrieb Sergei Shtylyov:
> On 02-04-2014 15:51, Sergei Shtylyov wrote:
>
>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>> changed the initialization of the mv643xx_eth driver to use
>>> phy_init_hw()
>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>> was broken such, that it used mdelay() instead of really waiting for a
>>> reset to finish.

(...)

>>>
>>> - mdelay(500);
>>> + do
>>> + temp = phy_read(phydev, MII_BMCR);
>>> + while (temp & BMCR_RESET);
>
>> Not clear why it's necessary to reset one more time... Also, tight
>> loop
>> without a timeout (0.5 sec, specified by IEEE 802.3) doesn't look
>> good. The
>> comment above phy_poll_reset() suggests that it could be used in the PHY
>> drivers as well. Florian?
>
> Ah, I was looking at Linus' tree, not net-next.git, and hadn't read
> Florian's replies before commenting on this...

Just to be clear, this patch isn't a change request, it's a bugfix
because the ethernet doesn't come up on my Kirkwood device(s) using
kernel 3.14. So it's mainly meant for other people which want to use
3.14 with similiar devices and have problems too.

Reverting the above commit does do the trick here too.

I haven't looked at the datasheet nor any possible erratas. So I don't
know why and if there are multiple resets necessary. I've just made the
code more similiar to what was used before the above commit changed the
reset of the PHY. The tight loop is a copy from other m*_config_inits in
the same file and was in use in mv643xx_eth before the above commit too.
So at least the tight loop is proven to work. And genphy_soft_reset(),
which does not wait indefinitely long if the phy never does come out of
reset, is only available in some -next tree, definitely not where I look
at and need it for a bugfix.

But I agree, all the stuff looks very curious, tight loops without a
timeout, multiple resets in order and such things more.

But this patch isn't a cleanup, it's meant to be as small as possible to
make things work again. And I decided that instead of just suggesting a
revert, it's better to fix m88e1116r_config_init() which might be used
at other places too.

It might make sense to wait for some other users to complain, I think it
won't need long as this SOC is used in Sheevaplugs and Dockstars, some
ARMv5 devices which were quiet popular. Or if nobody else complains,
maybe someone might test my change. Ut could be possible that the
failing sequence is through the/my use of netconsole, which is, I
assume, not very common.

Regards,

Alexander Holler

2014-04-02 19:02:09

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

2014-04-02 2:09 GMT-07:00 Alexander Holler <[email protected]>:
> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>
>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>>>
>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>> was broken such, that it used mdelay() instead of really waiting for a
>>> reset to finish.
>>
>>
>> So the only big difference before
>> 7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
>> phy_init_hw to reset PHY") is that we waited potentially forever for
>> the BMCR_RESET bit to get cleared, while now, we only wait for up to
>> 500 milliseconds.
>>
>> Could you verify the following two things before your patch gets merged:
>>
>> - how long does it take for your PHY to clear the BMCR_RESET bit, is
>> it more than the allowed time out in
>> drivers/net/phy/phy_device.c::phy_poll_reset
>> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
>> we might be hitting some corner case where toggling BMCR_RESET will
>> power it on, but at the expense of waiting longer
>
>
> I've done two tests with pr_info before and after the two resets in
> m88e1116r_config_init:
>
> with my patch:
> -----------------------
> dmesg | grep -A1 -B1 AHO
> [ 1.090099] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
> [ 1.175916] AHO: before first reset
> [ 1.179613] AHO: after first reset

That makes about 3.697 milliseconds

> [ 1.183743] AHO: before second reset
> [ 1.187530] AHO: after second reset

That makes about 3.787 milliseconds

> [ 1.191905] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [ 1.426487] netpoll: netconsole: device eth0 not up yet, forcing it
> [ 1.505986] AHO: before first reset
> [ 1.509725] AHO: after first reset
> [ 1.513899] AHO: before second reset
> [ 1.517754] AHO: after second reset
> [ 1.521802] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [ 21.372591] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
> across:2996116k
> [ 28.305989] AHO: before first reset
> [ 28.306200] AHO: after first reset
> [ 28.306936] AHO: before second reset
> [ 28.307153] AHO: after second reset
> [ 31.509973] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000 Mb/s,
> full duplex, flow control disabled
> -----------------------
>
>
> with mdelay (the value after reset is what contains MII_BMCR):
>
> -----------------------
> dmesg | grep -A1 -B1 AHO
> [ 1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
> [ 1.175888] AHO: before first reset
> [ 1.678806] AHO: after first reset 0x0

That's 502.918 milliseconds

> [ 1.683281] AHO: before second reset
> [ 2.186288] AHO: after second reset 0x0

That's 503.007 milliseconds

> [ 2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [ 2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
> [ 2.505917] AHO: before first reset
> [ 2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
> --
> [ 2.829987] hub 1-1:1.0: 4 ports detected
> [ 3.044502] AHO: after first reset 0x0
> [ 3.049133] AHO: before second reset
> [ 3.126110] usb 1-1.1: new high-speed USB device number 3 using
> orion-ehci
> --
> [ 3.526107] usb 1-1.3: device descriptor read/64, error -32
> [ 3.614264] AHO: after second reset 0x0
> [ 3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [ 21.335730] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
> across:2996116k
> [ 28.195942] AHO: before first reset
> [ 28.696270] AHO: after first reset 0x800
> [ 28.696958] AHO: before second reset
> [ 29.197354] AHO: after second reset 0x800
> [ 111.612263] RPC: Registered named UNIX socket transport module.
> -----------------------
>
> So we see, the first reset in the last call of m88e1116r_config_init() fails
> to complete in 500ms and the phy seems to be stuck afterwards, but, for
> whatever reason, it doesn't need 500ms if mdelay isn't used (if we can trust
> the timestamps).

I wonder if the extra 2/3 microseconds we are seeing are nothing more
than the buffered printk. At any rate, it looks like the software
reset of this PHY needs to polled, and frequently for it to complete
successfully.

Can you resubmit with the following information:

- you specify the commit that introduce the problem in parenthesis,
e.g: deadbeef ("dead: beef: cafe babe")
- put this dmesg snippet in your log message to clearly illustrate
what is happening
- clarify that the PHY needs to be polled in a comment in
m88e1116r_config_init()

Meanwhile, it would be good if someone with access to this particular
PHY datasheet could look for known erratas, problems, non-standard
compliant behavior ....

>
> (You can also see, I have netconsole enabled using netconsole=... in the
> kernel cmdline).
>
> That behaviour is reproducible. The first reset in the last call to
> m88e1116r_config_init() always fails if mdelay is used.
>
> Regards,
>
> Alexander Holler
--
Florian

2014-04-02 20:25:08

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 04/02/2014 09:01 PM, Florian Fainelli wrote:
> 2014-04-02 2:09 GMT-07:00 Alexander Holler <[email protected]>:
>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>> reset to finish.
>
> Can you resubmit with the following information:
>
> - you specify the commit that introduce the problem in parenthesis,
> e.g: deadbeef ("dead: beef: cafe babe")
> - put this dmesg snippet in your log message to clearly illustrate
> what is happening
> - clarify that the PHY needs to be polled in a comment in
> m88e1116r_config_init()
>
> Meanwhile, it would be good if someone with access to this particular
> PHY datasheet could look for known erratas, problems, non-standard
> compliant behavior ....

Alexander,

I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
(0x01410e40) and cannot reproduce any regression. I tried booting from
tftp and usb, I also rebooted twice to see if there are any side
effects - nothing - ethernet always comes up as expected.

I am curious, how you determined above commit to be the cause of the
regression you are seeing. Can you bisect, if you didn't already?

Sebastian

2014-04-02 22:12:43

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 02.04.2014 22:25, schrieb Sebastian Hesselbarth:
> On 04/02/2014 09:01 PM, Florian Fainelli wrote:
>> 2014-04-02 2:09 GMT-07:00 Alexander Holler <[email protected]>:
>>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>>> reset to finish.
>>
>> Can you resubmit with the following information:
>>
>> - you specify the commit that introduce the problem in parenthesis,
>> e.g: deadbeef ("dead: beef: cafe babe")
>> - put this dmesg snippet in your log message to clearly illustrate
>> what is happening
>> - clarify that the PHY needs to be polled in a comment in
>> m88e1116r_config_init()
>>
>> Meanwhile, it would be good if someone with access to this particular
>> PHY datasheet could look for known erratas, problems, non-standard
>> compliant behavior ....
>
> Alexander,
>
> I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
> (0x01410e40) and cannot reproduce any regression. I tried booting from
> tftp and usb, I also rebooted twice to see if there are any side
> effects - nothing - ethernet always comes up as expected.
>
> I am curious, how you determined above commit to be the cause of the
> regression you are seeing. Can you bisect, if you didn't already?

There was no bisecting necessary. I've just looked at what changed in
mv643xx_eth since 3.13 and the first commit I've reverted was already a
hit. Reading a bit source revealed the differences between the old reset
and the newly used one and ended up with my patch (first try) and was a
hit too.

Actually I assumed the reset needs longer than the 500ms, but as the
printks revealed, the reset is much faster.
So the problem seems to be the much increased time (1s) the newly used
reset function idles in mdelay.

But I think I have found the real reason. and the change of the reset
just increased the chance the problem is hit (here with 100% success or
fail rate however you want to name it).

Just give me a day or two to find the time to verify my assumption (I
don't want to speculate) and maybe find a real fix for the problem. Of
course, I still like my patch because it greatly decreases the time
necessary for a reset (and the chance to hit the problem).

Regards,

Alexander Holler

2014-04-02 22:20:45

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

2014-04-02 15:12 GMT-07:00 Alexander Holler <[email protected]>:
> Am 02.04.2014 22:25, schrieb Sebastian Hesselbarth:
>> On 04/02/2014 09:01 PM, Florian Fainelli wrote:
>>> 2014-04-02 2:09 GMT-07:00 Alexander Holler <[email protected]>:
>>>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>>>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>>>> reset to finish.
>>>
>>> Can you resubmit with the following information:
>>>
>>> - you specify the commit that introduce the problem in parenthesis,
>>> e.g: deadbeef ("dead: beef: cafe babe")
>>> - put this dmesg snippet in your log message to clearly illustrate
>>> what is happening
>>> - clarify that the PHY needs to be polled in a comment in
>>> m88e1116r_config_init()
>>>
>>> Meanwhile, it would be good if someone with access to this particular
>>> PHY datasheet could look for known erratas, problems, non-standard
>>> compliant behavior ....
>>
>> Alexander,
>>
>> I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
>> (0x01410e40) and cannot reproduce any regression. I tried booting from
>> tftp and usb, I also rebooted twice to see if there are any side
>> effects - nothing - ethernet always comes up as expected.
>>
>> I am curious, how you determined above commit to be the cause of the
>> regression you are seeing. Can you bisect, if you didn't already?
>
> There was no bisecting necessary. I've just looked at what changed in
> mv643xx_eth since 3.13 and the first commit I've reverted was already a
> hit. Reading a bit source revealed the differences between the old reset
> and the newly used one and ended up with my patch (first try) and was a
> hit too.
>
> Actually I assumed the reset needs longer than the 500ms, but as the
> printks revealed, the reset is much faster.
> So the problem seems to be the much increased time (1s) the newly used
> reset function idles in mdelay.
>
> But I think I have found the real reason. and the change of the reset
> just increased the chance the problem is hit (here with 100% success or
> fail rate however you want to name it).
>
> Just give me a day or two to find the time to verify my assumption (I
> don't want to speculate) and maybe find a real fix for the problem. Of
> course, I still like my patch because it greatly decreases the time
> necessary for a reset (and the chance to hit the problem).

Why so mysterious, care to share your assumption?
--
Florian

2014-04-02 22:27:29

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 04/03/2014 12:12 AM, Alexander Holler wrote:
> Am 02.04.2014 22:25, schrieb Sebastian Hesselbarth:
>> On 04/02/2014 09:01 PM, Florian Fainelli wrote:
>>> 2014-04-02 2:09 GMT-07:00 Alexander Holler <[email protected]>:
>>>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <[email protected]>:
>>>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>>>> reset to finish.
>>>
>>> Can you resubmit with the following information:
>>>
>>> - you specify the commit that introduce the problem in parenthesis,
>>> e.g: deadbeef ("dead: beef: cafe babe")
>>> - put this dmesg snippet in your log message to clearly illustrate
>>> what is happening
>>> - clarify that the PHY needs to be polled in a comment in
>>> m88e1116r_config_init()
>>>
>>> Meanwhile, it would be good if someone with access to this particular
>>> PHY datasheet could look for known erratas, problems, non-standard
>>> compliant behavior ....
>>
>> Alexander,
>>
>> I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
>> (0x01410e40) and cannot reproduce any regression. I tried booting from
>> tftp and usb, I also rebooted twice to see if there are any side
>> effects - nothing - ethernet always comes up as expected.
>>
>> I am curious, how you determined above commit to be the cause of the
>> regression you are seeing. Can you bisect, if you didn't already?
>
> There was no bisecting necessary. I've just looked at what changed in
> mv643xx_eth since 3.13 and the first commit I've reverted was already a
> hit. Reading a bit source revealed the differences between the old reset
> and the newly used one and ended up with my patch (first try) and was a
> hit too.

Honestly, your own fix changes a different driver than mv643xx_eth.
There is a lot of changes from v3.13 to v3.14 and bisecting really
helps to pin-point the one offending patch. As you can see from my
tests with Dockstar, poking in the PHY driver may not be the right
place to fix it.

> Actually I assumed the reset needs longer than the 500ms, but as the
> printks revealed, the reset is much faster.
> So the problem seems to be the much increased time (1s) the newly used
> reset function idles in mdelay.

You assume that the PHY issue comes from waiting for too long _after_
the reset? And again, the very same PHY on Dockstar is not affected.

> But I think I have found the real reason. and the change of the reset
> just increased the chance the problem is hit (here with 100% success or
> fail rate however you want to name it).
>
> Just give me a day or two to find the time to verify my assumption (I
> don't want to speculate) and maybe find a real fix for the problem. Of
> course, I still like my patch because it greatly decreases the time
> necessary for a reset (and the chance to hit the problem).

Well, you can share your idea anytime. You already speculated that PHY
reset on 88e1116r is broken but it seems that is not true. The more
you share of your issue and the tries to fix it, the more likely is it
we can follow your patch immediately.

Again, if you really want to find the real patch breaking Sheevaplug,
use git bisect.

Sebastian

2014-04-02 22:30:23

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 04/02/2014 11:09 AM, Alexander Holler wrote:
> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>> Could you verify the following two things before your patch gets merged:
>>
>> - how long does it take for your PHY to clear the BMCR_RESET bit, is
>> it more than the allowed time out in
>> drivers/net/phy/phy_device.c::phy_poll_reset
>> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
>> we might be hitting some corner case where toggling BMCR_RESET will
>> power it on, but at the expense of waiting longer
>
> I've done two tests with pr_info before and after the two resets in
> m88e1116r_config_init:
>

[...]

> with mdelay (the value after reset is what contains MII_BMCR):
>
> -----------------------
> dmesg | grep -A1 -B1 AHO
> [ 1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version
> 1.4
> [ 1.175888] AHO: before first reset
> [ 1.678806] AHO: after first reset 0x0
> [ 1.683281] AHO: before second reset
> [ 2.186288] AHO: after second reset 0x0
> [ 2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [ 2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
> [ 2.505917] AHO: before first reset
> [ 2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
> --
> [ 2.829987] hub 1-1:1.0: 4 ports detected
> [ 3.044502] AHO: after first reset 0x0

If the above hex at the end is BMCR..

> [ 3.049133] AHO: before second reset
> [ 3.126110] usb 1-1.1: new high-speed USB device number 3 using
> orion-ehci
> --
> [ 3.526107] usb 1-1.3: device descriptor read/64, error -32
> [ 3.614264] AHO: after second reset 0x0
> [ 3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [ 21.335730] Adding 2996116k swap on /dev/sda3. Priority:-1 extents:1
> across:2996116k
> [ 28.195942] AHO: before first reset
> [ 28.696270] AHO: after first reset 0x800

.. have you noticed that your PHY enters POWERDOWN here?

Sebastian

> [ 28.696958] AHO: before second reset
> [ 29.197354] AHO: after second reset 0x800
> [ 111.612263] RPC: Registered named UNIX socket transport module.
> -----------------------

2014-04-03 07:17:42

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
> On 04/03/2014 12:12 AM, Alexander Holler wrote:

>>> I am curious, how you determined above commit to be the cause of the
>>> regression you are seeing. Can you bisect, if you didn't already?
>>
>> There was no bisecting necessary. I've just looked at what changed in
>> mv643xx_eth since 3.13 and the first commit I've reverted was already a
>> hit. Reading a bit source revealed the differences between the old reset
>> and the newly used one and ended up with my patch (first try) and was a
>> hit too.
>
> Honestly, your own fix changes a different driver than mv643xx_eth.

It changes stuff which now (through the mentioned commit) gets used
through the change in mv643xx_eth.

> There is a lot of changes from v3.13 to v3.14 and bisecting really
> helps to pin-point the one offending patch. As you can see from my
> tests with Dockstar, poking in the PHY driver may not be the right
> place to fix it.
>
>> Actually I assumed the reset needs longer than the 500ms, but as the
>> printks revealed, the reset is much faster.
>> So the problem seems to be the much increased time (1s) the newly used
>> reset function idles in mdelay.
>
> You assume that the PHY issue comes from waiting for too long _after_
> the reset? And again, the very same PHY on Dockstar is not affected.

Guess with which hardware I'm experiencing this problem? Hint:
http://ahsoftware.de/dockstar/ ;)

>
>> But I think I have found the real reason. and the change of the reset
>> just increased the chance the problem is hit (here with 100% success or
>> fail rate however you want to name it).
>>
>> Just give me a day or two to find the time to verify my assumption (I
>> don't want to speculate) and maybe find a real fix for the problem. Of
>> course, I still like my patch because it greatly decreases the time
>> necessary for a reset (and the chance to hit the problem).
>
> Well, you can share your idea anytime. You already speculated that PHY
> reset on 88e1116r is broken but it seems that is not true. The more
> you share of your issue and the tries to fix it, the more likely is it
> we can follow your patch immediately.

Sorry, but wild speculating doesn't help always. Otherwise I could
mention several dozen possible reasons, starting from broken memory or
other hw up to some memory corruption elsewhere in the kernel.

But I already have given a hint before, try what happens if you enable
netconsole (compiled in) through the kernel commandline
(netconsole=...). Maybe the ethernet on your dockstar will get stuck too.

>
> Again, if you really want to find the real patch breaking Sheevaplug,
> use git bisect.

That's silly if I already know a/the change which brings the problem to
light. If I revert the mentioned commit the problem disapears. So why
should I go through the pain to bisect stuff? I already have found the
knob to kill the ethernet on that machine.

Regards,

Alexander Holler

2014-04-03 08:49:16

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 04/03/2014 09:17 AM, Alexander Holler wrote:
> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>> I am curious, how you determined above commit to be the cause of the
>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>
>>> There was no bisecting necessary. I've just looked at what changed in
>>> mv643xx_eth since 3.13 and the first commit I've reverted was already a
>>> hit. Reading a bit source revealed the differences between the old reset
>>> and the newly used one and ended up with my patch (first try) and was a
>>> hit too.
>>
>> Honestly, your own fix changes a different driver than mv643xx_eth.
>
> It changes stuff which now (through the mentioned commit) gets used
> through the change in mv643xx_eth.

Sigh. You have proven youself that the commit isn't the root cause
of the issue you are seeing. Nor is "fixing" 88e1116r init sequence
a reasonable fix.

>> There is a lot of changes from v3.13 to v3.14 and bisecting really
>> helps to pin-point the one offending patch. As you can see from my
>> tests with Dockstar, poking in the PHY driver may not be the right
>> place to fix it.
>>
>>> Actually I assumed the reset needs longer than the 500ms, but as the
>>> printks revealed, the reset is much faster.
>>> So the problem seems to be the much increased time (1s) the newly used
>>> reset function idles in mdelay.
>>
>> You assume that the PHY issue comes from waiting for too long _after_
>> the reset? And again, the very same PHY on Dockstar is not affected.
>
> Guess with which hardware I'm experiencing this problem? Hint:
> http://ahsoftware.de/dockstar/ ;)

I don't know, but now I guess it is Dockstar.

>>> But I think I have found the real reason. and the change of the reset
>>> just increased the chance the problem is hit (here with 100% success or
>>> fail rate however you want to name it).
>>>
>>> Just give me a day or two to find the time to verify my assumption (I
>>> don't want to speculate) and maybe find a real fix for the problem. Of
>>> course, I still like my patch because it greatly decreases the time
>>> necessary for a reset (and the chance to hit the problem).
>>
>> Well, you can share your idea anytime. You already speculated that PHY
>> reset on 88e1116r is broken but it seems that is not true. The more
>> you share of your issue and the tries to fix it, the more likely is it
>> we can follow your patch immediately.
>
> Sorry, but wild speculating doesn't help always. Otherwise I could
> mention several dozen possible reasons, starting from broken memory or
> other hw up to some memory corruption elsewhere in the kernel.
>
> But I already have given a hint before, try what happens if you enable
> netconsole (compiled in) through the kernel commandline
> (netconsole=...). Maybe the ethernet on your dockstar will get stuck too.

If it is related to netconsole, I would guess it is more platforms
affected than just Dockstar? If you share the idea, we can try to
find a way to allow netconsole on more than just that
mv643xx_eth/88e116r combination.

>> Again, if you really want to find the real patch breaking Sheevaplug,
>> use git bisect.
>
> That's silly if I already know a/the change which brings the problem to
> light. If I revert the mentioned commit the problem disapears. So why
> should I go through the pain to bisect stuff? I already have found the
> knob to kill the ethernet on that machine.

Really, I can tell you two fixes for it right away: don't use netconsole
or remove Marvell PHY support. But neither is really helping here.

If you share your ideas early, it is at least two more who are looking
at it. This is just a suggestion, you are free to take it though.

Sebastian

2014-04-03 15:07:05

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 03.04.2014 10:49, schrieb Sebastian Hesselbarth:
> On 04/03/2014 09:17 AM, Alexander Holler wrote:
>> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>>> I am curious, how you determined above commit to be the cause of the
>>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>>
>>>> There was no bisecting necessary. I've just looked at what changed in
>>>> mv643xx_eth since 3.13 and the first commit I've reverted was already a
>>>> hit. Reading a bit source revealed the differences between the old reset
>>>> and the newly used one and ended up with my patch (first try) and was a
>>>> hit too.
>>>
>>> Honestly, your own fix changes a different driver than mv643xx_eth.
>>
>> It changes stuff which now (through the mentioned commit) gets used
>> through the change in mv643xx_eth.
>
> Sigh. You have proven youself that the commit isn't the root cause
> of the issue you are seeing. Nor is "fixing" 88e1116r init sequence
> a reasonable fix.

Sorry, continuing this discussion doesn't make sense. You have the same
hw, so just try enabling netconsole with 3.14 and use dhcpcd from
userland (this does the final reset here which hangs.

But don't suggest me (or insist on) a time consuming and totally useless
bisect when I already know what makes the problem to appear (100%
reliable here).

I have better ways to waste my time.

Alexander Holler

2014-04-03 15:12:41

by David Miller

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

From: Alexander Holler <[email protected]>
Date: Thu, 03 Apr 2014 17:06:52 +0200

> But don't suggest me (or insist on) a time consuming

Bisects are not time consuming, and help developers analyze your
issue tremendously.

2014-04-03 15:45:28

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 04/03/2014 05:06 PM, Alexander Holler wrote:
> Am 03.04.2014 10:49, schrieb Sebastian Hesselbarth:
>> On 04/03/2014 09:17 AM, Alexander Holler wrote:
>>> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>>>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>>>> I am curious, how you determined above commit to be the cause of the
>>>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>>>
>>>>> There was no bisecting necessary. I've just looked at what changed in
>>>>> mv643xx_eth since 3.13 and the first commit I've reverted was
>>>>> already a
>>>>> hit. Reading a bit source revealed the differences between the old
>>>>> reset
>>>>> and the newly used one and ended up with my patch (first try) and
>>>>> was a
>>>>> hit too.
>>>>
>>>> Honestly, your own fix changes a different driver than mv643xx_eth.
>>>
>>> It changes stuff which now (through the mentioned commit) gets used
>>> through the change in mv643xx_eth.
>>
>> Sigh. You have proven youself that the commit isn't the root cause
>> of the issue you are seeing. Nor is "fixing" 88e1116r init sequence
>> a reasonable fix.
>
> Sorry, continuing this discussion doesn't make sense. You have the same
> hw, so just try enabling netconsole with 3.14 and use dhcpcd from
> userland (this does the final reset here which hangs.
>
> But don't suggest me (or insist on) a time consuming and totally useless
> bisect when I already know what makes the problem to appear (100%
> reliable here).

I _suggest_ to use bisect, I don't _insist_. But for a patch that
tackles an issue, knowing the offending patch is really good.

Now that you revealed more input, it becomes more clear to me what is
happening (although I have no clue, what netconsole really does to the
eth/phy):

- u-boot powers on the PHY
- netconsole picks up the eth device and the PHY, drops it later
- the PHY state machine powers off the now unused PHY (also introduced
in between v3.13 and v3.14).
- dhcp client ifup's the device and tries to reset the powered down PHY

I already made a comment about the set POWERDOWN bit before, which you
silently ignored.

I will try to reproduce it and if I hit it, will do a bisect to find
(and fix) the offending patch.

> I have better ways to waste my time.

Like writing workarounds instead of fixing bugs?

Sebastian

2014-04-03 15:45:53

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 03.04.2014 17:14, schrieb David Miller:
> From: Alexander Holler <[email protected]>
> Date: Thu, 03 Apr 2014 17:06:52 +0200
>
>> But don't suggest me (or insist on) a time consuming
>
> Bisects are not time consuming, and help developers analyze your
> issue tremendously.

Hmm, compiling and booting several dozen kernels isn't time consuming?
Then I must have done something wrong in the past.

And I wonder why I've writen descriptions at all, if nobody wants to
understand that I already know the patch which leads to a broken network
on that system.

If the patch is really the reason for the problem, is something totally
different, but bisecting won't help here. Besides that I know since 4
years that netconsole is broken on that box, it just never broke the
network or anything else. And so I ignored it, foreseeing the necessary
time to debug and especially describe and discuss the problem.

You see, I'm already feeling like just talking with myself. But just in
case: After reverting patch 7cd1463, there exist a online-change to turn
on/off the problem:

------
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c
b/drivers/net/ethernet/marvell/mv643xx_eth.c
index e891b48..246f065 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -2095,7 +2095,8 @@ static void port_start(struct mv643xx_eth_private *mp)
struct ethtool_cmd cmd;

mv643xx_eth_get_settings(mp->dev, &cmd);
- phy_reset(mp);
+ //phy_reset(mp);
+ phy_init_hw(mp->phy);
mv643xx_eth_set_settings(mp->dev, &cmd);
phy_start(mp->phy);
}
------

Thats basically just what the reverted patch does. Now, why should I
still bisect?

Of course, my first assumption that the changed reset is the bug is
wrong, it just makes the problem to appear. But that wrong assumption
still doesn't make a bisect necessary. And it's discouraging if people
still insist on that.

I will see if I spend the time to write down a more detailed description
about what happens here.

Regards,

Alexander Holler

2014-04-03 15:58:25

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 03.04.2014 17:45, schrieb Sebastian Hesselbarth:

>
> I will try to reproduce it and if I hit it, will do a bisect to find
> (and fix) the offending patch.
>
>> I have better ways to waste my time.
>
> Like writing workarounds instead of fixing bugs?

Sure. If I would try to fix (or even describe) every bug in the kernel I
already know about, I would never have time to do anything else.

Besides that you still try to ignore, that I didn't know at first that
it's a workaround at all. Especially after I've discovered that the
reset is the place where things are failing. Otherwise I wouldn't have
sent the patch (which I know NOW that it just is a workaround).

Alexander Holler

2014-04-03 17:44:14

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

On 04/03/2014 05:06 PM, Alexander Holler wrote:
> Am 03.04.2014 10:49, schrieb Sebastian Hesselbarth:
>> On 04/03/2014 09:17 AM, Alexander Holler wrote:
>>> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>>>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>>>> I am curious, how you determined above commit to be the cause of the
>>>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>>>
>>>>> There was no bisecting necessary. I've just looked at what changed in
>>>>> mv643xx_eth since 3.13 and the first commit I've reverted was
>>>>> already a
>>>>> hit. Reading a bit source revealed the differences between the old
>>>>> reset
>>>>> and the newly used one and ended up with my patch (first try) and
>>>>> was a
>>>>> hit too.
>>>>
>
> Sorry, continuing this discussion doesn't make sense. You have the same
> hw, so just try enabling netconsole with 3.14 and use dhcpcd from
> userland (this does the final reset here which hangs.
>
> But don't suggest me (or insist on) a time consuming and totally useless
> bisect when I already know what makes the problem to appear (100%
> reliable here).

Alexander,

I tried to reproduce the regression on Dockstar with 3.14.0 and
3.14.0-07247-gcd6362be, netconsole enabled, DHCP (dhclient).

I can neither reproduce any issues with ethernet before nor after
dhclient kicks in. I can also use netconsole just fine and see
plenty of kernel bootlog messages sent to netconsole receiver.

Can you _please_ share a full report how to properly reproduce it?

Sebastian

2014-04-03 17:58:21

by Alexander Holler

[permalink] [raw]
Subject: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

(I've changed the topic and removed stable@ from the cc-list to reflect
the current status)

(Long mail, but hopefully a good problem description)

I already knew about problems with netconsole and mv643xx_eth since
4 years, but didn't care a lot because everything else worked flawless,
I even had forgotten that I've enabled netconsole. (But the bugs I've
experienced 4 years ago, seeing no msgs remotely from netconsole seem to
have disappeared).

But now, using 3.14, I hit a bug which killed the ethernet with a 100%
success rate, and, after digging a bit, I've come to the conclusion
that netconsole (together with a maybe broken initialization of the PHY)
is the source of the problem.

The kernel is 3.14 (mainline) with one reverted patch (7cd1463). This
patch changed the initialization of the PHY such, that the ethernet dies
100% reproducible on a Kirkwood 88F6281 based machine. Reverting that
patch gives me a oneline bug-enabler:

------
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c
b/drivers/net/ethernet/marvell/mv643xx_eth.c
index e891b48..246f065 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -2095,7 +2095,8 @@ static void port_start(struct mv643xx_eth_private *mp)
struct ethtool_cmd cmd;

mv643xx_eth_get_settings(mp->dev, &cmd);
- phy_reset(mp);
+ //phy_reset(mp);
+ phy_init_hw(mp->phy);
mv643xx_eth_set_settings(mp->dev, &cmd);
phy_start(mp->phy);
}
------

First I describe what happens at boot:

- Bootloader (U-Boot) enables (somehow) the network such that is usable
as a console for the bootloader,
- Kernel is loaded and started with netconsole enabled through the
kernel command line (netconsole=...),
- eth driver probe => PHY reset
- netconsole initializes the network (netpoll_setup) => PHY reset,
- userland starts,
- userland configures network (ip addr add fixedIP ..., a hack used for
a very early ntpdate before the rootfs becomes rw), I'm not sure if
that's end up again in a PHY reset.
- userland starts network by using dhcpcd => PHY reset

Now several use cases:

Case 1:
Using plain 3.14 the last step fails with no carrier, because the PHY
ends up in a never ending reset (BMCR_RESET always set) in
m88e1111_config_init() called by phy_init_hw() in port_start() in
mv643xx_eth.

Case 2:
Without enabling netconsole through the kernel command line, I see no
problems.

Case 3:
If I enable the old phy_reset() in mv643xx_eth, I see no problems.

Case 4:
If I reduce the time the newly used reset in phy_init_hw() spends in
calling mdelay(500) twice to some milliseconds m88e1111_config_init by
polling for a cleared BMCR_RESET, I see no problems.

Case 5:
If I disable the initialization of the network in the bootloader,
netconsole even worked 4 years ago. But I haven't looked into that case
further, because I always want to use the network as a console for the
bootloader.


Current assumption:

So, after having spend too much time into diagnosing the above stuff (so
I was right in ignoring the non-working netconsole for 4 years), I've
comed to the conclusion that some synchronization between
netconsole/netpoll and the normal network stack or mv643xx_eth is
missing. That would explain why the PHY ends up in a never ending reset
and why this only happens reproducible if the PHY reset needs a whole
second by using mdelay(500) twice (which likely is used to switch
the task to netconsole inbetween). It might be a hw problem too (I
haven't read the datasheet or looked for any erratas).

I hope everyone who missed some more information is happy now, otherwise
I (again) wasted time to type a problem description (not to speak about
the already spent time trying to diagnose the problem)

So go on and try to take the almost low hanging fruit. I'm not sure if I
will spend more time on that topic as I already have a working
patch/workaround and the discussion has become a bit tiresome. Sorry.

Regards,

Alexander Holler

2014-04-03 18:21:01

by Alexander Holler

[permalink] [raw]
Subject: Re: [PATCH regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Am 03.04.2014 19:44, schrieb Sebastian Hesselbarth:

> I tried to reproduce the regression on Dockstar with 3.14.0 and
> 3.14.0-07247-gcd6362be, netconsole enabled, DHCP (dhclient).

Try using dhcpcd, that's why I mentioned it. DHCP-clients do work
differently.

> I can neither reproduce any issues with ethernet before nor after
> dhclient kicks in. I can also use netconsole just fine and see
> plenty of kernel bootlog messages sent to netconsole receiver.
>
> Can you _please_ share a full report how to properly reproduce it?

Impossible because timing and other stuff (boot media, switch,
type/version of iproute, dhcp-client, ...) might be involved.

Alexander Holler

BTW: if you want to fix another bug, changing the mtu does reset the
ethernet device. Something special for mv643xx_eth and not very nice. ;)

2014-04-03 18:21:54

by Sebastian Hesselbarth

[permalink] [raw]
Subject: Re: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

On 04/03/2014 07:58 PM, Alexander Holler wrote:
[...]
> I already knew about problems with netconsole and mv643xx_eth since
> 4 years, but didn't care a lot because everything else worked flawless,
> I even had forgotten that I've enabled netconsole. (But the bugs I've
> experienced 4 years ago, seeing no msgs remotely from netconsole seem to
> have disappeared).

Alexander,

you need to be more specific. You cannot just say "problems" and "bugs".

I can run the very same board with v3.14.0 and netconsole enabled. No
PHY hickups, plenty netconsole messages, no dying PHY at any time.

> But now, using 3.14, I hit a bug which killed the ethernet with a 100%
> success rate, and, after digging a bit, I've come to the conclusion
> that netconsole (together with a maybe broken initialization of the PHY)
> is the source of the problem.

Given the fact that I can use above two just fine, I still doubt it is
related.

> The kernel is 3.14 (mainline) with one reverted patch (7cd1463). This
> patch changed the initialization of the PHY such, that the ethernet dies
> 100% reproducible on a Kirkwood 88F6281 based machine. Reverting that
> patch gives me a oneline bug-enabler:
>
> ------
> diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c
> b/drivers/net/ethernet/marvell/mv643xx_eth.c
> index e891b48..246f065 100644
> --- a/drivers/net/ethernet/marvell/mv643xx_eth.c
> +++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
> @@ -2095,7 +2095,8 @@ static void port_start(struct mv643xx_eth_private
> *mp)
> struct ethtool_cmd cmd;
>
> mv643xx_eth_get_settings(mp->dev, &cmd);
> - phy_reset(mp);
> + //phy_reset(mp);
> + phy_init_hw(mp->phy);
> mv643xx_eth_set_settings(mp->dev, &cmd);
> phy_start(mp->phy);
> }
> ------
>
> First I describe what happens at boot:
>
> - Bootloader (U-Boot) enables (somehow) the network such that is usable
> as a console for the bootloader,
> - Kernel is loaded and started with netconsole enabled through the
> kernel command line (netconsole=...),

Please provide full cmdline. I can use netconsole on it with

console=ttyS0,115200 root=/dev/sda2 rootwait rw
[email protected]/,@192.168.1.101/

.54 and .101 are netconsole sender and receiver, respectively.
/dev/sda2 is ext4 on a 4G usb stick. Also, my dockstar runs on
a power-supply that provides more power than the original one.

I mention this, because if you e.g. run a usb hdd on the dockstar,
it _can_ draw a lot of power and possibly cause the PHY to hard-reset.

Just to make sure: can you also provide above information or even
better change your setup to boot from usb stick?

> - eth driver probe => PHY reset
> - netconsole initializes the network (netpoll_setup) => PHY reset,
> - userland starts,
> - userland configures network (ip addr add fixedIP ..., a hack used for
> a very early ntpdate before the rootfs becomes rw), I'm not sure if
> that's end up again in a PHY reset.
> - userland starts network by using dhcpcd => PHY reset
>
> Now several use cases:
>
> Case 1:
> Using plain 3.14 the last step fails with no carrier, because the PHY
> ends up in a never ending reset (BMCR_RESET always set) in
> m88e1111_config_init() called by phy_init_hw() in port_start() in

PHY on Dockstar is 88E1116R so above should have been
m88r1116r_config_init()?

> mv643xx_eth.
>
> Case 2:
> Without enabling netconsole through the kernel command line, I see no
> problems.

I can even enable netconsole and see no issues.

> Case 3:
> If I enable the old phy_reset() in mv643xx_eth, I see no problems.
>
> Case 4:
> If I reduce the time the newly used reset in phy_init_hw() spends in
> calling mdelay(500) twice to some milliseconds m88e1111_config_init by
> polling for a cleared BMCR_RESET, I see no problems.
>
> Case 5:
> If I disable the initialization of the network in the bootloader,
> netconsole even worked 4 years ago. But I haven't looked into that case
> further, because I always want to use the network as a console for the
> bootloader.

Ok, that one is actually different from my setup. I have no netconsole
support for u-boot. Is your statement from _now_ or _4 years ago_, i.e.
does disabling u-boot netconsole _now_ change anything?

> Current assumption:
>
[...]
>
> I hope everyone who missed some more information is happy now, otherwise
> I (again) wasted time to type a problem description (not to speak about
> the already spent time trying to diagnose the problem)
>
> So go on and try to take the almost low hanging fruit. I'm not sure if I
> will spend more time on that topic as I already have a working
> patch/workaround and the discussion has become a bit tiresome. Sorry.

Alexander, please stop f*cking around here. Florian, David, and me
already showed a lot patience. None of your patches deals with the
real issue. Either help others to properly reproduce it or leave it.

Sebastian

2014-04-03 18:23:45

by Alexander Holler

[permalink] [raw]
Subject: Re: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

Am 03.04.2014 20:21, schrieb Sebastian Hesselbarth:
> On 04/03/2014 07:58 PM, Alexander Holler wrote:
> [...]
>> I already knew about problems with netconsole and mv643xx_eth since
>> 4 years, but didn't care a lot because everything else worked flawless,
>> I even had forgotten that I've enabled netconsole. (But the bugs I've
>> experienced 4 years ago, seeing no msgs remotely from netconsole seem to
>> have disappeared).
>
> Alexander,
>
> you need to be more specific. You cannot just say "problems" and "bugs".

I can. Full stop. Thanks.

2014-04-03 18:39:35

by Alexander Holler

[permalink] [raw]
Subject: Re: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

Am 03.04.2014 20:21, schrieb Sebastian Hesselbarth:
> On 04/03/2014 07:58 PM, Alexander Holler wrote:

>> I hope everyone who missed some more information is happy now, otherwise
>> I (again) wasted time to type a problem description (not to speak about
>> the already spent time trying to diagnose the problem)
>>
>> So go on and try to take the almost low hanging fruit. I'm not sure if I
>> will spend more time on that topic as I already have a working
>> patch/workaround and the discussion has become a bit tiresome. Sorry.
>
> Alexander, please stop f*cking around here. Florian, David, and me
> already showed a lot patience. None of your patches deals with the
> real issue. Either help others to properly reproduce it or leave it.

I haven't requested help.

I've send a patch which fixed a problem here I had with a new reset. The
problem appears inside the reset (the reset doesn't end), so the patch
looked very reasonable. That's all.

And, please, keep your words where they belong to.

Alexander Holler

2014-04-03 18:44:58

by Florian Fainelli

[permalink] [raw]
Subject: Re: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

2014-04-03 11:39 GMT-07:00 Alexander Holler <[email protected]>:
> Am 03.04.2014 20:21, schrieb Sebastian Hesselbarth:
>>
>> On 04/03/2014 07:58 PM, Alexander Holler wrote:
>
>
>>> I hope everyone who missed some more information is happy now, otherwise
>>> I (again) wasted time to type a problem description (not to speak about
>>> the already spent time trying to diagnose the problem)
>>>
>>> So go on and try to take the almost low hanging fruit. I'm not sure if I
>>> will spend more time on that topic as I already have a working
>>> patch/workaround and the discussion has become a bit tiresome. Sorry.
>>
>>
>> Alexander, please stop f*cking around here. Florian, David, and me
>> already showed a lot patience. None of your patches deals with the
>> real issue. Either help others to properly reproduce it or leave it.
>
>
> I haven't requested help.
>
> I've send a patch which fixed a problem here I had with a new reset. The
> problem appears inside the reset (the reset doesn't end), so the patch
> looked very reasonable. That's all.

You said so yourself, your patch is a workaround. What Sebastian is
asking for, which is perfectly reasonable is for you to root cause
this, since 1) he is not able to reproduce the problem, 2) the initial
patch is not acceptable as-is because it lacks too many critical
pieces of information.
--
Florian

2014-04-04 11:37:13

by Alexander Holler

[permalink] [raw]
Subject: Re: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

Thanks for proving all my assumptions about kernel maintainers.