2023-08-04 17:45:30

by Nick Bowler

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On 2023-08-04, Nick Bowler <[email protected]> wrote:
> On 04/08/2023, Rob Herring <[email protected]> wrote:
>> On Fri, Aug 4, 2023 at 9:27 AM Nick Bowler <[email protected]> wrote:
>>> commit e461bd6f43f4e568f7436a8b6bc21c4ce6914c36
>>> Author: Robert Hancock <[email protected]>
>>> Date: Thu Jan 27 10:37:36 2022 -0600
>>>
>>> arm64: dts: zynqmp: Added GEM reset definitions
>>>
>>> Reverting this fixes the problem on 5.18. Reverting this fixes the
>>> problem on 6.1. Reverting this fixes the problem on 6.4. In all of
>>> these versions, with this change reverted, the network device appears
>>> without delay.
>>
>> With the above change, the kernel is going to be waiting for the reset
>> driver which either didn't exist or wasn't enabled in your config
>> (maybe kconfig needs to be tweaked to enable it automatically).
>
> The dts defines a reset-controller node with
>
> compatible = "xlnx,zynqmp-reset"
>
> As far as I can see, this is supposed to be handled by the code in
> drivers/reset/zynqmp-reset.c driver, it is enabled by CONFIG_ARCH_ZYNQMP,
> and I have that set to "y", and it appears to be getting compiled in (that
> is, there is a drivers/reset/zynqmp-reset.o file in the build directory).

Oh, I get it, to include this driver I need to also enable:

CONFIG_RESET_CONTROLLER=y

Setting this fixes 6.4. Perhaps CONFIG_ARCH_ZYNQMP should select it?
I guess the reset-zynqmp.o file that was in my build directory must
have been leftover garbage from a long time ago.

However, even with this option enabled, 6.5-rc4 remains broken (no
change in behaviour wrt. the network device). I will bisect this
now.

Cheers,
Nick


2023-08-04 17:51:11

by Rob Herring

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On Fri, Aug 4, 2023 at 10:54 AM Nick Bowler <[email protected]> wrote:
>
> On 2023-08-04, Nick Bowler <[email protected]> wrote:
> > On 04/08/2023, Rob Herring <[email protected]> wrote:
> >> On Fri, Aug 4, 2023 at 9:27 AM Nick Bowler <[email protected]> wrote:
> >>> commit e461bd6f43f4e568f7436a8b6bc21c4ce6914c36
> >>> Author: Robert Hancock <[email protected]>
> >>> Date: Thu Jan 27 10:37:36 2022 -0600
> >>>
> >>> arm64: dts: zynqmp: Added GEM reset definitions
> >>>
> >>> Reverting this fixes the problem on 5.18. Reverting this fixes the
> >>> problem on 6.1. Reverting this fixes the problem on 6.4. In all of
> >>> these versions, with this change reverted, the network device appears
> >>> without delay.
> >>
> >> With the above change, the kernel is going to be waiting for the reset
> >> driver which either didn't exist or wasn't enabled in your config
> >> (maybe kconfig needs to be tweaked to enable it automatically).
> >
> > The dts defines a reset-controller node with
> >
> > compatible = "xlnx,zynqmp-reset"
> >
> > As far as I can see, this is supposed to be handled by the code in
> > drivers/reset/zynqmp-reset.c driver, it is enabled by CONFIG_ARCH_ZYNQMP,
> > and I have that set to "y", and it appears to be getting compiled in (that
> > is, there is a drivers/reset/zynqmp-reset.o file in the build directory).
>
> Oh, I get it, to include this driver I need to also enable:
>
> CONFIG_RESET_CONTROLLER=y
>
> Setting this fixes 6.4. Perhaps CONFIG_ARCH_ZYNQMP should select it?

Maybe. Do other platforms do that?

> I guess the reset-zynqmp.o file that was in my build directory must
> have been leftover garbage from a long time ago.
>
> However, even with this option enabled, 6.5-rc4 remains broken (no
> change in behaviour wrt. the network device). I will bisect this
> now.

It would be good to know why the deferred probe timeout doesn't work.
If you disable modules, the kernel shouldn't wait past late_initcall.
Though this functionality keeps getting tweaked, so I may be off on
the current behavior.

Rob

2023-08-04 18:16:21

by Nick Bowler

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On 2023-08-04, Rob Herring <[email protected]> wrote:
> On Fri, Aug 4, 2023 at 10:54 AM Nick Bowler <[email protected]> wrote:
>> Oh, I get it, to include this driver I need to also enable:
>>
>> CONFIG_RESET_CONTROLLER=y
>>
>> Setting this fixes 6.4. Perhaps CONFIG_ARCH_ZYNQMP should select it?
>
> Maybe. Do other platforms do that?

Of the ~40 platforms in arch/arm64/Kconfig.platforms, there appear to
be 5 that do select it.

>> However, even with this option enabled, 6.5-rc4 remains broken (no
>> change in behaviour wrt. the network device). I will bisect this
>> now.
>
> It would be good to know why the deferred probe timeout doesn't work.
> If you disable modules, the kernel shouldn't wait past late_initcall.
> Though this functionality keeps getting tweaked, so I may be off on
> the current behavior.

I don't know about the deferred probe timeout, but I bisected the 6.5-rc4
breakage to this commit:

commit c720a1f5e6ee8cb39c28435efc0819cec84d6ee2
Author: Michal Simek <[email protected]>
Date: Mon May 22 16:59:48 2023 +0200

arm64: zynqmp: Describe TI phy as ethernet-phy-id

So, reverting that on master appears to correct the issue (together with
setting CONFIG_RESET_CONTROLLER=y).

2023-08-04 20:52:11

by Rob Herring

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

+Saravana

On Fri, Aug 4, 2023 at 11:52 AM Nick Bowler <[email protected]> wrote:
>
> On 2023-08-04, Rob Herring <[email protected]> wrote:
> > On Fri, Aug 4, 2023 at 10:54 AM Nick Bowler <[email protected]> wrote:
> >> Oh, I get it, to include this driver I need to also enable:
> >>
> >> CONFIG_RESET_CONTROLLER=y
> >>
> >> Setting this fixes 6.4. Perhaps CONFIG_ARCH_ZYNQMP should select it?
> >
> > Maybe. Do other platforms do that?
>
> Of the ~40 platforms in arch/arm64/Kconfig.platforms, there appear to
> be 5 that do select it.

Then selecting should be okay. Unless there's a desire for resets to
remain optional (which is going to rely on the timeout).

> >> However, even with this option enabled, 6.5-rc4 remains broken (no
> >> change in behaviour wrt. the network device). I will bisect this
> >> now.
> >
> > It would be good to know why the deferred probe timeout doesn't work.
> > If you disable modules, the kernel shouldn't wait past late_initcall.
> > Though this functionality keeps getting tweaked, so I may be off on
> > the current behavior.
>
> I don't know about the deferred probe timeout, but I bisected the 6.5-rc4
> breakage to this commit:
>
> commit c720a1f5e6ee8cb39c28435efc0819cec84d6ee2
> Author: Michal Simek <[email protected]>
> Date: Mon May 22 16:59:48 2023 +0200
>
> arm64: zynqmp: Describe TI phy as ethernet-phy-id

I don't see anything obviously problematic with that commit. (The
#phy-cells property added is wrong as ethernet phys don't use the phy
binding, but that should just be ignored). I'd check if the phy probed
and has a DT node associated with it.

fw_devlink tracks parent-child dependencies and maybe changing to
parent-grandchild affected that. We don't yet track 'phy-handle'
dependencies, but we'd have a circular one here if we did (though that
should be handled). Does "fw_devlink=off" help?

> So, reverting that on master appears to correct the issue (together with
> setting CONFIG_RESET_CONTROLLER=y).

2023-08-04 23:35:00

by Nick Bowler

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On 2023-08-04, Rob Herring <[email protected]> wrote:
> On Fri, Aug 4, 2023 at 11:52 AM Nick Bowler <[email protected]> wrote:
>> I don't know about the deferred probe timeout, but I bisected the 6.5-rc4
>> breakage to this commit:
>>
>> commit c720a1f5e6ee8cb39c28435efc0819cec84d6ee2
>> Author: Michal Simek <[email protected]>
>> Date: Mon May 22 16:59:48 2023 +0200
>>
>> arm64: zynqmp: Describe TI phy as ethernet-phy-id
>
> I don't see anything obviously problematic with that commit. (The
> #phy-cells property added is wrong as ethernet phys don't use the phy
> binding, but that should just be ignored). I'd check if the phy probed
> and has a DT node associated with it.

I think the answer is "no, the phy was not probed". Without reverting
that commit, there is absolutely nothing in /sys/bus/mdio_bus/devices.
There is no phy device link under /sys/bus/mdio_bus/drivers/"TI DP83867",
and there is no mdio_bus under /sys/bus/platform/devices/ff0e0000.ethernet.

When I revert that commit, I can locate the phy device under all these
locations.

> fw_devlink tracks parent-child dependencies and maybe changing to
> parent-grandchild affected that. We don't yet track 'phy-handle'
> dependencies, but we'd have a circular one here if we did (though that
> should be handled). Does "fw_devlink=off" help?

Booting with fw_devlink=off results in no obvious change in behaviour.

Thanks,
Nick

2023-08-04 23:54:48

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On Fri, Aug 04, 2023 at 05:31:21PM -0400, Nick Bowler wrote:
> On 2023-08-04, Rob Herring <[email protected]> wrote:
> > On Fri, Aug 4, 2023 at 11:52 AM Nick Bowler <[email protected]> wrote:
> >> I don't know about the deferred probe timeout, but I bisected the 6.5-rc4
> >> breakage to this commit:
> >>
> >> commit c720a1f5e6ee8cb39c28435efc0819cec84d6ee2
> >> Author: Michal Simek <[email protected]>
> >> Date: Mon May 22 16:59:48 2023 +0200
> >>
> >> arm64: zynqmp: Describe TI phy as ethernet-phy-id
> >
> > I don't see anything obviously problematic with that commit. (The
> > #phy-cells property added is wrong as ethernet phys don't use the phy
> > binding, but that should just be ignored). I'd check if the phy probed
> > and has a DT node associated with it.
>
> I think the answer is "no, the phy was not probed". Without reverting
> that commit, there is absolutely nothing in /sys/bus/mdio_bus/devices.
> There is no phy device link under /sys/bus/mdio_bus/drivers/"TI DP83867",
> and there is no mdio_bus under /sys/bus/platform/devices/ff0e0000.ethernet.
>
> When I revert that commit, I can locate the phy device under all these
> locations.
>
> > fw_devlink tracks parent-child dependencies and maybe changing to
> > parent-grandchild affected that. We don't yet track 'phy-handle'
> > dependencies, but we'd have a circular one here if we did (though that
> > should be handled). Does "fw_devlink=off" help?
>
> Booting with fw_devlink=off results in no obvious change in behaviour.

I think we need to rewind a tad.

My understanding is that this uses the Cadence macb driver.

In your original message, you said that the ethernet driver wasn't
being bound to the driver.

Since the ethernet driver is responsible for spotting the "mdio"
sub-node and creating the MDIO bus, if the driver isn't being
successfully bound, then the MDIO bus and the PHYs on the bus won't be
created, so you won't find them in /sys/bus/mdio_bus/devices.

Moreover, the Cadence macb driver, and this doesn't care about the
presence of the PHY at probe time, only when the network interface is
brought up. See macb_phylink_connect() which is called from
macb_open().

So, I think that the deferred probing has nothing to do with PHYs, and
that's just a wild goose chase.

I think instead we need to be concentrating on what's going on with
the ethernet driver, and why the ethernet driver is deferring its
probe. Is macb_probe() getting called at all? How far through
macb_probe() do we get before we defer?

I think those are the key questions that need answering.

Maybe, if you can get access to the machine while the driver is
deferring, /sys/kernel/debug/devices_deferred might give some
useful information, but that's just a hope.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

2023-08-05 03:06:50

by Saravana Kannan

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On Fri, Aug 4, 2023 at 1:22 PM Rob Herring <[email protected]> wrote:
>
> +Saravana

I'll look into this next week and reply.

-Saravana

>
> On Fri, Aug 4, 2023 at 11:52 AM Nick Bowler <[email protected]> wrote:
> >
> > On 2023-08-04, Rob Herring <[email protected]> wrote:
> > > On Fri, Aug 4, 2023 at 10:54 AM Nick Bowler <[email protected]> wrote:
> > >> Oh, I get it, to include this driver I need to also enable:
> > >>
> > >> CONFIG_RESET_CONTROLLER=y
> > >>
> > >> Setting this fixes 6.4. Perhaps CONFIG_ARCH_ZYNQMP should select it?
> > >
> > > Maybe. Do other platforms do that?
> >
> > Of the ~40 platforms in arch/arm64/Kconfig.platforms, there appear to
> > be 5 that do select it.
>
> Then selecting should be okay. Unless there's a desire for resets to
> remain optional (which is going to rely on the timeout).
>
> > >> However, even with this option enabled, 6.5-rc4 remains broken (no
> > >> change in behaviour wrt. the network device). I will bisect this
> > >> now.
> > >
> > > It would be good to know why the deferred probe timeout doesn't work.
> > > If you disable modules, the kernel shouldn't wait past late_initcall.
> > > Though this functionality keeps getting tweaked, so I may be off on
> > > the current behavior.
> >
> > I don't know about the deferred probe timeout, but I bisected the 6.5-rc4
> > breakage to this commit:
> >
> > commit c720a1f5e6ee8cb39c28435efc0819cec84d6ee2
> > Author: Michal Simek <[email protected]>
> > Date: Mon May 22 16:59:48 2023 +0200
> >
> > arm64: zynqmp: Describe TI phy as ethernet-phy-id
>
> I don't see anything obviously problematic with that commit. (The
> #phy-cells property added is wrong as ethernet phys don't use the phy
> binding, but that should just be ignored). I'd check if the phy probed
> and has a DT node associated with it.
>
> fw_devlink tracks parent-child dependencies and maybe changing to
> parent-grandchild affected that. We don't yet track 'phy-handle'
> dependencies, but we'd have a circular one here if we did (though that
> should be handled). Does "fw_devlink=off" help?
>
> > So, reverting that on master appears to correct the issue (together with
> > setting CONFIG_RESET_CONTROLLER=y).

2023-08-05 07:23:53

by Andrew Lunn

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

> When I revert that commit, I can locate the phy device under all these
> locations.

Please do what Russell requested, but also try commenting out the PHY
reset-gpios line in DT which was added by
c720a1f5e6ee8cb39c28435efc0819cec84d6ee2. It was also commented out
before that change. It could be that gpio controller is missing. Do
you have the driver for the tca6416 in your kernel configuration?

Andrew

2023-08-05 07:41:23

by Nick Bowler

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On 2023-08-04, Russell King (Oracle) <[email protected]> wrote:
> I think we need to rewind a tad.
>
> My understanding is that this uses the Cadence macb driver.

Correct.

> In your original message, you said that the ethernet driver wasn't
> being bound to the driver.
[...]
> So, I think that the deferred probing has nothing to do with PHYs, and
> that's just a wild goose chase.
>
> I think instead we need to be concentrating on what's going on with
> the ethernet driver, and why the ethernet driver is deferring its
> probe. Is macb_probe() getting called at all?

I added some prints to the driver. The macb_probe is called five times
on this one device initially at boot, then ten seconds later it is
called one last time, returning -EPROBE_DEFER each time.

> How far through macb_probe() do we get before we defer?

The result is the same for all six calls. The macb_mdiobus_register
function returns -EPROBE_DEFER, which comes from the topmost call
to of_mdiobus_register within that function. That is, this is the
part that returns -EPROBE_DEFER:

child = of_get_child_by_name(np, "mdio");
if (child) {
int ret = of_mdiobus_register(bp->mii_bus, child);

of_node_put(child);
return ret;
}

> I think those are the key questions that need answering.
>
> Maybe, if you can get access to the machine while the driver is
> deferring, /sys/kernel/debug/devices_deferred might give some
> useful information, but that's just a hope.

This just contains one line:

ff0e0000.ethernet

which is the name of the ethernet device that is not working.

Thanks,
Nick

2023-08-05 07:42:43

by Nick Bowler

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

Hi,

On 2023-08-05, Andrew Lunn <[email protected]> wrote:
>> When I revert that commit, I can locate the phy device under all these
>> locations.
>
> Please do what Russell requested, but also try commenting out the PHY
> reset-gpios line in DT which was added by
> c720a1f5e6ee8cb39c28435efc0819cec84d6ee2.

Removing the reset-gpios line from the dts appears to be sufficient to
restore working behaviour.

> It was also commented out before that change. It could be that gpio
> controller is missing. Do you have the driver for the tca6416 in
> your kernel configuration?

I have CONFIG_GPIO_PCA953X=y which I think is the correct driver?

Thanks,
Nick

2023-08-05 07:46:16

by Andrew Lunn

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

> The result is the same for all six calls. The macb_mdiobus_register
> function returns -EPROBE_DEFER, which comes from the topmost call
> to of_mdiobus_register within that function. That is, this is the
> part that returns -EPROBE_DEFER:
>
> child = of_get_child_by_name(np, "mdio");
> if (child) {
> int ret = of_mdiobus_register(bp->mii_bus, child);
>
> of_node_put(child);
> return ret;
> }

So you need to keep going down and seeing where is EPROBE_DEFER is
coming from.

You are missing some resource somewhere, probably because you are
missing a driver. Sometimes it is worth running a distro kernel which
has nearly everything enabled as modules, so you can find out what you
actually need. Then use 'make localmodconfig' to reduce the
configuration down to just what you need.

Andrew

2023-08-05 09:12:20

by Andrew Lunn

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

> > It was also commented out before that change. It could be that gpio
> > controller is missing. Do you have the driver for the tca6416 in
> > your kernel configuration?
>
> I have CONFIG_GPIO_PCA953X=y which I think is the correct driver?

It does appear to be the correct driver. But check if it has
loaded. It is an i2c device, so maybe you are missing the I2C bus
master device?

Andrew

2023-08-05 13:00:33

by Nick Bowler

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

On 2023-08-05, Andrew Lunn <[email protected]> wrote:
>> > It was also commented out before that change. It could be that gpio
>> > controller is missing. Do you have the driver for the tca6416 in
>> > your kernel configuration?
>>
>> I have CONFIG_GPIO_PCA953X=y which I think is the correct driver?
>
> It does appear to be the correct driver. But check if it has
> loaded. It is an i2c device, so maybe you are missing the I2C bus
> master device?

That's it! I needed to set

CONFIG_I2C_CADENCE=y

and now things are working again!

Thanks,
Nick

2023-08-29 17:02:07

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: PROBLEM: Broken or delayed ethernet on Xilinx ZCU104 since 5.18 (regression)

[TLDR: This mail in primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 05.08.23 09:34, Nick Bowler wrote:
> On 2023-08-05, Andrew Lunn <[email protected]> wrote:
>>>> It was also commented out before that change. It could be that gpio
>>>> controller is missing. Do you have the driver for the tca6416 in
>>>> your kernel configuration?
>>>
>>> I have CONFIG_GPIO_PCA953X=y which I think is the correct driver?
>>
>> It does appear to be the correct driver. But check if it has
>> loaded. It is an i2c device, so maybe you are missing the I2C bus
>> master device?
>
> That's it! I needed to set
>
> CONFIG_I2C_CADENCE=y
>
> and now things are working again!

#regzbot resolve: a config change did the trick
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.