Hi Vladimir, Andrew,
sorry for the late response.
> Gesendet: Donnerstag, 09. September 2021 um 14:56 Uhr
> Von: "Vladimir Oltean" <[email protected]>
> An: "Lino Sanfilippo" <[email protected]>
> Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
> Betreff: Re: [PATCH 0/3] Fix for KSZ DSA switch shutdown
>
> On Thu, Sep 09, 2021 at 02:42:48PM +0300, Vladimir Oltean wrote:
> > I feel that something is missing in your system. Is the device link
> > created? Is it deleted before going into effect on shutdown?
>
> So in case my questions were confusing, you can check the presence of
> the device links via sysfs.
>
> On my board, eno2 is the top-level DSA master, there is a switch which
> is PCIe PF 0000:00:00.5 which is its consumer:
>
> ls -la /sys/class/net/eno2/device/consumer\:pci\:0000\:00\:00.5
> lrwxrwxrwx 1 root root 0 Jan 1 00:00 /sys/class/net/eno2/device/consumer:pci:0000:00:00.5 -> ../../../../../virtual/devlink/pci:0000:00:00.2--pci:0000:00:00.5
>
> In turn, that switch is a DSA master on two ports for SPI-attached switches:
>
> ls -la /sys/class/net/swp0/device/consumer\:spi\:spi2.*
> lrwxrwxrwx 1 root root 0 Jan 1 00:04 /sys/class/net/swp0/device/consumer:spi:spi2.0 -> ../../../../../virtual/devlink/pci:0000:00:00.5--spi:spi2.0
> lrwxrwxrwx 1 root root 0 Jan 1 00:04 /sys/class/net/swp0/device/consumer:spi:spi2.1 -> ../../../../../virtual/devlink/pci:0000:00:00.5--spi:spi2.1
>
> Do you see similar things on your 5.10 kernel?
For the master device is see
lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
>
> Please note that I don't think that particular patch with device links
> was backported to v5.10, at least I don't see it when I run:
> git tag --contains 07b90056cb15f
>
> So how did it reach your tree?
>
The kernel I use is the Raspberry Pi 5.10 kernel. The commit number in this kernel is d0b97c8cd63e37e6d4dc9fefd6381b09f6c31a67
Andrew: the switch is not on a hat, the device tree part I use is:
spi@7e204c00 {
cs-gpios = <0x0000000f 0x00000010 0x00000001 0x0000000f 0x00000012 0x00000001>;
pinctrl-0 = <0x000000e5 0x000000e6>;
pinctrl-names = "default";
compatible = "brcm,bcm2835-spi";
reg = <0x7e204c00 0x00000200>;
interrupts = <0x00000000 0x00000076 0x00000004>;
clocks = <0x00000007 0x00000014>;
#address-cells = <0x00000001>;
#size-cells = <0x00000000>;
status = "okay";
phandle = <0x000000be>;
tpm@1 {
phandle = <0x000000ed>;
status = "okay";
interrupts = <0x0000000a 0x00000008>;
#interrupt-cells = <0x00000002>;
interrupt-parent = <0x0000000f>;
spi-max-frequency = <0x000f4240>;
reg = <0x00000001>;
pinctrl-0 = <0x000000e7>;
pinctrl-names = "default";
compatible = "infineon,slb9670";
};
ksz9897@0 {
phandle = <0x000000ec>;
status = "okay";
reset-gpios = <0x000000e1 0x0000000d 0x00000001>;
spi-cpol;
spi-cpha;
spi-max-frequency = <0x01f78a40>;
reg = <0x00000000>;
compatible = "microchip,ksz9897";
ports {
#size-cells = <0x00000000>;
#address-cells = <0x00000001>;
port@2 {
label = "piright";
reg = <0x00000002>;
};
port@1 {
label = "pileft";
reg = <0x00000001>;
};
port@0 {
ethernet = <0x000000d7>;
label = "cpu";
reg = <0x00000000>;
};
};
};
Regards,
Lino
On 09.09.21 at 15:19, Lino Sanfilippo wrote:
>>
>
> The kernel I use is the Raspberry Pi 5.10 kernel. The commit number in this kernel is d0b97c8cd63e37e6d4dc9fefd6381b09f6c31a67
>
This is not correct. The kernel I use right now is based on Gregs stable linux-5.10.y.
The commit number is correct here. Sorry for the confusion.
Regards,
Lino
> This is not correct. The kernel I use right now is based on Gregs stable linux-5.10.y.
> The commit number is correct here. Sorry for the confusion.
Can you use 5.14.2?
When we understand the problem, the fixes will need to be for
net-next, which will be based on 5.15-rcX. They will then be
backported to 5.10. So you need to do some testing on a newer
kernel. Such testing will also help us figure out if it is a new
problem, or a backporting problem.
Andrew
> Andrew: the switch is not on a hat, the device tree part I use is:
And this is not an overlay. It is all there at boot?
I was just thinking that maybe the Ethernet interface gets opened at
boot, and overlay is loaded, and the interface is opened a second
time. I don't know of anybody using DSA with overlays, so that could
of been the key difference which breaks it for you.
Your decompiled DT blob looks O.K.
Andrew
On Thu, Sep 09, 2021 at 03:19:52PM +0200, Lino Sanfilippo wrote:
> > Do you see similar things on your 5.10 kernel?
>
> For the master device is see
>
> lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
So this is the worst of the worst, we have a device link but it doesn't help.
Where the device link helps is here:
__device_release_driver
while (device_links_busy(dev))
device_links_unbind_consumers(dev);
but during dev_shutdown, device_links_unbind_consumers does not get called
(actually I am not even sure whether it should).
I've reproduced your issue by making this very simple change:
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
index 60d94e0a07d6..ec00f34cac47 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
@@ -1372,6 +1372,7 @@ static struct pci_driver enetc_pf_driver = {
.id_table = enetc_pf_id_table,
.probe = enetc_pf_probe,
.remove = enetc_pf_remove,
+ .shutdown = enetc_pf_remove,
#ifdef CONFIG_PCI_IOV
.sriov_configure = enetc_sriov_configure,
#endif
on my DSA master driver. This is what the genet driver has "special".
I was led into grave error by Documentation/driver-api/device_link.rst,
which I've based my patch on, where it clearly says that device links
are supposed to help with shutdown ordering (how?!).
So the question is, why did my DSA trees get torn down on shutdown?
Basically the short answer is that my SPI controller driver does
implement .shutdown, and calls the same code path as the .remove code,
which calls spi_unregister_controller which removes all SPI children..
When I added this device link, one of the main objectives was to not
modify all DSA drivers. I was certain based on the documentation that
device links would help, now I'm not so sure anymore.
So what happens is that the DSA master attempts to unregister its net
device on .shutdown, but DSA does not implement .shutdown, so it just
sits there holding a reference (supposedly via dev_hold, but where from?!)
to the master, which makes netdev_wait_allrefs to wait and wait.
I need more time for the denial phase to pass, and to understand what
can actually be done. I will also be away from the keyboard for the next
few days, so it might take a while. Your patches obviously offer a
solution only for KSZ switches, we need something more general. If I
understand your solution, it works not by virtue of there being any
shutdown ordering guarantee at all, but simply due to the fact that
DSA's .shutdown hook gets called eventually, and the reference to the
master gets freed eventually, which unblocks the unregister_netdevice
call from the master. I don't yet understand why DSA holds a long-term
reference to the master, that's one thing I need to figure out.
+Saravana,
On 9/9/2021 8:47 AM, Vladimir Oltean wrote:
> On Thu, Sep 09, 2021 at 03:19:52PM +0200, Lino Sanfilippo wrote:
>>> Do you see similar things on your 5.10 kernel?
>>
>> For the master device is see
>>
>> lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
>
> So this is the worst of the worst, we have a device link but it doesn't help.
>
> Where the device link helps is here:
>
> __device_release_driver
> while (device_links_busy(dev))
> device_links_unbind_consumers(dev);
>
> but during dev_shutdown, device_links_unbind_consumers does not get called
> (actually I am not even sure whether it should).
>
> I've reproduced your issue by making this very simple change:
>
> diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> index 60d94e0a07d6..ec00f34cac47 100644
> --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> @@ -1372,6 +1372,7 @@ static struct pci_driver enetc_pf_driver = {
> .id_table = enetc_pf_id_table,
> .probe = enetc_pf_probe,
> .remove = enetc_pf_remove,
> + .shutdown = enetc_pf_remove,
> #ifdef CONFIG_PCI_IOV
> .sriov_configure = enetc_sriov_configure,
> #endif
>
> on my DSA master driver. This is what the genet driver has "special".
>
> I was led into grave error by Documentation/driver-api/device_link.rst,
> which I've based my patch on, where it clearly says that device links
> are supposed to help with shutdown ordering (how?!).
I was also under the impression that device links were supposed to help
with shutdown ordering, because it does matter a lot. One thing that I
had to work before (and seems like it came back recently) is the
shutdown ordering between gpio_keys.c and the GPIO controller. If you
suspend the GPIO controller first, gpio_keys.c never gets a chance to
keep the GPIO pin configured for a wake-up interrupt, therefore no
wake-up event happens on key presses, whoops.
>
> So the question is, why did my DSA trees get torn down on shutdown?
> Basically the short answer is that my SPI controller driver does
> implement .shutdown, and calls the same code path as the .remove code,
> which calls spi_unregister_controller which removes all SPI children..
>
> When I added this device link, one of the main objectives was to not
> modify all DSA drivers. I was certain based on the documentation that
> device links would help, now I'm not so sure anymore.
>
> So what happens is that the DSA master attempts to unregister its net
> device on .shutdown, but DSA does not implement .shutdown, so it just
> sits there holding a reference (supposedly via dev_hold, but where from?!)
> to the master, which makes netdev_wait_allrefs to wait and wait.
It's not coming from of_find_net_device_by_node() that's for sure and
with OF we don't go through the code path calling
dsa_dev_to_net_device() which does call dev_hold() and then shortly
thereafter the caller calls dev_put() anyway.
>
> I need more time for the denial phase to pass, and to understand what
> can actually be done. I will also be away from the keyboard for the next
> few days, so it might take a while. Your patches obviously offer a
> solution only for KSZ switches, we need something more general. If I
> understand your solution, it works not by virtue of there being any
> shutdown ordering guarantee at all, but simply due to the fact that
> DSA's .shutdown hook gets called eventually, and the reference to the
> master gets freed eventually, which unblocks the unregister_netdevice
> call from the master. I don't yet understand why DSA holds a long-term
> reference to the master, that's one thing I need to figure out.
>
Agreed.
--
Florian
On 09.09.21 at 17:47, Vladimir Oltean wrote:
> On Thu, Sep 09, 2021 at 03:19:52PM +0200, Lino Sanfilippo wrote:
>>> Do you see similar things on your 5.10 kernel?
>>
>> For the master device is see
>>
>> lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
>
> So this is the worst of the worst, we have a device link but it doesn't help.
>
> Where the device link helps is here:
>
> __device_release_driver
> while (device_links_busy(dev))
> device_links_unbind_consumers(dev);
>
> but during dev_shutdown, device_links_unbind_consumers does not get called
> (actually I am not even sure whether it should).
>
> I've reproduced your issue by making this very simple change:
>
> diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> index 60d94e0a07d6..ec00f34cac47 100644
> --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> @@ -1372,6 +1372,7 @@ static struct pci_driver enetc_pf_driver = {
> .id_table = enetc_pf_id_table,
> .probe = enetc_pf_probe,
> .remove = enetc_pf_remove,
> + .shutdown = enetc_pf_remove,
> #ifdef CONFIG_PCI_IOV
> .sriov_configure = enetc_sriov_configure,
> #endif
>
> on my DSA master driver. This is what the genet driver has "special".
>
Ah, that is interesting.
> I was led into grave error by Documentation/driver-api/device_link.rst,
> which I've based my patch on, where it clearly says that device links
> are supposed to help with shutdown ordering (how?!).
>
> So the question is, why did my DSA trees get torn down on shutdown?
> Basically the short answer is that my SPI controller driver does
> implement .shutdown, and calls the same code path as the .remove code,
> which calls spi_unregister_controller which removes all SPI children..
>
> When I added this device link, one of the main objectives was to not
> modify all DSA drivers. I was certain based on the documentation that
> device links would help, now I'm not so sure anymore.
>
> So what happens is that the DSA master attempts to unregister its net
> device on .shutdown, but DSA does not implement .shutdown, so it just
> sits there holding a reference (supposedly via dev_hold, but where from?!)
> to the master, which makes netdev_wait_allrefs to wait and wait.
>
Right, that was also my conclusion.
> I need more time for the denial phase to pass, and to understand what
> can actually be done. I will also be away from the keyboard for the next
> few days, so it might take a while. Your patches obviously offer a
> solution only for KSZ switches, we need something more general. If I
> understand your solution, it works not by virtue of there being any
> shutdown ordering guarantee at all, but simply due to the fact that
> DSA's .shutdown hook gets called eventually, and the reference to the
> master gets freed eventually, which unblocks the unregister_netdevice
> call from the master.
Well actually the SPI shutdown hook gets called which then calls ksz9477_shutdown
(formerly ksz9477_reset_switch) which then shuts down the switch by
stopping the worker thread and tearing down the DSA tree (via dsa_tree_shutdown()).
While it is right that the patch series only fixes the KSZ case for now, the idea was that
other drivers could use a similar approach in by calling the new function dsa_tree_shutdown()
in their shutdown handler to make sure that all refs to the master device are released.
Regards,
Lino
On 09.09.21 at 17:17, Andrew Lunn wrote:
>> This is not correct. The kernel I use right now is based on Gregs stable linux-5.10.y.
>> The commit number is correct here. Sorry for the confusion.
>
> Can you use 5.14.2?
>
> When we understand the problem, the fixes will need to be for
> net-next, which will be based on 5.15-rcX. They will then be
> backported to 5.10. So you need to do some testing on a newer
> kernel. Such testing will also help us figure out if it is a new
> problem, or a backporting problem.
>
> Andrew
>
You are right, I will try to switch to a newer kernel to test future DSA issue like that
(I have the impression that DSA is an area in which a lot of progress is done from
one kernel version to the next).
Regards,
Lino
On 9/9/2021 9:37 AM, Lino Sanfilippo wrote:
> On 09.09.21 at 17:47, Vladimir Oltean wrote:
>> On Thu, Sep 09, 2021 at 03:19:52PM +0200, Lino Sanfilippo wrote:
>>>> Do you see similar things on your 5.10 kernel?
>>>
>>> For the master device is see
>>>
>>> lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
>>
>> So this is the worst of the worst, we have a device link but it doesn't help.
>>
>> Where the device link helps is here:
>>
>> __device_release_driver
>> while (device_links_busy(dev))
>> device_links_unbind_consumers(dev);
>>
>> but during dev_shutdown, device_links_unbind_consumers does not get called
>> (actually I am not even sure whether it should).
>>
>> I've reproduced your issue by making this very simple change:
>>
>> diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
>> index 60d94e0a07d6..ec00f34cac47 100644
>> --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
>> +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
>> @@ -1372,6 +1372,7 @@ static struct pci_driver enetc_pf_driver = {
>> .id_table = enetc_pf_id_table,
>> .probe = enetc_pf_probe,
>> .remove = enetc_pf_remove,
>> + .shutdown = enetc_pf_remove,
>> #ifdef CONFIG_PCI_IOV
>> .sriov_configure = enetc_sriov_configure,
>> #endif
>>
>> on my DSA master driver. This is what the genet driver has "special".
>>
>
> Ah, that is interesting.
>
>> I was led into grave error by Documentation/driver-api/device_link.rst,
>> which I've based my patch on, where it clearly says that device links
>> are supposed to help with shutdown ordering (how?!).
>>
>> So the question is, why did my DSA trees get torn down on shutdown?
>> Basically the short answer is that my SPI controller driver does
>> implement .shutdown, and calls the same code path as the .remove code,
>> which calls spi_unregister_controller which removes all SPI children..
>>
>> When I added this device link, one of the main objectives was to not
>> modify all DSA drivers. I was certain based on the documentation that
>> device links would help, now I'm not so sure anymore.
>>
>> So what happens is that the DSA master attempts to unregister its net
>> device on .shutdown, but DSA does not implement .shutdown, so it just
>> sits there holding a reference (supposedly via dev_hold, but where from?!)
>> to the master, which makes netdev_wait_allrefs to wait and wait.
>>
>
> Right, that was also my conclusion.
>
>> I need more time for the denial phase to pass, and to understand what
>> can actually be done. I will also be away from the keyboard for the next
>> few days, so it might take a while. Your patches obviously offer a
>> solution only for KSZ switches, we need something more general. If I
>> understand your solution, it works not by virtue of there being any
>> shutdown ordering guarantee at all, but simply due to the fact that
>> DSA's .shutdown hook gets called eventually, and the reference to the
>> master gets freed eventually, which unblocks the unregister_netdevice
>> call from the master.
>
> Well actually the SPI shutdown hook gets called which then calls ksz9477_shutdown
> (formerly ksz9477_reset_switch) which then shuts down the switch by
> stopping the worker thread and tearing down the DSA tree (via dsa_tree_shutdown()).
>
> While it is right that the patch series only fixes the KSZ case for now, the idea was that
> other drivers could use a similar approach in by calling the new function dsa_tree_shutdown()
> in their shutdown handler to make sure that all refs to the master device are released.
It does not scale really well to have individual drivers call
dsa_tree_shutdown() in their respective .shutdown callback, and in a
multi-switch configuration, I am not sure what the results would look like.
In premise, each driver ought to be able to call
dsa_unregister_switch(), along with all of the driver specific shutdown
and eventually, given proper device ordering the DSA tree would get
automatically torn down, and then the DSA master's .shutdown() callback
would be called.
FWIW, the reason why we call .shutdown() in bcmgenet is to turn off DMA
and clocks, which matters for kexec (DMA) as well as power savings (S5
mode).
--
Florian
On 09.09.21 at 17:11, Andrew Lunn wrote:
>> Andrew: the switch is not on a hat, the device tree part I use is:
>
> And this is not an overlay. It is all there at boot?
>
Well actually we DO use an overlay. The dev tree snipped I posted was an excerpt form
fdtdump. The concerning fragment looks like this in the overlay file:
fragment@4 {
target = <&spi6>;
__overlay__ {
#address-cells = <1>;
#size-cells = <0>;
pinctrl-names = "default";
pinctrl-0 = <&spi6_pins>, <&spi6_cs_pins>;
cs-gpios = <&gpio 16 GPIO_ACTIVE_LOW>,
<&gpio 18 GPIO_ACTIVE_LOW>;
status = "okay";
ksz9897: ksz9897@0 {
compatible = "microchip,ksz9897";
reg = <0>;
spi-max-frequency = <50000000>;
spi-cpha;
spi-cpol;
reset-gpios = <&expander_core 13 GPIO_ACTIVE_LOW>;
status = "okay";
ports {
#address-cells = <1>;
#size-cells = <0>;
/* PORT 1 */
port@0 {
reg = <0>;
label = "cpu";
ethernet = <&genet>;
};
/* PORT 2 */
port@1 {
reg = <1>;
label = "pileft";
};
/* PORT 3 */
port@2 {
reg = <2>;
label = "piright";
};
/*
* PORT 4-7 unused
*/
};
};
tpm: tpm@1 {
compatible = "infineon,slb9670";
pinctrl-names = "default";
pinctrl-0 = <&tpm_pins>;
reg = <1>;
spi-max-frequency = <1000000>;
interrupt-parent = <&gpio>;
#interrupt-cells = <2>;
interrupts = <10 IRQ_TYPE_LEVEL_LOW>;
status = "okay";
};
};
};
But probably this does not matter any more now
that Vladimir was able to reproduce the issue.
> I was just thinking that maybe the Ethernet interface gets opened at
> boot, and overlay is loaded, and the interface is opened a second
> time. I don't know of anybody using DSA with overlays, so that could
> of been the key difference which breaks it for you.
>
> Your decompiled DT blob looks O.K.
>
> Andrew
>
On 09.09.21 at 18:44, Florian Fainelli wrote:
>
>
> On 9/9/2021 9:37 AM, Lino Sanfilippo wrote:
>> On 09.09.21 at 17:47, Vladimir Oltean wrote:
>>> On Thu, Sep 09, 2021 at 03:19:52PM +0200, Lino Sanfilippo wrote:
>>>>> Do you see similar things on your 5.10 kernel?
>>>>
>>>> For the master device is see
>>>>
>>>> lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
>>>
>>> So this is the worst of the worst, we have a device link but it doesn't help.
>>>
>>> Where the device link helps is here:
>>>
>>> __device_release_driver
>>> while (device_links_busy(dev))
>>> device_links_unbind_consumers(dev);
>>>
>>> but during dev_shutdown, device_links_unbind_consumers does not get called
>>> (actually I am not even sure whether it should).
>>>
>>> I've reproduced your issue by making this very simple change:
>>>
>>> diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
>>> index 60d94e0a07d6..ec00f34cac47 100644
>>> --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
>>> +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
>>> @@ -1372,6 +1372,7 @@ static struct pci_driver enetc_pf_driver = {
>>> .id_table = enetc_pf_id_table,
>>> .probe = enetc_pf_probe,
>>> .remove = enetc_pf_remove,
>>> + .shutdown = enetc_pf_remove,
>>> #ifdef CONFIG_PCI_IOV
>>> .sriov_configure = enetc_sriov_configure,
>>> #endif
>>>
>>> on my DSA master driver. This is what the genet driver has "special".
>>>
>>
>> Ah, that is interesting.
>>
>>> I was led into grave error by Documentation/driver-api/device_link.rst,
>>> which I've based my patch on, where it clearly says that device links
>>> are supposed to help with shutdown ordering (how?!).
>>>
>>> So the question is, why did my DSA trees get torn down on shutdown?
>>> Basically the short answer is that my SPI controller driver does
>>> implement .shutdown, and calls the same code path as the .remove code,
>>> which calls spi_unregister_controller which removes all SPI children..
>>>
>>> When I added this device link, one of the main objectives was to not
>>> modify all DSA drivers. I was certain based on the documentation that
>>> device links would help, now I'm not so sure anymore.
>>>
>>> So what happens is that the DSA master attempts to unregister its net
>>> device on .shutdown, but DSA does not implement .shutdown, so it just
>>> sits there holding a reference (supposedly via dev_hold, but where from?!)
>>> to the master, which makes netdev_wait_allrefs to wait and wait.
>>>
>>
>> Right, that was also my conclusion.
>>
>>> I need more time for the denial phase to pass, and to understand what
>>> can actually be done. I will also be away from the keyboard for the next
>>> few days, so it might take a while. Your patches obviously offer a
>>> solution only for KSZ switches, we need something more general. If I
>>> understand your solution, it works not by virtue of there being any
>>> shutdown ordering guarantee at all, but simply due to the fact that
>>> DSA's .shutdown hook gets called eventually, and the reference to the
>>> master gets freed eventually, which unblocks the unregister_netdevice
>>> call from the master.
>>
>> Well actually the SPI shutdown hook gets called which then calls ksz9477_shutdown
>> (formerly ksz9477_reset_switch) which then shuts down the switch by
>> stopping the worker thread and tearing down the DSA tree (via dsa_tree_shutdown()).
>>
>> While it is right that the patch series only fixes the KSZ case for now, the idea was that
>> other drivers could use a similar approach in by calling the new function dsa_tree_shutdown()
>> in their shutdown handler to make sure that all refs to the master device are released.
> It does not scale really well to have individual drivers call dsa_tree_shutdown() in their respective .shutdown callback, and in a multi-switch configuration, I am not sure what the results would look like.
>
> In premise, each driver ought to be able to call dsa_unregister_switch(), along with all of the driver specific shutdown and eventually, given proper device ordering the DSA tree would get automatically torn down, and then the DSA master's .shutdown() callback would be called.
>
> FWIW, the reason why we call .shutdown() in bcmgenet is to turn off DMA and clocks, which matters for kexec (DMA) as well as power savings (S5 mode).
I agree with the scalability. Concerning the multi-switch case I dont know about the possible issues (I am quite new to working with DSA).
So lets wait for Vladimirs solution.
Regards,
Lino
On Thu, Sep 09, 2021 at 06:46:49PM +0200, Lino Sanfilippo wrote:
> On 09.09.21 at 17:11, Andrew Lunn wrote:
> >> Andrew: the switch is not on a hat, the device tree part I use is:
> >
> > And this is not an overlay. It is all there at boot?
> >
>
> Well actually we DO use an overlay. The dev tree snipped I posted was an excerpt form
> fdtdump. The concerning fragment looks like this in the overlay file:
Thanks for the information. Good to know somebody is using DSA like
this. The device tree description can be quite complex, especially for
some of the other switches.
> But probably this does not matter any more now that Vladimir was
> able to reproduce the issue.
Agreed.
Andrew
On Thu, Sep 09, 2021 at 07:07:33PM +0200, Lino Sanfilippo wrote:
> > It does not scale really well to have individual drivers call
> > dsa_tree_shutdown() in their respective .shutdown callback, and in a
> > multi-switch configuration, I am not sure what the results would
> > look like.
> >
> > In premise, each driver ought to be able to call
> > dsa_unregister_switch(), along with all of the driver specific
> > shutdown and eventually, given proper device ordering the DSA tree
> > would get automatically torn down, and then the DSA master's
> > .shutdown() callback would be called.
> >
> > FWIW, the reason why we call .shutdown() in bcmgenet is to turn off
> > DMA and clocks, which matters for kexec (DMA) as well as power
> > savings (S5 mode).
>
> I agree with the scalability. Concerning the multi-switch case I dont
> know about the possible issues (I am quite new to working with DSA).
> So lets wait for Vladimirs solution.
I'm back for now and was able to spend a bit more time and understand
what is happening.
So first things first: why does DSA call dev_hold long-term on the
master, and where from?
Answer: it does so since commit 2f1e8ea726e9 ("net: dsa: link interfaces
with the DSA master to get rid of lockdep warnings"), see this call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
Ok, so since DSA holds a reference to the master interface, it is
natural that unregister_netdevice() will not finish, and it will hang
the system.
Question 2: why does bcmgenet need to unregister the net device on
shutdown?
See Florian's answer, it doesn't, strictly speaking, it just needs to
turn off the DMA and some clocks.
Question 3: can we revert commit 2f1e8ea726e9?
Answer: not so easily, we are looking at >10 commits to revert, and find
other solutions to some problems. We have built in the meantime on top
of the fact that there is an upper/lower relationship between DSA user
ports and the DSA master.
Question 4: how do other stacked interfaces deal with this?
Answer: as I said in the commit message of 2f1e8ea726e9, DSA is not
VLAN, DSA has unique challenges of its own, like a tree of struct
devices to manage, with their own lifetime. So what other drivers do is
not really relevant. Anyway, to entertain the question: VLAN watches the
NETDEV_UNREGISTER event emitted on the netdev notifier chain for its
real_dev, and effectively unregisters itself. Now this is exactly why it
is irrelevant, we can watch for NETDEV_UNREGISTER on the DSA master, but
then what? There is nothing sensible to do. Consider that in the master
unbind case (not shutdown), both the NETDEV_UNREGISTER code path will
execute, and the unbind of the DSA switch itself, due to that device
link. But let's say we delete the device link and leave only the
NETDEV_UNREGISTER code path to do something. What?
device_release_driver(ds->dev), most probably. That would effectively
force the DSA unbind path. But surprise: the DSA unbind path takes the
rtnl_mutex from quite a couple of places, and we are already under the
rtnl_lock (held by the netdev notifier chain). So, unless we schedule
the DSA device driver detach, there is an impending deadlock.
Ok, let's entertain even that: detach the DSA driver in a scheduled work
item, with the rtnl_lock not held. First off, we will trigger again the
WARN_ON solved by commit 2f1e8ea726e9 (because the unregistering of the
DSA master has "completed", but it still has an upper interface - us),
and secondly, the unregister_netdev function will have already deleted
stuff belonging to the DSA master, namely its sysfs entries. But DSA
also touches the master's sysfs, namely the "tagging" file. So NULL
pointer dereference on the master's sysfs.
So very simply put, DSA cannot unbind itself from the switch device when
the master net device unregisters. The best case scenario would be for
DSA to unbind _before_ the net device even unregisters. That was the
whole point of my attempt with the device links, to ensure shutdown
_ordering_.
Question 5: can the device core actually be patched to call
device_links_unbind_consumers() from device_shutdown()? This would
actually simplify DSA's options, and make the device links live up to
their documented expectations.
Answer: yes and no, technically it can, but it is an invasive change
which will certainly introduce regressions. See the answer to question 2
for an example. Technically .shutdown exists so that drivers can do
something lightweight to quiesce the hardware, without really caring too
much about data structure integrity (hey, the kernel is going to die
soon anyway). But some drivers, like bcmgenet, do the same thing in
.resume and .shutdown, which blurs the lines quite a lot. If the device
links were to start calling .remove at shutdown time, potentially after
.shutdown was already called, bcmgenet would effectively unregister its
net device twice. Yikes.
Question 6: How about a patch on the device core that is more lightweight?
Wouldn't it be sensible for device_shutdown() to just call ->remove if
the device's bus has no ->shutdown, and the device's driver doesn't have
a ->shutdown either?
Answer: This would sometimes work, the vast majority of DSA switch
drivers, and Ethernet controllers (in this case used as DSA masters) do
not have a .shutdown method implemented. But their bus does: PCI does,
SPI controllers do, most of the time. So it would work for limited
scenarios, but would be ineffective in the general sense.
Question 7: I said that .shutdown, as opposed to .remove, doesn't really
care so much about the integrity of data structures. So how far should
we really go to fix this issue? Should we even bother to unbind the
whole DSA tree, when the sole problem is that we are the DSA master's
upper, and that is keeping a reference on it?
Answer: Well, any solution that does unnecessary data structure teardown
only delays the reboot for nothing. Lino's patch just bluntly calls
dsa_tree_teardown() from the switch .shutdown method, and this leaks
memory, namely dst->ports. But does this really matter? Nope, so let's
extrapolate. In this case, IMO, the simplest possible solution would be
to patch bcmgenet to not unregister the net device. Then treat every
other DSA master driver in the same way as they come, one by one.
Do you need to unregister_netdevice() at shutdown? No. Then don't.
Is it nice? Probably not, but I'm not seeing alternatives.
Also, unless I'm missing something, Lino probably still sees the WARN_ON
in bcmgenet's unregister_netdevice() about eth0 getting unregistered
while having an upper interface. If not, it's by sheer luck that the DSA
switch's ->shutdown gets called before bcmgenet's ->shutdown. But for
this reason, it isn't a great solution either. If the device links can't
guarantee us some sort of shutdown ordering (what we ideally want, as
mentioned, is for the DSA switch driver to get _unbound_ (->remove)
before the DSA master gets unbound or shut down).
On Fri, Sep 10, 2021 at 01:54:57AM +0300, Vladimir Oltean wrote:
> On Thu, Sep 09, 2021 at 07:07:33PM +0200, Lino Sanfilippo wrote:
> > > It does not scale really well to have individual drivers call
> > > dsa_tree_shutdown() in their respective .shutdown callback, and in a
> > > multi-switch configuration, I am not sure what the results would
> > > look like.
> > >
> > > In premise, each driver ought to be able to call
> > > dsa_unregister_switch(), along with all of the driver specific
> > > shutdown and eventually, given proper device ordering the DSA tree
> > > would get automatically torn down, and then the DSA master's
> > > .shutdown() callback would be called.
> > >
> > > FWIW, the reason why we call .shutdown() in bcmgenet is to turn off
> > > DMA and clocks, which matters for kexec (DMA) as well as power
> > > savings (S5 mode).
> >
> > I agree with the scalability. Concerning the multi-switch case I dont
> > know about the possible issues (I am quite new to working with DSA).
> > So lets wait for Vladimirs solution.
>
> I'm back for now and was able to spend a bit more time and understand
> what is happening.
>
> So first things first: why does DSA call dev_hold long-term on the
> master, and where from?
>
> Answer: it does so since commit 2f1e8ea726e9 ("net: dsa: link interfaces
> with the DSA master to get rid of lockdep warnings"), see this call path:
>
> dsa_slave_create
> -> netdev_upper_dev_link
> -> __netdev_upper_dev_link
> -> __netdev_adjacent_dev_insert
> -> dev_hold
>
> Ok, so since DSA holds a reference to the master interface, it is
> natural that unregister_netdevice() will not finish, and it will hang
> the system.
>
> Question 2: why does bcmgenet need to unregister the net device on
> shutdown?
>
> See Florian's answer, it doesn't, strictly speaking, it just needs to
> turn off the DMA and some clocks.
>
> Question 3: can we revert commit 2f1e8ea726e9?
>
> Answer: not so easily, we are looking at >10 commits to revert, and find
> other solutions to some problems. We have built in the meantime on top
> of the fact that there is an upper/lower relationship between DSA user
> ports and the DSA master.
>
> Question 4: how do other stacked interfaces deal with this?
>
> Answer: as I said in the commit message of 2f1e8ea726e9, DSA is not
> VLAN, DSA has unique challenges of its own, like a tree of struct
> devices to manage, with their own lifetime. So what other drivers do is
> not really relevant. Anyway, to entertain the question: VLAN watches the
> NETDEV_UNREGISTER event emitted on the netdev notifier chain for its
> real_dev, and effectively unregisters itself. Now this is exactly why it
> is irrelevant, we can watch for NETDEV_UNREGISTER on the DSA master, but
> then what? There is nothing sensible to do. Consider that in the master
> unbind case (not shutdown), both the NETDEV_UNREGISTER code path will
> execute, and the unbind of the DSA switch itself, due to that device
> link. But let's say we delete the device link and leave only the
> NETDEV_UNREGISTER code path to do something. What?
> device_release_driver(ds->dev), most probably. That would effectively
> force the DSA unbind path. But surprise: the DSA unbind path takes the
> rtnl_mutex from quite a couple of places, and we are already under the
> rtnl_lock (held by the netdev notifier chain). So, unless we schedule
> the DSA device driver detach, there is an impending deadlock.
> Ok, let's entertain even that: detach the DSA driver in a scheduled work
> item, with the rtnl_lock not held. First off, we will trigger again the
> WARN_ON solved by commit 2f1e8ea726e9 (because the unregistering of the
> DSA master has "completed", but it still has an upper interface - us),
> and secondly, the unregister_netdev function will have already deleted
> stuff belonging to the DSA master, namely its sysfs entries. But DSA
> also touches the master's sysfs, namely the "tagging" file. So NULL
> pointer dereference on the master's sysfs.
> So very simply put, DSA cannot unbind itself from the switch device when
> the master net device unregisters. The best case scenario would be for
> DSA to unbind _before_ the net device even unregisters. That was the
> whole point of my attempt with the device links, to ensure shutdown
> _ordering_.
>
> Question 5: can the device core actually be patched to call
> device_links_unbind_consumers() from device_shutdown()? This would
> actually simplify DSA's options, and make the device links live up to
> their documented expectations.
>
> Answer: yes and no, technically it can, but it is an invasive change
> which will certainly introduce regressions. See the answer to question 2
> for an example. Technically .shutdown exists so that drivers can do
> something lightweight to quiesce the hardware, without really caring too
> much about data structure integrity (hey, the kernel is going to die
> soon anyway). But some drivers, like bcmgenet, do the same thing in
> .resume and .shutdown, which blurs the lines quite a lot. If the device
> links were to start calling .remove at shutdown time, potentially after
> .shutdown was already called, bcmgenet would effectively unregister its
> net device twice. Yikes.
>
> Question 6: How about a patch on the device core that is more lightweight?
> Wouldn't it be sensible for device_shutdown() to just call ->remove if
> the device's bus has no ->shutdown, and the device's driver doesn't have
> a ->shutdown either?
>
> Answer: This would sometimes work, the vast majority of DSA switch
> drivers, and Ethernet controllers (in this case used as DSA masters) do
> not have a .shutdown method implemented. But their bus does: PCI does,
> SPI controllers do, most of the time. So it would work for limited
> scenarios, but would be ineffective in the general sense.
>
> Question 7: I said that .shutdown, as opposed to .remove, doesn't really
> care so much about the integrity of data structures. So how far should
> we really go to fix this issue? Should we even bother to unbind the
> whole DSA tree, when the sole problem is that we are the DSA master's
> upper, and that is keeping a reference on it?
>
> Answer: Well, any solution that does unnecessary data structure teardown
> only delays the reboot for nothing. Lino's patch just bluntly calls
> dsa_tree_teardown() from the switch .shutdown method, and this leaks
> memory, namely dst->ports. But does this really matter? Nope, so let's
> extrapolate. In this case, IMO, the simplest possible solution would be
> to patch bcmgenet to not unregister the net device. Then treat every
> other DSA master driver in the same way as they come, one by one.
> Do you need to unregister_netdevice() at shutdown? No. Then don't.
> Is it nice? Probably not, but I'm not seeing alternatives.
>
> Also, unless I'm missing something, Lino probably still sees the WARN_ON
> in bcmgenet's unregister_netdevice() about eth0 getting unregistered
> while having an upper interface. If not, it's by sheer luck that the DSA
> switch's ->shutdown gets called before bcmgenet's ->shutdown. But for
> this reason, it isn't a great solution either. If the device links can't
> guarantee us some sort of shutdown ordering (what we ideally want, as
> mentioned, is for the DSA switch driver to get _unbound_ (->remove)
> before the DSA master gets unbound or shut down).
I forgot about this, for completeness:
Question 8: Ok, so this is an even more lightweight variant of question 6.
To patch device_shutdown here:
if (dev->bus && dev->bus->shutdown) {
if (initcall_debug)
dev_info(dev, "shutdown\n");
dev->bus->shutdown(dev);
} else if (dev->driver && dev->driver->shutdown) {
if (initcall_debug)
dev_info(dev, "shutdown\n");
dev->driver->shutdown(dev);
+ } else {
+ __device_release_driver(dev, parent);
}
would go towards helping DSA in general, but it wouldn't help the situation at hand,
and it would introduce regressions.
So what about patching bcmgenet (and other drivers) to implement .shutdown in the following way:
device_release_driver(&pdev->dev);
basically this should force-unbind the driver from the device, which
would quite nicely make the device link go into action and make DSA
unbind too.
Answer: device_release_driver calls device_lock(dev), and device_shutdown
also holds that lock when it calls our ->shutdown method. So unless the
device core would be so nice so as to provide a generic shutdown method
that just unbinds the device using an unlocked version of device_release_driver,
we are back to square one with this solution. Anything that needs to patch
the device core is more or less disqualified, especially for a bug fix.
On Fri, Sep 10, 2021 at 02:23:58AM +0300, Vladimir Oltean wrote:
> On Fri, Sep 10, 2021 at 01:54:57AM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 09, 2021 at 07:07:33PM +0200, Lino Sanfilippo wrote:
> > > > It does not scale really well to have individual drivers call
> > > > dsa_tree_shutdown() in their respective .shutdown callback, and in a
> > > > multi-switch configuration, I am not sure what the results would
> > > > look like.
> > > >
> > > > In premise, each driver ought to be able to call
> > > > dsa_unregister_switch(), along with all of the driver specific
> > > > shutdown and eventually, given proper device ordering the DSA tree
> > > > would get automatically torn down, and then the DSA master's
> > > > .shutdown() callback would be called.
> > > >
> > > > FWIW, the reason why we call .shutdown() in bcmgenet is to turn off
> > > > DMA and clocks, which matters for kexec (DMA) as well as power
> > > > savings (S5 mode).
> > >
> > > I agree with the scalability. Concerning the multi-switch case I dont
> > > know about the possible issues (I am quite new to working with DSA).
> > > So lets wait for Vladimirs solution.
> >
> > I'm back for now and was able to spend a bit more time and understand
> > what is happening.
> >
> > So first things first: why does DSA call dev_hold long-term on the
> > master, and where from?
> >
> > Answer: it does so since commit 2f1e8ea726e9 ("net: dsa: link interfaces
> > with the DSA master to get rid of lockdep warnings"), see this call path:
> >
> > dsa_slave_create
> > -> netdev_upper_dev_link
> > -> __netdev_upper_dev_link
> > -> __netdev_adjacent_dev_insert
> > -> dev_hold
> >
> > Ok, so since DSA holds a reference to the master interface, it is
> > natural that unregister_netdevice() will not finish, and it will hang
> > the system.
> >
> > Question 2: why does bcmgenet need to unregister the net device on
> > shutdown?
> >
> > See Florian's answer, it doesn't, strictly speaking, it just needs to
> > turn off the DMA and some clocks.
> >
> > Question 3: can we revert commit 2f1e8ea726e9?
> >
> > Answer: not so easily, we are looking at >10 commits to revert, and find
> > other solutions to some problems. We have built in the meantime on top
> > of the fact that there is an upper/lower relationship between DSA user
> > ports and the DSA master.
> >
> > Question 4: how do other stacked interfaces deal with this?
> >
> > Answer: as I said in the commit message of 2f1e8ea726e9, DSA is not
> > VLAN, DSA has unique challenges of its own, like a tree of struct
> > devices to manage, with their own lifetime. So what other drivers do is
> > not really relevant. Anyway, to entertain the question: VLAN watches the
> > NETDEV_UNREGISTER event emitted on the netdev notifier chain for its
> > real_dev, and effectively unregisters itself. Now this is exactly why it
> > is irrelevant, we can watch for NETDEV_UNREGISTER on the DSA master, but
> > then what? There is nothing sensible to do. Consider that in the master
> > unbind case (not shutdown), both the NETDEV_UNREGISTER code path will
> > execute, and the unbind of the DSA switch itself, due to that device
> > link. But let's say we delete the device link and leave only the
> > NETDEV_UNREGISTER code path to do something. What?
> > device_release_driver(ds->dev), most probably. That would effectively
> > force the DSA unbind path. But surprise: the DSA unbind path takes the
> > rtnl_mutex from quite a couple of places, and we are already under the
> > rtnl_lock (held by the netdev notifier chain). So, unless we schedule
> > the DSA device driver detach, there is an impending deadlock.
> > Ok, let's entertain even that: detach the DSA driver in a scheduled work
> > item, with the rtnl_lock not held. First off, we will trigger again the
> > WARN_ON solved by commit 2f1e8ea726e9 (because the unregistering of the
> > DSA master has "completed", but it still has an upper interface - us),
> > and secondly, the unregister_netdev function will have already deleted
> > stuff belonging to the DSA master, namely its sysfs entries. But DSA
> > also touches the master's sysfs, namely the "tagging" file. So NULL
> > pointer dereference on the master's sysfs.
> > So very simply put, DSA cannot unbind itself from the switch device when
> > the master net device unregisters. The best case scenario would be for
> > DSA to unbind _before_ the net device even unregisters. That was the
> > whole point of my attempt with the device links, to ensure shutdown
> > _ordering_.
> >
> > Question 5: can the device core actually be patched to call
> > device_links_unbind_consumers() from device_shutdown()? This would
> > actually simplify DSA's options, and make the device links live up to
> > their documented expectations.
> >
> > Answer: yes and no, technically it can, but it is an invasive change
> > which will certainly introduce regressions. See the answer to question 2
> > for an example. Technically .shutdown exists so that drivers can do
> > something lightweight to quiesce the hardware, without really caring too
> > much about data structure integrity (hey, the kernel is going to die
> > soon anyway). But some drivers, like bcmgenet, do the same thing in
> > .resume and .shutdown, which blurs the lines quite a lot. If the device
> > links were to start calling .remove at shutdown time, potentially after
> > .shutdown was already called, bcmgenet would effectively unregister its
> > net device twice. Yikes.
> >
> > Question 6: How about a patch on the device core that is more lightweight?
> > Wouldn't it be sensible for device_shutdown() to just call ->remove if
> > the device's bus has no ->shutdown, and the device's driver doesn't have
> > a ->shutdown either?
> >
> > Answer: This would sometimes work, the vast majority of DSA switch
> > drivers, and Ethernet controllers (in this case used as DSA masters) do
> > not have a .shutdown method implemented. But their bus does: PCI does,
> > SPI controllers do, most of the time. So it would work for limited
> > scenarios, but would be ineffective in the general sense.
> >
> > Question 7: I said that .shutdown, as opposed to .remove, doesn't really
> > care so much about the integrity of data structures. So how far should
> > we really go to fix this issue? Should we even bother to unbind the
> > whole DSA tree, when the sole problem is that we are the DSA master's
> > upper, and that is keeping a reference on it?
> >
> > Answer: Well, any solution that does unnecessary data structure teardown
> > only delays the reboot for nothing. Lino's patch just bluntly calls
> > dsa_tree_teardown() from the switch .shutdown method, and this leaks
> > memory, namely dst->ports. But does this really matter? Nope, so let's
> > extrapolate. In this case, IMO, the simplest possible solution would be
> > to patch bcmgenet to not unregister the net device. Then treat every
> > other DSA master driver in the same way as they come, one by one.
> > Do you need to unregister_netdevice() at shutdown? No. Then don't.
> > Is it nice? Probably not, but I'm not seeing alternatives.
> >
> > Also, unless I'm missing something, Lino probably still sees the WARN_ON
> > in bcmgenet's unregister_netdevice() about eth0 getting unregistered
> > while having an upper interface. If not, it's by sheer luck that the DSA
> > switch's ->shutdown gets called before bcmgenet's ->shutdown. But for
> > this reason, it isn't a great solution either. If the device links can't
> > guarantee us some sort of shutdown ordering (what we ideally want, as
> > mentioned, is for the DSA switch driver to get _unbound_ (->remove)
> > before the DSA master gets unbound or shut down).
>
> I forgot about this, for completeness:
>
> Question 8: Ok, so this is an even more lightweight variant of question 6.
> To patch device_shutdown here:
>
> if (dev->bus && dev->bus->shutdown) {
> if (initcall_debug)
> dev_info(dev, "shutdown\n");
> dev->bus->shutdown(dev);
> } else if (dev->driver && dev->driver->shutdown) {
> if (initcall_debug)
> dev_info(dev, "shutdown\n");
> dev->driver->shutdown(dev);
> + } else {
> + __device_release_driver(dev, parent);
> }
>
> would go towards helping DSA in general, but it wouldn't help the situation at hand,
> and it would introduce regressions.
>
> So what about patching bcmgenet (and other drivers) to implement .shutdown in the following way:
>
> device_release_driver(&pdev->dev);
>
> basically this should force-unbind the driver from the device, which
> would quite nicely make the device link go into action and make DSA
> unbind too.
>
> Answer: device_release_driver calls device_lock(dev), and device_shutdown
> also holds that lock when it calls our ->shutdown method. So unless the
> device core would be so nice so as to provide a generic shutdown method
> that just unbinds the device using an unlocked version of device_release_driver,
> we are back to square one with this solution. Anything that needs to patch
> the device core is more or less disqualified, especially for a bug fix.
Question 9: can Lino's patch set be generalized to all DSA switches,
i.e. can all DSA drivers redirect their ->shutdown method to their
->remove method?
Answer: Apart from the fact that mdio_driver_register() does not provide
for a ->shutdown() method, which can be trivially addressed, we still
have issues. The fundamental problem is that some bus device drivers
implement ->shutdown as ->remove for their own device, see dspi_shutdown()
and dspi_remove() for an example.
When that happens, the DSA switch device attached to that bus will be
(a) once unbound from its driver (by dspi_shutdown -> dspi_remove ->
spi_unregister_controller -> device_for_each_child(...unregister), and
(b) once shut down (by its own ->shutdown method).
With the "ds" structure being naturally destroyed by the first call to
the ->remove function, a second call to ->remove is impossible to
succeed without tripping some NULL pointer, from the device's ->shutdown path.
Simply said, DSA is not designed to support an unbalanced number of
calls to dsa_register_switch and dsa_unregister_switch.
In fact, if Lino's ksz9897 switch was attached to the dspi driver for a
SPI controller, even his own patches would not work and result in this
unbalance of calls that I mentioned earlier. If we're going to fix it,
let's think of something that covers all cases at least.
Simply put, it looks like we need some guidance (again) from driver core
maintainers as to what should a ->suspend method _not_ do. Specifically,
if it is okay to redirect ->suspend to ->remove. It looks like in the
case of buses doing that, and child devices doing that too, the results
are not quite sane. And maybe that would give us some clue as to what to
do about the genet driver which does the same thing.
On Thu, Sep 9, 2021 at 9:00 AM Florian Fainelli <[email protected]> wrote:
>
> +Saravana,
>
> On 9/9/2021 8:47 AM, Vladimir Oltean wrote:
> > On Thu, Sep 09, 2021 at 03:19:52PM +0200, Lino Sanfilippo wrote:
> >>> Do you see similar things on your 5.10 kernel?
> >>
> >> For the master device is see
> >>
> >> lrwxrwxrwx 1 root root 0 Sep 9 14:10 /sys/class/net/eth0/device/consumer:spi:spi3.0 -> ../../../virtual/devlink/platform:fd580000.ethernet--spi:spi3.0
> >
> > So this is the worst of the worst, we have a device link but it doesn't help.
> >
> > Where the device link helps is here:
> >
> > __device_release_driver
> > while (device_links_busy(dev))
> > device_links_unbind_consumers(dev);
> >
> > but during dev_shutdown, device_links_unbind_consumers does not get called
> > (actually I am not even sure whether it should).
> >
> > I've reproduced your issue by making this very simple change:
> >
> > diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> > index 60d94e0a07d6..ec00f34cac47 100644
> > --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> > +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c
> > @@ -1372,6 +1372,7 @@ static struct pci_driver enetc_pf_driver = {
> > .id_table = enetc_pf_id_table,
> > .probe = enetc_pf_probe,
> > .remove = enetc_pf_remove,
> > + .shutdown = enetc_pf_remove,
> > #ifdef CONFIG_PCI_IOV
> > .sriov_configure = enetc_sriov_configure,
> > #endif
> >
> > on my DSA master driver. This is what the genet driver has "special".
> >
> > I was led into grave error by Documentation/driver-api/device_link.rst,
> > which I've based my patch on, where it clearly says that device links
> > are supposed to help with shutdown ordering (how?!).
>
> I was also under the impression that device links were supposed to help
> with shutdown ordering, because it does matter a lot. One thing that I
> had to work before (and seems like it came back recently) is the
> shutdown ordering between gpio_keys.c and the GPIO controller. If you
> suspend the GPIO controller first, gpio_keys.c never gets a chance to
> keep the GPIO pin configured for a wake-up interrupt, therefore no
> wake-up event happens on key presses, whoops.
This is more of a Rafael question. Adding him. I haven't looked too
closely at device links and shutdown.
-Saravana
>
> >
> > So the question is, why did my DSA trees get torn down on shutdown?
> > Basically the short answer is that my SPI controller driver does
> > implement .shutdown, and calls the same code path as the .remove code,
> > which calls spi_unregister_controller which removes all SPI children..
> >
> > When I added this device link, one of the main objectives was to not
> > modify all DSA drivers. I was certain based on the documentation that
> > device links would help, now I'm not so sure anymore.
> >
> > So what happens is that the DSA master attempts to unregister its net
> > device on .shutdown, but DSA does not implement .shutdown, so it just
> > sits there holding a reference (supposedly via dev_hold, but where from?!)
> > to the master, which makes netdev_wait_allrefs to wait and wait.
>
> It's not coming from of_find_net_device_by_node() that's for sure and
> with OF we don't go through the code path calling
> dsa_dev_to_net_device() which does call dev_hold() and then shortly
> thereafter the caller calls dev_put() anyway.
>
> >
> > I need more time for the denial phase to pass, and to understand what
> > can actually be done. I will also be away from the keyboard for the next
> > few days, so it might take a while. Your patches obviously offer a
> > solution only for KSZ switches, we need something more general. If I
> > understand your solution, it works not by virtue of there being any
> > shutdown ordering guarantee at all, but simply due to the fact that
> > DSA's .shutdown hook gets called eventually, and the reference to the
> > master gets freed eventually, which unblocks the unregister_netdevice
> > call from the master. I don't yet understand why DSA holds a long-term
> > reference to the master, that's one thing I need to figure out.
> >
>
> Agreed.
> --
> Florian
On 9/9/2021 3:54 PM, Vladimir Oltean wrote:
[snip]
> Question 6: How about a patch on the device core that is more lightweight?
> Wouldn't it be sensible for device_shutdown() to just call ->remove if
> the device's bus has no ->shutdown, and the device's driver doesn't have
> a ->shutdown either?
>
> Answer: This would sometimes work, the vast majority of DSA switch
> drivers, and Ethernet controllers (in this case used as DSA masters) do
> not have a .shutdown method implemented. But their bus does: PCI does,
> SPI controllers do, most of the time. So it would work for limited
> scenarios, but would be ineffective in the general sense.
Having wondered about that question as well, I don't really see a
compelling reason as to why we do not default to calling .remove() when
.shutdown() is not implemented. In almost all of the cases the semantics
of .remove() are superior to those required by .shutdown().
>
> Question 7: I said that .shutdown, as opposed to .remove, doesn't really
> care so much about the integrity of data structures. So how far should
> we really go to fix this issue? Should we even bother to unbind the
> whole DSA tree, when the sole problem is that we are the DSA master's
> upper, and that is keeping a reference on it?
>
> Answer: Well, any solution that does unnecessary data structure teardown
> only delays the reboot for nothing. Lino's patch just bluntly calls
> dsa_tree_teardown() from the switch .shutdown method, and this leaks
> memory, namely dst->ports. But does this really matter? Nope, so let's
> extrapolate. In this case, IMO, the simplest possible solution would be
> to patch bcmgenet to not unregister the net device. Then treat every
> other DSA master driver in the same way as they come, one by one.
> Do you need to unregister_netdevice() at shutdown? No. Then don't.
> Is it nice? Probably not, but I'm not seeing alternatives.
It does not really scale but we also don't have that many DSA masters to
support, I believe I can name them all: bcmgenet, stmmac, bcmsysport,
enetc, mv643xx_eth, cpsw, macb. If you want me to patch bcmgenet, give
me a few days to test and make sure there is no power management
regression, that's the primary concern I have.
>
> Also, unless I'm missing something, Lino probably still sees the WARN_ON
> in bcmgenet's unregister_netdevice() about eth0 getting unregistered
> while having an upper interface. If not, it's by sheer luck that the DSA
> switch's ->shutdown gets called before bcmgenet's ->shutdown. But for
> this reason, it isn't a great solution either. If the device links can't
> guarantee us some sort of shutdown ordering (what we ideally want, as
> mentioned, is for the DSA switch driver to get _unbound_ (->remove)
> before the DSA master gets unbound or shut down).
>
All of your questions are good and I don't have answers to any of them,
however I would like you and others to reason about .shutdown() not just
in the context of a reboot, or kexec'd kernel but also in the context of
putting the system into ACPI S5 (via poweroff). In that case the goal is
not only to quiesce the device, the goal is also to put it in a low
power mode.
For bcmgenet specifically the code path that leads to a driver remove is
well tested and is guaranteeing the network device registration, thus
putting the PHY into suspend, shutting down DMAs, turning off clocks.
This is a big hammer, but it gets the job done and does not introduce
yet another code path to test, it's the same as the module removal.
--
Florian
> It does not really scale but we also don't have that many DSA masters to
> support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> mv643xx_eth, cpsw, macb.
fec, mvneta, mvpp2, i210/igb.
Andrew
On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
> > It does not really scale but we also don't have that many DSA masters to
> > support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> > mv643xx_eth, cpsw, macb.
>
> fec, mvneta, mvpp2, i210/igb.
I can probably double that list only with Freescale/NXP Ethernet
drivers, some of which are not even submitted to mainline. To name some
mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
Also consider that DSA/switchdev drivers can also be DSA masters of
their own, we have boards doing that too.
Anyway, I've decided to at least try and accept the fact that DSA
masters will unregister their net_device on shutdown, and attempt to do
something sane for all DSA switches in that case.
Attached are two patches (they are fairly big so I won't paste them
inline, and I would like initial feedback before posting them to the
list).
As mentioned in those patches, the shutdown ordering guarantee is still
very important, I still have no clue what goes on there, what we need to
do, etc.
On Fri, Sep 10, 2021 at 05:58:52PM +0300, Vladimir Oltean wrote:
> On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
> > > It does not really scale but we also don't have that many DSA masters to
> > > support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> > > mv643xx_eth, cpsw, macb.
> >
> > fec, mvneta, mvpp2, i210/igb.
>
> I can probably double that list only with Freescale/NXP Ethernet
> drivers, some of which are not even submitted to mainline. To name some
> mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
> Also consider that DSA/switchdev drivers can also be DSA masters of
> their own, we have boards doing that too.
>
> Anyway, I've decided to at least try and accept the fact that DSA
> masters will unregister their net_device on shutdown, and attempt to do
> something sane for all DSA switches in that case.
>
> Attached are two patches (they are fairly big so I won't paste them
> inline, and I would like initial feedback before posting them to the
> list).
>
> As mentioned in those patches, the shutdown ordering guarantee is still
> very important, I still have no clue what goes on there, what we need to
> do, etc.
So to answer my own question, there is a comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
So these patches should be okay. Any other comments? If not, I will
formally submit them tomorrow towards the net tree.
Hi,
On 10.09.21 at 16:58, Vladimir Oltean wrote:
> On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
>>> It does not really scale but we also don't have that many DSA masters to
>>> support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
>>> mv643xx_eth, cpsw, macb.
>>
>> fec, mvneta, mvpp2, i210/igb.
>
> I can probably double that list only with Freescale/NXP Ethernet
> drivers, some of which are not even submitted to mainline. To name some
> mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
> Also consider that DSA/switchdev drivers can also be DSA masters of
> their own, we have boards doing that too.
>
> Anyway, I've decided to at least try and accept the fact that DSA
> masters will unregister their net_device on shutdown, and attempt to do
> something sane for all DSA switches in that case.
>
> Attached are two patches (they are fairly big so I won't paste them
> inline, and I would like initial feedback before posting them to the
> list).
>
> As mentioned in those patches, the shutdown ordering guarantee is still
> very important, I still have no clue what goes on there, what we need to
> do, etc.
>
I tested these patches with my 5.10 kernel (based on Gregs 5.10.27 stable
kernel) and while I do not see the message "unregister_netdevice: waiting
for eth0 to become free. Usage count = 2." any more the shutdown/reboot hangs, too.
After a few attempts without any error messages on the console I was able to get a
stack trace. Something still seems to go wrong in bcm2835_spi_shutdown() (see attachment).
I have not had the time yet to investigate this further (or to test the patches
with a newer kernel).
Regards,
Lino
On Sun, Sep 12, 2021 at 10:19:24PM +0200, Lino Sanfilippo wrote:
>
> Hi,
>
> On 10.09.21 at 16:58, Vladimir Oltean wrote:
> > On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
> >>> It does not really scale but we also don't have that many DSA masters to
> >>> support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> >>> mv643xx_eth, cpsw, macb.
> >>
> >> fec, mvneta, mvpp2, i210/igb.
> >
> > I can probably double that list only with Freescale/NXP Ethernet
> > drivers, some of which are not even submitted to mainline. To name some
> > mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
> > Also consider that DSA/switchdev drivers can also be DSA masters of
> > their own, we have boards doing that too.
> >
> > Anyway, I've decided to at least try and accept the fact that DSA
> > masters will unregister their net_device on shutdown, and attempt to do
> > something sane for all DSA switches in that case.
> >
> > Attached are two patches (they are fairly big so I won't paste them
> > inline, and I would like initial feedback before posting them to the
> > list).
> >
> > As mentioned in those patches, the shutdown ordering guarantee is still
> > very important, I still have no clue what goes on there, what we need to
> > do, etc.
> >
>
> I tested these patches with my 5.10 kernel (based on Gregs 5.10.27 stable
> kernel) and while I do not see the message "unregister_netdevice: waiting
> for eth0 to become free. Usage count = 2." any more the shutdown/reboot hangs, too.
> After a few attempts without any error messages on the console I was able to get a
> stack trace. Something still seems to go wrong in bcm2835_spi_shutdown() (see attachment).
> I have not had the time yet to investigate this further (or to test the patches
> with a newer kernel).
Could you post the full kernel output? The picture you've posted is
truncated and only shows a WARN_ON in rpi_firmware_transaction and is
probably a symptom and not the issue (which is above and not shown).
Hi,
> Gesendet: Sonntag, 12. September 2021 um 22:29 Uhr
> Von: "Vladimir Oltean" <[email protected]>
> An: "Lino Sanfilippo" <[email protected]>
> Cc: "Andrew Lunn" <[email protected]>, "Florian Fainelli" <[email protected]>, "Saravana Kannan" <[email protected]>, "Rafael J. Wysocki" <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
> Betreff: Re: [PATCH 0/3] Fix for KSZ DSA switch shutdown
>
> On Sun, Sep 12, 2021 at 10:19:24PM +0200, Lino Sanfilippo wrote:
> >
> > Hi,
> >
> > On 10.09.21 at 16:58, Vladimir Oltean wrote:
> > > On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
> > >>> It does not really scale but we also don't have that many DSA masters to
> > >>> support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> > >>> mv643xx_eth, cpsw, macb.
> > >>
> > >> fec, mvneta, mvpp2, i210/igb.
> > >
> > > I can probably double that list only with Freescale/NXP Ethernet
> > > drivers, some of which are not even submitted to mainline. To name some
> > > mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
> > > Also consider that DSA/switchdev drivers can also be DSA masters of
> > > their own, we have boards doing that too.
> > >
> > > Anyway, I've decided to at least try and accept the fact that DSA
> > > masters will unregister their net_device on shutdown, and attempt to do
> > > something sane for all DSA switches in that case.
> > >
> > > Attached are two patches (they are fairly big so I won't paste them
> > > inline, and I would like initial feedback before posting them to the
> > > list).
> > >
> > > As mentioned in those patches, the shutdown ordering guarantee is still
> > > very important, I still have no clue what goes on there, what we need to
> > > do, etc.
> > >
> >
> > I tested these patches with my 5.10 kernel (based on Gregs 5.10.27 stable
> > kernel) and while I do not see the message "unregister_netdevice: waiting
> > for eth0 to become free. Usage count = 2." any more the shutdown/reboot hangs, too.
> > After a few attempts without any error messages on the console I was able to get a
> > stack trace. Something still seems to go wrong in bcm2835_spi_shutdown() (see attachment).
> > I have not had the time yet to investigate this further (or to test the patches
> > with a newer kernel).
>
> Could you post the full kernel output? The picture you've posted is
> truncated and only shows a WARN_ON in rpi_firmware_transaction and is
> probably a symptom and not the issue (which is above and not shown).
>
Unfortunately I dont see anything in the kernel log. The console output is all I get,
thats why I made the photo.
Regards,
Lino