This patch series fixes a system hang I got each time i tried to shutdown
or reboot a system that uses a KSZ9897 as a DSA switch with a broadcom
GENET network device as the DSA master device. At the time the system hangs
the message "unregister_netdevice: waiting for eth0 to become free. Usage
count = 2." is dumped periodically to the console.
After some investigation I found the reason to be unreleased references to
the master device which are still held by the slave devices at the time the
system is shut down (I have two slave devices in use).
While these references are supposed to be released in ksz_switch_remove()
this function never gets the chance to be called due to the system hang at
the master device deregistration which happens before ksz_switch_remove()
is called.
The fix is to make sure that the master device references are already
released when the device is unregistered. For this reason PATCH1 provides
a new function dsa_tree_shutdown() that can be called by DSA drivers to
untear the DSA switch at shutdown. PATCH2 uses this function in a new
helper function for KSZ switches to properly shutdown the KSZ switch.
PATCH 3 uses the new helper function in the KSZ9477 shutdown handler.
Theses patches have been tested on a Raspberry PI 5.10 kernel with a
KSZ9897. The patches have been adjusted to apply against net-next and are
compile tested with next-next.
Lino Sanfilippo (3):
net: dsa: introduce function dsa_tree_shutdown()
net: dsa: microchip: provide the function ksz_switch_shutdown()
net: dsa: microchip: tear down DSA tree at system shutdown
drivers/net/dsa/microchip/ksz9477.c | 12 +++++++++++-
drivers/net/dsa/microchip/ksz_common.c | 13 +++++++++++++
drivers/net/dsa/microchip/ksz_common.h | 1 +
include/net/dsa.h | 1 +
net/dsa/dsa2.c | 8 ++++++++
5 files changed, 34 insertions(+), 1 deletion(-)
base-commit: 626bf91a292e2035af5b9d9cce35c5c138dfe06d
--
2.33.0
On Thu, Sep 09, 2021 at 11:53:21AM +0200, Lino Sanfilippo wrote:
> This patch series fixes a system hang I got each time i tried to shutdown
> or reboot a system that uses a KSZ9897 as a DSA switch with a broadcom
> GENET network device as the DSA master device. At the time the system hangs
> the message "unregister_netdevice: waiting for eth0 to become free. Usage
> count = 2." is dumped periodically to the console.
>
> After some investigation I found the reason to be unreleased references to
> the master device which are still held by the slave devices at the time the
> system is shut down (I have two slave devices in use).
>
> While these references are supposed to be released in ksz_switch_remove()
> this function never gets the chance to be called due to the system hang at
> the master device deregistration which happens before ksz_switch_remove()
> is called.
>
> The fix is to make sure that the master device references are already
> released when the device is unregistered. For this reason PATCH1 provides
> a new function dsa_tree_shutdown() that can be called by DSA drivers to
> untear the DSA switch at shutdown. PATCH2 uses this function in a new
> helper function for KSZ switches to properly shutdown the KSZ switch.
> PATCH 3 uses the new helper function in the KSZ9477 shutdown handler.
>
> Theses patches have been tested on a Raspberry PI 5.10 kernel with a
> KSZ9897. The patches have been adjusted to apply against net-next and are
> compile tested with next-next.
Can you try this patch
commit 07b90056cb15ff9877dca0d8f1b6583d1051f724
Author: Vladimir Oltean <[email protected]>
Date: Tue Jan 12 01:09:43 2021 +0200
net: dsa: unbind all switches from tree when DSA master unbinds
Currently the following happens when a DSA master driver unbinds while
there are DSA switches attached to it:
Hi,
On 09.09.21 at 12:14, Vladimir Oltean wrote:
>
> Can you try this patch
>
> commit 07b90056cb15ff9877dca0d8f1b6583d1051f724
> Author: Vladimir Oltean <[email protected]>
> Date: Tue Jan 12 01:09:43 2021 +0200
>
> net: dsa: unbind all switches from tree when DSA master unbinds
>
> Currently the following happens when a DSA master driver unbinds while
> there are DSA switches attached to it:
>
This patch is already part of the kernel which shows the described shutdown issues.
Regards,
Lino
On Thu, Sep 09, 2021 at 01:08:26PM +0200, Lino Sanfilippo wrote:
> Hi,
>
> On 09.09.21 at 12:14, Vladimir Oltean wrote:
> >
> > Can you try this patch
> >
> > commit 07b90056cb15ff9877dca0d8f1b6583d1051f724
> > Author: Vladimir Oltean <[email protected]>
> > Date: Tue Jan 12 01:09:43 2021 +0200
> >
> > net: dsa: unbind all switches from tree when DSA master unbinds
> >
> > Currently the following happens when a DSA master driver unbinds while
> > there are DSA switches attached to it:
> >
>
> This patch is already part of the kernel which shows the described shutdown issues.
How can I reproduce this issue?
When I test with sysrq-o, I do see the various DSA trees getting torn
down:
[ 16.731468] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) show-all-locks(d) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) show-blocked-tasks(w) dump-ftrace-buffer(z)
[ 29.912535] sysrq: Power Off
[ 29.917806] kvm: exiting hardware virtualization
[ 29.988036] device swp0 left promiscuous mode
[ 30.370424] sja1105 spi2.0: Link is Down
[ 30.402790] DSA: tree 1 torn down
[ 30.495096] device swp2 left promiscuous mode
[ 31.011576] sja1105 spi2.1: Link is Down
[ 31.032925] DSA: tree 2 torn down
[ 31.074226] reboot: Power down
I feel that something is missing in your system. Is the device link
created? Is it deleted before going into effect on shutdown?
On Thu, Sep 09, 2021 at 11:53:21AM +0200, Lino Sanfilippo wrote:
> This patch series fixes a system hang I got each time i tried to shutdown
> or reboot a system that uses a KSZ9897 as a DSA switch with a broadcom
> GENET network device as the DSA master device. At the time the system hangs
> the message "unregister_netdevice: waiting for eth0 to become free. Usage
> count = 2." is dumped periodically to the console.
>
> After some investigation I found the reason to be unreleased references to
> the master device which are still held by the slave devices at the time the
> system is shut down (I have two slave devices in use).
>
> While these references are supposed to be released in ksz_switch_remove()
> this function never gets the chance to be called due to the system hang at
> the master device deregistration which happens before ksz_switch_remove()
> is called.
>
> The fix is to make sure that the master device references are already
> released when the device is unregistered. For this reason PATCH1 provides
> a new function dsa_tree_shutdown() that can be called by DSA drivers to
> untear the DSA switch at shutdown. PATCH2 uses this function in a new
> helper function for KSZ switches to properly shutdown the KSZ switch.
> PATCH 3 uses the new helper function in the KSZ9477 shutdown handler.
I agree with Vladimir here. Shutdown works without issue on mv88e6xxx,
i do it frequently. I'm sure other developers shutdown there devices
at the end of the edit/compile/test cycle. If there was a generic
problem, we would probably know about it. So it seems like there is
something specific to your system which breaks the reference
counting. We need to understand that first, then we can see how we fix
it.
>
> Theses patches have been tested on a Raspberry PI 5.10 kernel with a
> KSZ9897. The patches have been adjusted to apply against net-next and are
> compile tested with next-next.
Is the switch on a hat? Are you using DT overlays?
Andrew
On Thu, Sep 09, 2021 at 02:42:48PM +0300, Vladimir Oltean wrote:
> I feel that something is missing in your system. Is the device link
> created? Is it deleted before going into effect on shutdown?
So in case my questions were confusing, you can check the presence of
the device links via sysfs.
On my board, eno2 is the top-level DSA master, there is a switch which
is PCIe PF 0000:00:00.5 which is its consumer:
ls -la /sys/class/net/eno2/device/consumer\:pci\:0000\:00\:00.5
lrwxrwxrwx 1 root root 0 Jan 1 00:00 /sys/class/net/eno2/device/consumer:pci:0000:00:00.5 -> ../../../../../virtual/devlink/pci:0000:00:00.2--pci:0000:00:00.5
In turn, that switch is a DSA master on two ports for SPI-attached switches:
ls -la /sys/class/net/swp0/device/consumer\:spi\:spi2.*
lrwxrwxrwx 1 root root 0 Jan 1 00:04 /sys/class/net/swp0/device/consumer:spi:spi2.0 -> ../../../../../virtual/devlink/pci:0000:00:00.5--spi:spi2.0
lrwxrwxrwx 1 root root 0 Jan 1 00:04 /sys/class/net/swp0/device/consumer:spi:spi2.1 -> ../../../../../virtual/devlink/pci:0000:00:00.5--spi:spi2.1
Do you see similar things on your 5.10 kernel?
Please note that I don't think that particular patch with device links
was backported to v5.10, at least I don't see it when I run:
git tag --contains 07b90056cb15f
So how did it reach your tree?
On Mon, Sep 13, 2021 at 12:32:14PM +0200, Lino Sanfilippo wrote:
> Hi,
>
> > Gesendet: Sonntag, 12. September 2021 um 22:29 Uhr
> > Von: "Vladimir Oltean" <[email protected]>
> > An: "Lino Sanfilippo" <[email protected]>
> > Cc: "Andrew Lunn" <[email protected]>, "Florian Fainelli" <[email protected]>, "Saravana Kannan" <[email protected]>, "Rafael J. Wysocki" <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
> > Betreff: Re: [PATCH 0/3] Fix for KSZ DSA switch shutdown
> >
> > On Sun, Sep 12, 2021 at 10:19:24PM +0200, Lino Sanfilippo wrote:
> > >
> > > Hi,
> > >
> > > On 10.09.21 at 16:58, Vladimir Oltean wrote:
> > > > On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
> > > >>> It does not really scale but we also don't have that many DSA masters to
> > > >>> support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> > > >>> mv643xx_eth, cpsw, macb.
> > > >>
> > > >> fec, mvneta, mvpp2, i210/igb.
> > > >
> > > > I can probably double that list only with Freescale/NXP Ethernet
> > > > drivers, some of which are not even submitted to mainline. To name some
> > > > mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
> > > > Also consider that DSA/switchdev drivers can also be DSA masters of
> > > > their own, we have boards doing that too.
> > > >
> > > > Anyway, I've decided to at least try and accept the fact that DSA
> > > > masters will unregister their net_device on shutdown, and attempt to do
> > > > something sane for all DSA switches in that case.
> > > >
> > > > Attached are two patches (they are fairly big so I won't paste them
> > > > inline, and I would like initial feedback before posting them to the
> > > > list).
> > > >
> > > > As mentioned in those patches, the shutdown ordering guarantee is still
> > > > very important, I still have no clue what goes on there, what we need to
> > > > do, etc.
> > > >
> > >
> > > I tested these patches with my 5.10 kernel (based on Gregs 5.10.27 stable
> > > kernel) and while I do not see the message "unregister_netdevice: waiting
> > > for eth0 to become free. Usage count = 2." any more the shutdown/reboot hangs, too.
> > > After a few attempts without any error messages on the console I was able to get a
> > > stack trace. Something still seems to go wrong in bcm2835_spi_shutdown() (see attachment).
> > > I have not had the time yet to investigate this further (or to test the patches
> > > with a newer kernel).
> >
> > Could you post the full kernel output? The picture you've posted is
> > truncated and only shows a WARN_ON in rpi_firmware_transaction and is
> > probably a symptom and not the issue (which is above and not shown).
> >
>
> Unfortunately I dont see anything in the kernel log. The console output is all I get,
> thats why I made the photo.
To clarify, are you saying nothing above this line gets printed? Because
the part of the log you've posted in the picture is pretty much
unworkable:
[ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
How do you access the device's serial console? Use a program with a
scrollback buffer like GNU screen or something.
> Gesendet: Montag, 13. September 2021 um 12:44 Uhr
> Von: "Vladimir Oltean" <[email protected]>
> An: "Lino Sanfilippo" <[email protected]>
> Cc: "Andrew Lunn" <[email protected]>, "Florian Fainelli" <[email protected]>, "Saravana Kannan" <[email protected]>, "Rafael J. Wysocki" <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
> Betreff: Re: [PATCH 0/3] Fix for KSZ DSA switch shutdown
>
> On Mon, Sep 13, 2021 at 12:32:14PM +0200, Lino Sanfilippo wrote:
> > Hi,
> >
> > > Gesendet: Sonntag, 12. September 2021 um 22:29 Uhr
> > > Von: "Vladimir Oltean" <[email protected]>
> > > An: "Lino Sanfilippo" <[email protected]>
> > > Cc: "Andrew Lunn" <[email protected]>, "Florian Fainelli" <[email protected]>, "Saravana Kannan" <[email protected]>, "Rafael J. Wysocki" <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
> > > Betreff: Re: [PATCH 0/3] Fix for KSZ DSA switch shutdown
> > >
> > > On Sun, Sep 12, 2021 at 10:19:24PM +0200, Lino Sanfilippo wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 10.09.21 at 16:58, Vladimir Oltean wrote:
> > > > > On Fri, Sep 10, 2021 at 01:51:56PM +0200, Andrew Lunn wrote:
> > > > >>> It does not really scale but we also don't have that many DSA masters to
> > > > >>> support, I believe I can name them all: bcmgenet, stmmac, bcmsysport, enetc,
> > > > >>> mv643xx_eth, cpsw, macb.
> > > > >>
> > > > >> fec, mvneta, mvpp2, i210/igb.
> > > > >
> > > > > I can probably double that list only with Freescale/NXP Ethernet
> > > > > drivers, some of which are not even submitted to mainline. To name some
> > > > > mainline drivers: gianfar, dpaa-eth, dpaa2-eth, dpaa2-switch, ucc_geth.
> > > > > Also consider that DSA/switchdev drivers can also be DSA masters of
> > > > > their own, we have boards doing that too.
> > > > >
> > > > > Anyway, I've decided to at least try and accept the fact that DSA
> > > > > masters will unregister their net_device on shutdown, and attempt to do
> > > > > something sane for all DSA switches in that case.
> > > > >
> > > > > Attached are two patches (they are fairly big so I won't paste them
> > > > > inline, and I would like initial feedback before posting them to the
> > > > > list).
> > > > >
> > > > > As mentioned in those patches, the shutdown ordering guarantee is still
> > > > > very important, I still have no clue what goes on there, what we need to
> > > > > do, etc.
> > > > >
> > > >
> > > > I tested these patches with my 5.10 kernel (based on Gregs 5.10.27 stable
> > > > kernel) and while I do not see the message "unregister_netdevice: waiting
> > > > for eth0 to become free. Usage count = 2." any more the shutdown/reboot hangs, too.
> > > > After a few attempts without any error messages on the console I was able to get a
> > > > stack trace. Something still seems to go wrong in bcm2835_spi_shutdown() (see attachment).
> > > > I have not had the time yet to investigate this further (or to test the patches
> > > > with a newer kernel).
> > >
> > > Could you post the full kernel output? The picture you've posted is
> > > truncated and only shows a WARN_ON in rpi_firmware_transaction and is
> > > probably a symptom and not the issue (which is above and not shown).
> > >
> >
> > Unfortunately I dont see anything in the kernel log. The console output is all I get,
> > thats why I made the photo.
>
> To clarify, are you saying nothing above this line gets printed? Because
> the part of the log you've posted in the picture is pretty much
> unworkable:
>
> [ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
>
> How do you access the device's serial console? Use a program with a
> scrollback buffer like GNU screen or something.
>
Ah no, this is not over a serial console. This is what I see via hdmi. I do not have a working serial connection yet.
Sorry I know this trace part is not very useful, I will try to get a full dump.
On Mon, Sep 13, 2021 at 01:01:20PM +0200, Lino Sanfilippo wrote:
> > > > Could you post the full kernel output? The picture you've posted is
> > > > truncated and only shows a WARN_ON in rpi_firmware_transaction and is
> > > > probably a symptom and not the issue (which is above and not shown).
> > > >
> > >
> > > Unfortunately I dont see anything in the kernel log. The console output is all I get,
> > > thats why I made the photo.
> >
> > To clarify, are you saying nothing above this line gets printed? Because
> > the part of the log you've posted in the picture is pretty much
> > unworkable:
> >
> > [ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
> >
> > How do you access the device's serial console? Use a program with a
> > scrollback buffer like GNU screen or something.
> >
>
> Ah no, this is not over a serial console. This is what I see via hdmi. I do not have a working serial connection yet.
> Sorry I know this trace part is not very useful, I will try to get a full dump.
Lino, are you going to provide a kernel output so I could look at your new breakage?
If you could set up a pstore logger with a ramoops region, you could
dump the log after the fact. Or if HDMI is all you have, you could use
an HDMI capture card to record it. Or just record the screen you're
looking at, as long as you don't have very shaky hands, whatever...
Hi,
On 14.09.21 at 20:48, Vladimir Oltean wrote:
> On Mon, Sep 13, 2021 at 01:01:20PM +0200, Lino Sanfilippo wrote:
>>>>> Could you post the full kernel output? The picture you've posted is
>>>>> truncated and only shows a WARN_ON in rpi_firmware_transaction and is
>>>>> probably a symptom and not the issue (which is above and not shown).
>>>>>
>>>>
>>>> Unfortunately I dont see anything in the kernel log. The console output is all I get,
>>>> thats why I made the photo.
>>>
>>> To clarify, are you saying nothing above this line gets printed? Because
>>> the part of the log you've posted in the picture is pretty much
>>> unworkable:
>>>
>>> [ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
>>>
>>> How do you access the device's serial console? Use a program with a
>>> scrollback buffer like GNU screen or something.
>>>
>>
>> Ah no, this is not over a serial console. This is what I see via hdmi. I do not have a working serial connection yet.
>> Sorry I know this trace part is not very useful, I will try to get a full dump.
>
> Lino, are you going to provide a kernel output so I could look at your new breakage?
> If you could set up a pstore logger with a ramoops region, you could
> dump the log after the fact. Or if HDMI is all you have, you could use
> an HDMI capture card to record it. Or just record the screen you're
> looking at, as long as you don't have very shaky hands, whatever...
>
Yes, I will try to get something useful. I have already set up a serial connection
now. I still see the shutdown stopping with your patch but I have not seen the
kernel dump any more. I will try further and provide a dump as soon as I am successful.
Regards,
Lino
Hi Vladimir,
On 15.09.21 at 07:42, Lino Sanfilippo wrote:
> On 14.09.21 at 20:48, Vladimir Oltean wrote:
>> On Mon, Sep 13, 2021 at 01:01:20PM +0200, Lino Sanfilippo wrote:
>>>>>> Could you post the full kernel output? The picture you've posted is
>>>>>> truncated and only shows a WARN_ON in rpi_firmware_transaction and is
>>>>>> probably a symptom and not the issue (which is above and not shown).
>>>>>>
>>>>>
>>>>> Unfortunately I dont see anything in the kernel log. The console output is all I get,
>>>>> thats why I made the photo.
>>>>
>>>> To clarify, are you saying nothing above this line gets printed? Because
>>>> the part of the log you've posted in the picture is pretty much
>>>> unworkable:
>>>>
>>>> [ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
>>>>
>>>> How do you access the device's serial console? Use a program with a
>>>> scrollback buffer like GNU screen or something.
>>>>
>>>
>>> Ah no, this is not over a serial console. This is what I see via hdmi. I do not have a working serial connection yet.
>>> Sorry I know this trace part is not very useful, I will try to get a full dump.
>>
>> Lino, are you going to provide a kernel output so I could look at your new breakage?
>> If you could set up a pstore logger with a ramoops region, you could
>> dump the log after the fact. Or if HDMI is all you have, you could use
>> an HDMI capture card to record it. Or just record the screen you're
>> looking at, as long as you don't have very shaky hands, whatever...
>>
>
> Yes, I will try to get something useful. I have already set up a serial connection
> now. I still see the shutdown stopping with your patch but I have not seen the
> kernel dump any more. I will try further and provide a dump as soon as I am successful.
>
Sorry for the delay. I was finally able to do some tests and get a dump via the serial console.
I tested with the latest Raspberry Pi kernel 5.10.y. Based on commit
4117cba235d24a7c4630dc38cb55cc80a04f5cf3. I applied your patches and got the following result
at shutdown:
raspberrypi login: [ 58.754533] ------------[ cut here ]------------
[ 58.760053] kernel BUG at drivers/net/phy/mdio_bus.c:651!
[ 58.766361] Internal error: Oops - BUG: 0 [#1] SMP ARM
[ 58.772376] Modules linked in: 8021q garp at24 tag_ksz tpm_tis_spi ksz9477_spi tpm_tis_core ksz9477 ksz_common tpm rts
[ 58.837539] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G C 5.10.63-RP_PURE_510_VLADFIX+ #3
[ 58.848388] Hardware name: BCM2711
[ 58.852875] PC is at mdiobus_free+0x4c/0x50
[ 58.858143] LR is at devm_mdiobus_free+0x1c/0x20
[ 58.863853] pc : [<c08c9218>] lr : [<c08c1898>] psr: 80000013
[ 58.871212] sp : c18fdc38 ip : c18fdc48 fp : c18fdc44
[ 58.877505] r10: 00000000 r9 : c0867104 r8 : c18fdc5c
[ 58.883823] r7 : 00000013 r6 : c31c8000 r5 : c3a50000 r4 : c379db80
[ 58.891442] r3 : c2ab4000 r2 : 00000002 r1 : c379dbc0 r0 : c2ab4000
[ 58.899037] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 58.907297] Control: 30c5383d Table: 03ac92c0 DAC: 55555555
[ 58.914139] Process systemd-shutdow (pid: 1, stack limit = 0xff8113c1)
[ 58.921774] Stack: (0xc18fdc38 to 0xc18fe000)
[ 58.927285] dc20: c18fdc54 c18fdc48
[ 58.936601] dc40: c08c1898 c08c91d8 c18fdc94 c18fdc58 c0866dac c08c1888 c31c819c c3527180
[ 58.945921] dc60: c332d200 c1405048 c32f8800 c31c8000 00000000 bf191010 00000000 c32f8800
[ 58.955289] dc80: c1095f3c c1aa6454 c18fdcac c18fdc98 c086715c c0866be8 c31c8000 00000000
[ 58.964644] dca0: c18fdccc c18fdcb0 c0862c7c c0867128 c1a42e30 c31c8000 c14f7cf0 00000000
[ 58.974018] dcc0: c18fdcdc c18fdcd0 c0862d40 c0862b68 c18fdcfc c18fdce0 c08613dc c0862d2c
[ 58.983391] dce0: c31c8000 00000a68 c08ba6cc 00000000 c18fdd44 c18fdd00 c085c710 c086130c
[ 58.992778] dd00: c0331394 c0332604 60000013 c18fdd74 c3656294 c1405048 c31c8000 c31c8000
[ 59.002140] dd20: 00000000 c08ba6cc c160657c c155c018 c1095f3c c1aa6454 c18fdd5c c18fdd48
[ 59.011521] dd40: c08ba6a8 c085c58c 00000000 00000000 c18fdd6c c18fdd60 c08ba6e4 c08ba670
[ 59.020921] dd60: c18fdd9c c18fdd70 c085bc84 c08ba6d8 c18fdd8c c3656200 c3656394 c1405048
[ 59.030334] dd80: c18fdda4 c32f8800 c32f8800 00000003 c18fddbc c18fdda0 c08bab7c c085bc20
[ 59.039737] dda0: c32f8b80 c32f8800 00000000 c160657c c18fdddc c18fddc0 bf182554 c08bab4c
[ 59.049164] ddc0: c1aa6400 c1a6e810 c1aa6410 c160657c c18fddf4 c18fdde0 bf1825a8 bf18252c
[ 59.058602] dde0: c1aa6414 c1a6e810 c18fde04 c18fddf8 c0863dec bf182598 c18fde3c c18fde08
[ 59.068057] de00: c085fd9c c0863dcc c18fde3c c1095f2c c024865c 00000000 00000000 620bef00
[ 59.077487] de20: c140f510 fee1dead c18fc000 00000058 c18fde4c c18fde40 c0249c84 c085fc0c
[ 59.086920] de40: c18fde64 c18fde50 c0249d74 c0249c4c 01234567 00000000 c18fdf94 c18fde68
[ 59.096386] de60: c024a018 c0249d64 c18fded4 c31b0c00 00000024 c18fdf58 00000005 c0441cec
[ 59.105852] de80: c18fdec4 c18fde90 c0441b30 c049852c 00000000 c18fdea0 c073ad04 00000024
[ 59.115330] dea0: c31b0c00 c18fdf58 c18fded4 c31b0c00 00000005 00000000 c18fdf4c c18fdec8
[ 59.124821] dec0: c0441cec c0425cb0 c18fded0 c18fded4 00000000 00000005 00000000 00000024
[ 59.134317] dee0: c18fdeec 00000005 c0200074 bec45250 00000004 bec45f62 00000010 bec45264
[ 59.143792] df00: 00000005 bec4531c 0000000a b6d10040 00000001 c0200e70 ffffe000 c1546a80
[ 59.153282] df20: 00000000 c0467268 c18fdf4c c1405048 c31b0c00 bec4528c 00000000 00000000
[ 59.162787] df40: c18fdf94 c18fdf50 c0441e6c c0441c50 00000000 00000000 00000000 00000000
[ 59.172269] df60: c18fdf94 c1405048 c0331394 c1405048 bec4531c 00000000 00000000 00000000
[ 59.181763] df80: 00000058 c0200204 c18fdfa4 c18fdf98 c024a16c c0249f10 00000000 c18fdfa8
[ 59.191250] dfa0: c0200040 c024a160 00000000 00000000 fee1dead 28121969 01234567 620bef00
[ 59.200735] dfc0: 00000000 00000000 00000000 00000058 00000fff bec45be8 00000000 00476b80
[ 59.210245] dfe0: 00488e3c bec45b68 004734a8 b6e4ca38 60000010 fee1dead 00000000 00000000
[ 59.219759] Backtrace:
[ 59.223546] [<c08c91cc>] (mdiobus_free) from [<c08c1898>] (devm_mdiobus_free+0x1c/0x20)
[ 59.232909] [<c08c187c>] (devm_mdiobus_free) from [<c0866dac>] (release_nodes+0x1d0/0x220)
[ 59.242551] [<c0866bdc>] (release_nodes) from [<c086715c>] (devres_release_all+0x40/0x60)
[ 59.252132] r10:c1aa6454 r9:c1095f3c r8:c32f8800 r7:00000000 r6:bf191010 r5:00000000
[ 59.261338] r4:c31c8000
[ 59.265239] [<c086711c>] (devres_release_all) from [<c0862c7c>] (device_release_driver_internal+0x120/0x1c4)
[ 59.276479] r5:00000000 r4:c31c8000
[ 59.281440] [<c0862b5c>] (device_release_driver_internal) from [<c0862d40>] (device_release_driver+0x20/0x24)
[ 59.292802] r7:00000000 r6:c14f7cf0 r5:c31c8000 r4:c1a42e30
[ 59.299900] [<c0862d20>] (device_release_driver) from [<c08613dc>] (bus_remove_device+0xdc/0x108)
[ 59.310267] [<c0861300>] (bus_remove_device) from [<c085c710>] (device_del+0x190/0x428)
[ 59.319748] r7:00000000 r6:c08ba6cc r5:00000a68 r4:c31c8000
[ 59.326896] [<c085c580>] (device_del) from [<c08ba6a8>] (spi_unregister_device+0x44/0x68)
[ 59.336583] r10:c1aa6454 r9:c1095f3c r8:c155c018 r7:c160657c r6:c08ba6cc r5:00000000
[ 59.345924] r4:c31c8000
[ 59.349971] [<c08ba664>] (spi_unregister_device) from [<c08ba6e4>] (__unregister+0x18/0x20)
[ 59.359870] r5:00000000 r4:00000000
[ 59.364972] [<c08ba6cc>] (__unregister) from [<c085bc84>] (device_for_each_child+0x70/0xb4)
[ 59.374899] [<c085bc14>] (device_for_each_child) from [<c08bab7c>] (spi_unregister_controller+0x3c/0x128)
[ 59.385979] r6:00000003 r5:c32f8800 r4:c32f8800
[ 59.392086] [<c08bab40>] (spi_unregister_controller) from [<bf182554>] (bcm2835_spi_remove+0x34/0x6c [spi_bcm2835])
[ 59.404000] r7:c160657c r6:00000000 r5:c32f8800 r4:c32f8b80
[ 59.411084] [<bf182520>] (bcm2835_spi_remove [spi_bcm2835]) from [<bf1825a8>] (bcm2835_spi_shutdown+0x1c/0x38 [spi_bc)
[ 59.423755] r7:c160657c r6:c1aa6410 r5:c1a6e810 r4:c1aa6400
[ 59.430847] [<bf18258c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863dec>] (platform_drv_shutdown+0x2c/0x30)
[ 59.442613] r5:c1a6e810 r4:c1aa6414
[ 59.447635] [<c0863dc0>] (platform_drv_shutdown) from [<c085fd9c>] (device_shutdown+0x19c/0x24c)
[ 59.457932] [<c085fc00>] (device_shutdown) from [<c0249c84>] (kernel_restart_prepare+0x44/0x48)
[ 59.468135] r10:00000058 r9:c18fc000 r8:fee1dead r7:c140f510 r6:620bef00 r5:00000000
[ 59.477470] r4:00000000
[ 59.481509] [<c0249c40>] (kernel_restart_prepare) from [<c0249d74>] (kernel_restart+0x1c/0x60)
[ 59.491653] [<c0249d58>] (kernel_restart) from [<c024a018>] (__do_sys_reboot+0x114/0x1f8)
[ 59.501359] r5:00000000 r4:01234567
[ 59.506447] [<c0249f04>] (__do_sys_reboot) from [<c024a16c>] (sys_reboot+0x18/0x1c)
[ 59.515628] r8:c0200204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
[ 59.523857] [<c024a154>] (sys_reboot) from [<c0200040>] (ret_fast_syscall+0x0/0x28)
[ 59.533038] Exception stack(0xc18fdfa8 to 0xc18fdff0)
[ 59.539607] dfa0: 00000000 00000000 fee1dead 28121969 01234567 620bef00
[ 59.549318] dfc0: 00000000 00000000 00000000 00000058 00000fff bec45be8 00000000 00476b80
[ 59.559026] dfe0: 00488e3c bec45b68 004734a8 b6e4ca38
[ 59.565596] Code: ebfe49f5 e89da800 ebed72a3 e89da800 (e7f001f2)
[ 59.573246] ---[ end trace 7d800ce7b5664bb6 ]---
[ 59.579413] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 59.588634] Rebooting in 10 seconds..
The concerning source code line 651 is in my case:
void mdiobus_free(struct mii_bus *bus)
{
/* For compatibility with error handling in drivers. */
if (bus->state == MDIOBUS_ALLOCATED) {
kfree(bus);
return;
}
651< BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
bus->state = MDIOBUS_RELEASED;
put_device(&bus->dev);
}
EXPORT_SYMBOL(mdiobus_free);
I tested with both versions of your patchset, with the same result. I also tested
with a RP 5.14 kernel (the latest RP kernel) but I did not see the original issue
(i.e. the system hang) here for some reason.
I then tried to get the net-next kernel running on my system but without success so far. So for
now the result with the RP 5.10 is all I can offer. I hope that helps a bit nevertheless.
Regards,
Lino
On Sat, Sep 18, 2021 at 09:37:17PM +0200, Lino Sanfilippo wrote:
> Hi Vladimir,
>
> On 15.09.21 at 07:42, Lino Sanfilippo wrote:
> > On 14.09.21 at 20:48, Vladimir Oltean wrote:
> >> On Mon, Sep 13, 2021 at 01:01:20PM +0200, Lino Sanfilippo wrote:
> >>>>>> Could you post the full kernel output? The picture you've posted is
> >>>>>> truncated and only shows a WARN_ON in rpi_firmware_transaction and is
> >>>>>> probably a symptom and not the issue (which is above and not shown).
> >>>>>>
> >>>>>
> >>>>> Unfortunately I dont see anything in the kernel log. The console output is all I get,
> >>>>> thats why I made the photo.
> >>>>
> >>>> To clarify, are you saying nothing above this line gets printed? Because
> >>>> the part of the log you've posted in the picture is pretty much
> >>>> unworkable:
> >>>>
> >>>> [ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
> >>>>
> >>>> How do you access the device's serial console? Use a program with a
> >>>> scrollback buffer like GNU screen or something.
> >>>>
> >>>
> >>> Ah no, this is not over a serial console. This is what I see via hdmi. I do not have a working serial connection yet.
> >>> Sorry I know this trace part is not very useful, I will try to get a full dump.
> >>
> >> Lino, are you going to provide a kernel output so I could look at your new breakage?
> >> If you could set up a pstore logger with a ramoops region, you could
> >> dump the log after the fact. Or if HDMI is all you have, you could use
> >> an HDMI capture card to record it. Or just record the screen you're
> >> looking at, as long as you don't have very shaky hands, whatever...
> >>
> >
> > Yes, I will try to get something useful. I have already set up a serial connection
> > now. I still see the shutdown stopping with your patch but I have not seen the
> > kernel dump any more. I will try further and provide a dump as soon as I am successful.
> >
>
> Sorry for the delay. I was finally able to do some tests and get a dump via the serial console.
> I tested with the latest Raspberry Pi kernel 5.10.y. Based on commit
> 4117cba235d24a7c4630dc38cb55cc80a04f5cf3. I applied your patches and got the following result
> at shutdown:
>
> raspberrypi login: [ 58.754533] ------------[ cut here ]------------
> [ 58.760053] kernel BUG at drivers/net/phy/mdio_bus.c:651!
> [ 58.766361] Internal error: Oops - BUG: 0 [#1] SMP ARM
> [ 58.772376] Modules linked in: 8021q garp at24 tag_ksz tpm_tis_spi ksz9477_spi tpm_tis_core ksz9477 ksz_common tpm rts
> [ 58.837539] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G C 5.10.63-RP_PURE_510_VLADFIX+ #3
> [ 58.848388] Hardware name: BCM2711
> [ 58.852875] PC is at mdiobus_free+0x4c/0x50
> [ 58.858143] LR is at devm_mdiobus_free+0x1c/0x20
> [ 58.863853] pc : [<c08c9218>] lr : [<c08c1898>] psr: 80000013
> [ 58.871212] sp : c18fdc38 ip : c18fdc48 fp : c18fdc44
> [ 58.877505] r10: 00000000 r9 : c0867104 r8 : c18fdc5c
> [ 58.883823] r7 : 00000013 r6 : c31c8000 r5 : c3a50000 r4 : c379db80
> [ 58.891442] r3 : c2ab4000 r2 : 00000002 r1 : c379dbc0 r0 : c2ab4000
> [ 58.899037] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> [ 58.907297] Control: 30c5383d Table: 03ac92c0 DAC: 55555555
> [ 58.914139] Process systemd-shutdow (pid: 1, stack limit = 0xff8113c1)
> [ 58.921774] Stack: (0xc18fdc38 to 0xc18fe000)
> [ 58.927285] dc20: c18fdc54 c18fdc48
> [ 58.936601] dc40: c08c1898 c08c91d8 c18fdc94 c18fdc58 c0866dac c08c1888 c31c819c c3527180
> [ 58.945921] dc60: c332d200 c1405048 c32f8800 c31c8000 00000000 bf191010 00000000 c32f8800
> [ 58.955289] dc80: c1095f3c c1aa6454 c18fdcac c18fdc98 c086715c c0866be8 c31c8000 00000000
> [ 58.964644] dca0: c18fdccc c18fdcb0 c0862c7c c0867128 c1a42e30 c31c8000 c14f7cf0 00000000
> [ 58.974018] dcc0: c18fdcdc c18fdcd0 c0862d40 c0862b68 c18fdcfc c18fdce0 c08613dc c0862d2c
> [ 58.983391] dce0: c31c8000 00000a68 c08ba6cc 00000000 c18fdd44 c18fdd00 c085c710 c086130c
> [ 58.992778] dd00: c0331394 c0332604 60000013 c18fdd74 c3656294 c1405048 c31c8000 c31c8000
> [ 59.002140] dd20: 00000000 c08ba6cc c160657c c155c018 c1095f3c c1aa6454 c18fdd5c c18fdd48
> [ 59.011521] dd40: c08ba6a8 c085c58c 00000000 00000000 c18fdd6c c18fdd60 c08ba6e4 c08ba670
> [ 59.020921] dd60: c18fdd9c c18fdd70 c085bc84 c08ba6d8 c18fdd8c c3656200 c3656394 c1405048
> [ 59.030334] dd80: c18fdda4 c32f8800 c32f8800 00000003 c18fddbc c18fdda0 c08bab7c c085bc20
> [ 59.039737] dda0: c32f8b80 c32f8800 00000000 c160657c c18fdddc c18fddc0 bf182554 c08bab4c
> [ 59.049164] ddc0: c1aa6400 c1a6e810 c1aa6410 c160657c c18fddf4 c18fdde0 bf1825a8 bf18252c
> [ 59.058602] dde0: c1aa6414 c1a6e810 c18fde04 c18fddf8 c0863dec bf182598 c18fde3c c18fde08
> [ 59.068057] de00: c085fd9c c0863dcc c18fde3c c1095f2c c024865c 00000000 00000000 620bef00
> [ 59.077487] de20: c140f510 fee1dead c18fc000 00000058 c18fde4c c18fde40 c0249c84 c085fc0c
> [ 59.086920] de40: c18fde64 c18fde50 c0249d74 c0249c4c 01234567 00000000 c18fdf94 c18fde68
> [ 59.096386] de60: c024a018 c0249d64 c18fded4 c31b0c00 00000024 c18fdf58 00000005 c0441cec
> [ 59.105852] de80: c18fdec4 c18fde90 c0441b30 c049852c 00000000 c18fdea0 c073ad04 00000024
> [ 59.115330] dea0: c31b0c00 c18fdf58 c18fded4 c31b0c00 00000005 00000000 c18fdf4c c18fdec8
> [ 59.124821] dec0: c0441cec c0425cb0 c18fded0 c18fded4 00000000 00000005 00000000 00000024
> [ 59.134317] dee0: c18fdeec 00000005 c0200074 bec45250 00000004 bec45f62 00000010 bec45264
> [ 59.143792] df00: 00000005 bec4531c 0000000a b6d10040 00000001 c0200e70 ffffe000 c1546a80
> [ 59.153282] df20: 00000000 c0467268 c18fdf4c c1405048 c31b0c00 bec4528c 00000000 00000000
> [ 59.162787] df40: c18fdf94 c18fdf50 c0441e6c c0441c50 00000000 00000000 00000000 00000000
> [ 59.172269] df60: c18fdf94 c1405048 c0331394 c1405048 bec4531c 00000000 00000000 00000000
> [ 59.181763] df80: 00000058 c0200204 c18fdfa4 c18fdf98 c024a16c c0249f10 00000000 c18fdfa8
> [ 59.191250] dfa0: c0200040 c024a160 00000000 00000000 fee1dead 28121969 01234567 620bef00
> [ 59.200735] dfc0: 00000000 00000000 00000000 00000058 00000fff bec45be8 00000000 00476b80
> [ 59.210245] dfe0: 00488e3c bec45b68 004734a8 b6e4ca38 60000010 fee1dead 00000000 00000000
> [ 59.219759] Backtrace:
> [ 59.223546] [<c08c91cc>] (mdiobus_free) from [<c08c1898>] (devm_mdiobus_free+0x1c/0x20)
> [ 59.232909] [<c08c187c>] (devm_mdiobus_free) from [<c0866dac>] (release_nodes+0x1d0/0x220)
> [ 59.242551] [<c0866bdc>] (release_nodes) from [<c086715c>] (devres_release_all+0x40/0x60)
> [ 59.252132] r10:c1aa6454 r9:c1095f3c r8:c32f8800 r7:00000000 r6:bf191010 r5:00000000
> [ 59.261338] r4:c31c8000
> [ 59.265239] [<c086711c>] (devres_release_all) from [<c0862c7c>] (device_release_driver_internal+0x120/0x1c4)
> [ 59.276479] r5:00000000 r4:c31c8000
> [ 59.281440] [<c0862b5c>] (device_release_driver_internal) from [<c0862d40>] (device_release_driver+0x20/0x24)
> [ 59.292802] r7:00000000 r6:c14f7cf0 r5:c31c8000 r4:c1a42e30
> [ 59.299900] [<c0862d20>] (device_release_driver) from [<c08613dc>] (bus_remove_device+0xdc/0x108)
> [ 59.310267] [<c0861300>] (bus_remove_device) from [<c085c710>] (device_del+0x190/0x428)
> [ 59.319748] r7:00000000 r6:c08ba6cc r5:00000a68 r4:c31c8000
> [ 59.326896] [<c085c580>] (device_del) from [<c08ba6a8>] (spi_unregister_device+0x44/0x68)
> [ 59.336583] r10:c1aa6454 r9:c1095f3c r8:c155c018 r7:c160657c r6:c08ba6cc r5:00000000
> [ 59.345924] r4:c31c8000
> [ 59.349971] [<c08ba664>] (spi_unregister_device) from [<c08ba6e4>] (__unregister+0x18/0x20)
> [ 59.359870] r5:00000000 r4:00000000
> [ 59.364972] [<c08ba6cc>] (__unregister) from [<c085bc84>] (device_for_each_child+0x70/0xb4)
> [ 59.374899] [<c085bc14>] (device_for_each_child) from [<c08bab7c>] (spi_unregister_controller+0x3c/0x128)
> [ 59.385979] r6:00000003 r5:c32f8800 r4:c32f8800
> [ 59.392086] [<c08bab40>] (spi_unregister_controller) from [<bf182554>] (bcm2835_spi_remove+0x34/0x6c [spi_bcm2835])
> [ 59.404000] r7:c160657c r6:00000000 r5:c32f8800 r4:c32f8b80
> [ 59.411084] [<bf182520>] (bcm2835_spi_remove [spi_bcm2835]) from [<bf1825a8>] (bcm2835_spi_shutdown+0x1c/0x38 [spi_bc)
> [ 59.423755] r7:c160657c r6:c1aa6410 r5:c1a6e810 r4:c1aa6400
> [ 59.430847] [<bf18258c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863dec>] (platform_drv_shutdown+0x2c/0x30)
> [ 59.442613] r5:c1a6e810 r4:c1aa6414
> [ 59.447635] [<c0863dc0>] (platform_drv_shutdown) from [<c085fd9c>] (device_shutdown+0x19c/0x24c)
> [ 59.457932] [<c085fc00>] (device_shutdown) from [<c0249c84>] (kernel_restart_prepare+0x44/0x48)
> [ 59.468135] r10:00000058 r9:c18fc000 r8:fee1dead r7:c140f510 r6:620bef00 r5:00000000
> [ 59.477470] r4:00000000
> [ 59.481509] [<c0249c40>] (kernel_restart_prepare) from [<c0249d74>] (kernel_restart+0x1c/0x60)
> [ 59.491653] [<c0249d58>] (kernel_restart) from [<c024a018>] (__do_sys_reboot+0x114/0x1f8)
> [ 59.501359] r5:00000000 r4:01234567
> [ 59.506447] [<c0249f04>] (__do_sys_reboot) from [<c024a16c>] (sys_reboot+0x18/0x1c)
> [ 59.515628] r8:c0200204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
> [ 59.523857] [<c024a154>] (sys_reboot) from [<c0200040>] (ret_fast_syscall+0x0/0x28)
> [ 59.533038] Exception stack(0xc18fdfa8 to 0xc18fdff0)
> [ 59.539607] dfa0: 00000000 00000000 fee1dead 28121969 01234567 620bef00
> [ 59.549318] dfc0: 00000000 00000000 00000000 00000058 00000fff bec45be8 00000000 00476b80
> [ 59.559026] dfe0: 00488e3c bec45b68 004734a8 b6e4ca38
> [ 59.565596] Code: ebfe49f5 e89da800 ebed72a3 e89da800 (e7f001f2)
> [ 59.573246] ---[ end trace 7d800ce7b5664bb6 ]---
> [ 59.579413] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [ 59.588634] Rebooting in 10 seconds..
>
> The concerning source code line 651 is in my case:
>
> void mdiobus_free(struct mii_bus *bus)
> {
> /* For compatibility with error handling in drivers. */
> if (bus->state == MDIOBUS_ALLOCATED) {
> kfree(bus);
> return;
> }
>
> 651< BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
> bus->state = MDIOBUS_RELEASED;
>
> put_device(&bus->dev);
> }
> EXPORT_SYMBOL(mdiobus_free);
>
> I tested with both versions of your patchset, with the same result. I also tested
> with a RP 5.14 kernel (the latest RP kernel) but I did not see the original issue
> (i.e. the system hang) here for some reason.
>
> I then tried to get the net-next kernel running on my system but without success so far. So for
> now the result with the RP 5.10 is all I can offer. I hope that helps a bit nevertheless.
Thank you Lino, this is a very valuable report. I will send a v3 soon (not sure if today).
On Sun, Sep 19, 2021 at 01:04:12AM +0300, Vladimir Oltean wrote:
> On Sat, Sep 18, 2021 at 09:37:17PM +0200, Lino Sanfilippo wrote:
> > Hi Vladimir,
> >
> > On 15.09.21 at 07:42, Lino Sanfilippo wrote:
> > > On 14.09.21 at 20:48, Vladimir Oltean wrote:
> > >> On Mon, Sep 13, 2021 at 01:01:20PM +0200, Lino Sanfilippo wrote:
> > >>>>>> Could you post the full kernel output? The picture you've posted is
> > >>>>>> truncated and only shows a WARN_ON in rpi_firmware_transaction and is
> > >>>>>> probably a symptom and not the issue (which is above and not shown).
> > >>>>>>
> > >>>>>
> > >>>>> Unfortunately I dont see anything in the kernel log. The console output is all I get,
> > >>>>> thats why I made the photo.
> > >>>>
> > >>>> To clarify, are you saying nothing above this line gets printed? Because
> > >>>> the part of the log you've posted in the picture is pretty much
> > >>>> unworkable:
> > >>>>
> > >>>> [ 99.375389] [<bf0dc56c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863ca0>] (platform_drv_shutdown+0x2c/0x30)
> > >>>>
> > >>>> How do you access the device's serial console? Use a program with a
> > >>>> scrollback buffer like GNU screen or something.
> > >>>>
> > >>>
> > >>> Ah no, this is not over a serial console. This is what I see via hdmi. I do not have a working serial connection yet.
> > >>> Sorry I know this trace part is not very useful, I will try to get a full dump.
> > >>
> > >> Lino, are you going to provide a kernel output so I could look at your new breakage?
> > >> If you could set up a pstore logger with a ramoops region, you could
> > >> dump the log after the fact. Or if HDMI is all you have, you could use
> > >> an HDMI capture card to record it. Or just record the screen you're
> > >> looking at, as long as you don't have very shaky hands, whatever...
> > >>
> > >
> > > Yes, I will try to get something useful. I have already set up a serial connection
> > > now. I still see the shutdown stopping with your patch but I have not seen the
> > > kernel dump any more. I will try further and provide a dump as soon as I am successful.
> > >
> >
> > Sorry for the delay. I was finally able to do some tests and get a dump via the serial console.
> > I tested with the latest Raspberry Pi kernel 5.10.y. Based on commit
> > 4117cba235d24a7c4630dc38cb55cc80a04f5cf3. I applied your patches and got the following result
> > at shutdown:
> >
> > raspberrypi login: [ 58.754533] ------------[ cut here ]------------
> > [ 58.760053] kernel BUG at drivers/net/phy/mdio_bus.c:651!
> > [ 58.766361] Internal error: Oops - BUG: 0 [#1] SMP ARM
> > [ 58.772376] Modules linked in: 8021q garp at24 tag_ksz tpm_tis_spi ksz9477_spi tpm_tis_core ksz9477 ksz_common tpm rts
> > [ 58.837539] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G C 5.10.63-RP_PURE_510_VLADFIX+ #3
> > [ 58.848388] Hardware name: BCM2711
> > [ 58.852875] PC is at mdiobus_free+0x4c/0x50
> > [ 58.858143] LR is at devm_mdiobus_free+0x1c/0x20
> > [ 58.863853] pc : [<c08c9218>] lr : [<c08c1898>] psr: 80000013
> > [ 58.871212] sp : c18fdc38 ip : c18fdc48 fp : c18fdc44
> > [ 58.877505] r10: 00000000 r9 : c0867104 r8 : c18fdc5c
> > [ 58.883823] r7 : 00000013 r6 : c31c8000 r5 : c3a50000 r4 : c379db80
> > [ 58.891442] r3 : c2ab4000 r2 : 00000002 r1 : c379dbc0 r0 : c2ab4000
> > [ 58.899037] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> > [ 58.907297] Control: 30c5383d Table: 03ac92c0 DAC: 55555555
> > [ 58.914139] Process systemd-shutdow (pid: 1, stack limit = 0xff8113c1)
> > [ 58.921774] Stack: (0xc18fdc38 to 0xc18fe000)
> > [ 58.927285] dc20: c18fdc54 c18fdc48
> > [ 58.936601] dc40: c08c1898 c08c91d8 c18fdc94 c18fdc58 c0866dac c08c1888 c31c819c c3527180
> > [ 58.945921] dc60: c332d200 c1405048 c32f8800 c31c8000 00000000 bf191010 00000000 c32f8800
> > [ 58.955289] dc80: c1095f3c c1aa6454 c18fdcac c18fdc98 c086715c c0866be8 c31c8000 00000000
> > [ 58.964644] dca0: c18fdccc c18fdcb0 c0862c7c c0867128 c1a42e30 c31c8000 c14f7cf0 00000000
> > [ 58.974018] dcc0: c18fdcdc c18fdcd0 c0862d40 c0862b68 c18fdcfc c18fdce0 c08613dc c0862d2c
> > [ 58.983391] dce0: c31c8000 00000a68 c08ba6cc 00000000 c18fdd44 c18fdd00 c085c710 c086130c
> > [ 58.992778] dd00: c0331394 c0332604 60000013 c18fdd74 c3656294 c1405048 c31c8000 c31c8000
> > [ 59.002140] dd20: 00000000 c08ba6cc c160657c c155c018 c1095f3c c1aa6454 c18fdd5c c18fdd48
> > [ 59.011521] dd40: c08ba6a8 c085c58c 00000000 00000000 c18fdd6c c18fdd60 c08ba6e4 c08ba670
> > [ 59.020921] dd60: c18fdd9c c18fdd70 c085bc84 c08ba6d8 c18fdd8c c3656200 c3656394 c1405048
> > [ 59.030334] dd80: c18fdda4 c32f8800 c32f8800 00000003 c18fddbc c18fdda0 c08bab7c c085bc20
> > [ 59.039737] dda0: c32f8b80 c32f8800 00000000 c160657c c18fdddc c18fddc0 bf182554 c08bab4c
> > [ 59.049164] ddc0: c1aa6400 c1a6e810 c1aa6410 c160657c c18fddf4 c18fdde0 bf1825a8 bf18252c
> > [ 59.058602] dde0: c1aa6414 c1a6e810 c18fde04 c18fddf8 c0863dec bf182598 c18fde3c c18fde08
> > [ 59.068057] de00: c085fd9c c0863dcc c18fde3c c1095f2c c024865c 00000000 00000000 620bef00
> > [ 59.077487] de20: c140f510 fee1dead c18fc000 00000058 c18fde4c c18fde40 c0249c84 c085fc0c
> > [ 59.086920] de40: c18fde64 c18fde50 c0249d74 c0249c4c 01234567 00000000 c18fdf94 c18fde68
> > [ 59.096386] de60: c024a018 c0249d64 c18fded4 c31b0c00 00000024 c18fdf58 00000005 c0441cec
> > [ 59.105852] de80: c18fdec4 c18fde90 c0441b30 c049852c 00000000 c18fdea0 c073ad04 00000024
> > [ 59.115330] dea0: c31b0c00 c18fdf58 c18fded4 c31b0c00 00000005 00000000 c18fdf4c c18fdec8
> > [ 59.124821] dec0: c0441cec c0425cb0 c18fded0 c18fded4 00000000 00000005 00000000 00000024
> > [ 59.134317] dee0: c18fdeec 00000005 c0200074 bec45250 00000004 bec45f62 00000010 bec45264
> > [ 59.143792] df00: 00000005 bec4531c 0000000a b6d10040 00000001 c0200e70 ffffe000 c1546a80
> > [ 59.153282] df20: 00000000 c0467268 c18fdf4c c1405048 c31b0c00 bec4528c 00000000 00000000
> > [ 59.162787] df40: c18fdf94 c18fdf50 c0441e6c c0441c50 00000000 00000000 00000000 00000000
> > [ 59.172269] df60: c18fdf94 c1405048 c0331394 c1405048 bec4531c 00000000 00000000 00000000
> > [ 59.181763] df80: 00000058 c0200204 c18fdfa4 c18fdf98 c024a16c c0249f10 00000000 c18fdfa8
> > [ 59.191250] dfa0: c0200040 c024a160 00000000 00000000 fee1dead 28121969 01234567 620bef00
> > [ 59.200735] dfc0: 00000000 00000000 00000000 00000058 00000fff bec45be8 00000000 00476b80
> > [ 59.210245] dfe0: 00488e3c bec45b68 004734a8 b6e4ca38 60000010 fee1dead 00000000 00000000
> > [ 59.219759] Backtrace:
> > [ 59.223546] [<c08c91cc>] (mdiobus_free) from [<c08c1898>] (devm_mdiobus_free+0x1c/0x20)
> > [ 59.232909] [<c08c187c>] (devm_mdiobus_free) from [<c0866dac>] (release_nodes+0x1d0/0x220)
> > [ 59.242551] [<c0866bdc>] (release_nodes) from [<c086715c>] (devres_release_all+0x40/0x60)
> > [ 59.252132] r10:c1aa6454 r9:c1095f3c r8:c32f8800 r7:00000000 r6:bf191010 r5:00000000
> > [ 59.261338] r4:c31c8000
> > [ 59.265239] [<c086711c>] (devres_release_all) from [<c0862c7c>] (device_release_driver_internal+0x120/0x1c4)
> > [ 59.276479] r5:00000000 r4:c31c8000
> > [ 59.281440] [<c0862b5c>] (device_release_driver_internal) from [<c0862d40>] (device_release_driver+0x20/0x24)
> > [ 59.292802] r7:00000000 r6:c14f7cf0 r5:c31c8000 r4:c1a42e30
> > [ 59.299900] [<c0862d20>] (device_release_driver) from [<c08613dc>] (bus_remove_device+0xdc/0x108)
> > [ 59.310267] [<c0861300>] (bus_remove_device) from [<c085c710>] (device_del+0x190/0x428)
> > [ 59.319748] r7:00000000 r6:c08ba6cc r5:00000a68 r4:c31c8000
> > [ 59.326896] [<c085c580>] (device_del) from [<c08ba6a8>] (spi_unregister_device+0x44/0x68)
> > [ 59.336583] r10:c1aa6454 r9:c1095f3c r8:c155c018 r7:c160657c r6:c08ba6cc r5:00000000
> > [ 59.345924] r4:c31c8000
> > [ 59.349971] [<c08ba664>] (spi_unregister_device) from [<c08ba6e4>] (__unregister+0x18/0x20)
> > [ 59.359870] r5:00000000 r4:00000000
> > [ 59.364972] [<c08ba6cc>] (__unregister) from [<c085bc84>] (device_for_each_child+0x70/0xb4)
> > [ 59.374899] [<c085bc14>] (device_for_each_child) from [<c08bab7c>] (spi_unregister_controller+0x3c/0x128)
> > [ 59.385979] r6:00000003 r5:c32f8800 r4:c32f8800
> > [ 59.392086] [<c08bab40>] (spi_unregister_controller) from [<bf182554>] (bcm2835_spi_remove+0x34/0x6c [spi_bcm2835])
> > [ 59.404000] r7:c160657c r6:00000000 r5:c32f8800 r4:c32f8b80
> > [ 59.411084] [<bf182520>] (bcm2835_spi_remove [spi_bcm2835]) from [<bf1825a8>] (bcm2835_spi_shutdown+0x1c/0x38 [spi_bc)
> > [ 59.423755] r7:c160657c r6:c1aa6410 r5:c1a6e810 r4:c1aa6400
> > [ 59.430847] [<bf18258c>] (bcm2835_spi_shutdown [spi_bcm2835]) from [<c0863dec>] (platform_drv_shutdown+0x2c/0x30)
> > [ 59.442613] r5:c1a6e810 r4:c1aa6414
> > [ 59.447635] [<c0863dc0>] (platform_drv_shutdown) from [<c085fd9c>] (device_shutdown+0x19c/0x24c)
> > [ 59.457932] [<c085fc00>] (device_shutdown) from [<c0249c84>] (kernel_restart_prepare+0x44/0x48)
> > [ 59.468135] r10:00000058 r9:c18fc000 r8:fee1dead r7:c140f510 r6:620bef00 r5:00000000
> > [ 59.477470] r4:00000000
> > [ 59.481509] [<c0249c40>] (kernel_restart_prepare) from [<c0249d74>] (kernel_restart+0x1c/0x60)
> > [ 59.491653] [<c0249d58>] (kernel_restart) from [<c024a018>] (__do_sys_reboot+0x114/0x1f8)
> > [ 59.501359] r5:00000000 r4:01234567
> > [ 59.506447] [<c0249f04>] (__do_sys_reboot) from [<c024a16c>] (sys_reboot+0x18/0x1c)
> > [ 59.515628] r8:c0200204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
> > [ 59.523857] [<c024a154>] (sys_reboot) from [<c0200040>] (ret_fast_syscall+0x0/0x28)
> > [ 59.533038] Exception stack(0xc18fdfa8 to 0xc18fdff0)
> > [ 59.539607] dfa0: 00000000 00000000 fee1dead 28121969 01234567 620bef00
> > [ 59.549318] dfc0: 00000000 00000000 00000000 00000058 00000fff bec45be8 00000000 00476b80
> > [ 59.559026] dfe0: 00488e3c bec45b68 004734a8 b6e4ca38
> > [ 59.565596] Code: ebfe49f5 e89da800 ebed72a3 e89da800 (e7f001f2)
> > [ 59.573246] ---[ end trace 7d800ce7b5664bb6 ]---
> > [ 59.579413] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> > [ 59.588634] Rebooting in 10 seconds..
> >
> > The concerning source code line 651 is in my case:
> >
> > void mdiobus_free(struct mii_bus *bus)
> > {
> > /* For compatibility with error handling in drivers. */
> > if (bus->state == MDIOBUS_ALLOCATED) {
> > kfree(bus);
> > return;
> > }
> >
> > 651< BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
> > bus->state = MDIOBUS_RELEASED;
> >
> > put_device(&bus->dev);
> > }
> > EXPORT_SYMBOL(mdiobus_free);
> >
> > I tested with both versions of your patchset, with the same result. I also tested
> > with a RP 5.14 kernel (the latest RP kernel) but I did not see the original issue
> > (i.e. the system hang) here for some reason.
> >
> > I then tried to get the net-next kernel running on my system but without success so far. So for
> > now the result with the RP 5.10 is all I can offer. I hope that helps a bit nevertheless.
>
> Thank you Lino, this is a very valuable report. I will send a v3 soon (not sure if today).
Actually, no, I will not send a v3, because the fact that devres can now
call devm_mdiobus_free without devm_mdiobus_unregister has nothing to do
with this series and touches much more than DSA, it is an issue introduced by
commit ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()").
I will deal with it separately, the basic idea being that
devm_mdiobus_alloc + plain mdiobus_register is bonkers, and Bartosz
didn't care enough to fix the existing users of that pattern before he
just went ahead to make _devm_mdiobus_free stop calling mdiobus_unregister.
In fact, this patch really got me wondering at the time, was the rtl8366
driver so broken at the time?
https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
No, it stems from the exact same patch from Bartosz, devres used to
unregister MDIO buses when freeing them, now it isn't.
So while I don't disagree with Bartosz' overall idea, the execution kinda sucks.
In your case, what happens is that your driver's ->shutdown method gets
called, and this (as discussed) means that your driver's ->remove method
will be a no-op (to avoid unregistering DSA structures twice). But since
SPI bus drivers which implement their own ->shutdown as ->remove are a
thing, the fact that you don't run anything on ->remove means you won't
unregister an MDIO bus structure that was allocated under devres. Even
worse, the ksz9477_spi driver does _not_ allocate any MDIO bus structure
under devres, but the DSA core does, it's that pesky ds->slave_mii_bus
for which DSA is too helpful for its own good. Unregistering that MDIO
bus happens only from the dsa_unregister_switch call path, not from
dsa_switch_shutdown. But even if your SPI device driver does a no-op on
->remove, it doesn't mean the device_del on it won't be called. And we
can't stop the devres callbacks from running, which inevitably means
that an MDIO bus structure which is still registered will get freed.
My personal conclusion is that either you go with devres all the way
(devm_mdiobus_alloc + devm_mdiobus_register) or with no devres all the
way (mdiobus_alloc + mdiobus_register), otherwise you've just signed up
to learning things you never really wanted to learn. The pattern of
devres for the mdiobus_alloc and then no devres for the mdiobus_register
is very old, which explains why devm_mdiobus_register was only added
very recently (between kernel v5.7 and v5.8). It used to be fine in the
sense that it worked, but now it's completely broken. Some poor soul
needs to audit that pattern across the whole kernel after Bartosz'
patch, and it looks like that somebody is me...