2022-06-01 20:34:14

by H. Nikolaus Schaller

[permalink] [raw]
Subject: BUG in mmc: core: Disable card detect during shutdown

Hi Ulf,
users did report a strange issue that the OMAP5 based Pyra does not
shutdown if a kernel 5.10.116 is used.

Someone did a bisect and found that reverting

0d66b395210c5084c2b7324945062c1d1f95487a

resp. upstream

66c915d09b942fb3b2b0cb2f56562180901fba17

solves it.

I could now confirm that it also happens with v5.18.0.
But interestingly only on the Pyra handheld device and not
on the omap5evm (which is supported by mainline).

The symptom is:

a) without revert

root@letux:~# poweroff

Broadcast message from root@letux (console) (Sat Jan 1 01:08:25 2000):

The system is going down for system halt NOW!
INIT: Sending processes the TERM signal
root@letux:~# [info] Using makefile-style concurrent boot in runlevel 0.
[....] Stopping cgroup management proxy daemon: cgproxy[....] Stopping cgroup management daemon: cgmanager[....] Stop[ ok bluetooth: /usr/sbin/bluetoothd.
[FAIL] Stopping ISC DHCP server: dhcpd failed!
dhcpcd[3055]: sending signal 15 to pid 2976
dhcpcd[3055]: waiting for pid 2976 to exit
[ ok ] Shutting down ALSA...done.
[ ok ] Asking all remaining processes to terminate...done.
[ ok ] All processes ended within 2 seconds...done.
[ ok [[c[....] Stopping enhanced syslogd: rsyslogd.
[ ok ....] Deconfiguring network interfaces...done.
^[[c[info] Saving the system clock.
[info] Hardware Clock updated to Sat Jan 1 01:08:30 UTC 2000.
[ ok ] Deactivating swap...done.
^[[c[ 77.289332] EXT4-fs (mmcblk0p2): re-mounted. Quota mode: none.
[info] Will now halt.

b) with reverting your patch

root@letux:~# uname -a
Linux letux 5.18.0-letux-lpae+ #9678 SMP PREEMPT Mon May 30 18:02:28 CEST 2022 armv7l GNU/Linux
root@letux:~# poweroff

Broadcast message from root@letux (console) (Sat Jan 1 01:39:15 2000):

The system is going down for system halt NOW!
INIT: Sending processes the TERM signal
root@letux:~# [info] Using makefile-style concurrent boot in runlevel 0.
[FAIL] Stopping cgroup management proxy daemon: cgproxy[....] Stopping ISC DHCP server: dhcpd failed!
[....] Stopping cgroup management daemon: cgmanagerdhcpcd[3100]: sending signal 15 to pid 3013
dhcpcd[3100]: waiting for pid 3013 to exit
[ ok ] Stopping bluetooth: /usr/sbin/bluetoothd.
[ ok ] Shutting down ALSA...done.
[ ok ] Asking all remaining processes to terminate...done.
[ ok ] All processes ended within 3 seconds...done.
[ ok [[c[....] Stopping enhanced syslogd: rsyslogd.
[ ok ....] Deconfiguring network interfaces...done.
^[[c[info] Saving the system clock.
[info] Hardware Clock updated to Sat Jan 1 01:39:21 UTC 2000.
[ ok ] Deactivating swap...done.
^[[c[ 44.563256] EXT4-fs (mmcblk0p2): re-mounted. Quota mode: none.
[info] Will now halt.
[ 46.917534] reboot: Power down


What I suspect is that we have multiple mmc interfaces and have
card detect wired up in the Pyra while it is ignored in the
EVM. Is it possible that __mmc_stop_host() never returns in
.shutdown_pre if card detect is set up (and potentially
shut down earlier)?

Setup of mmc is done in omap5-board-common.dtsi and omap5.dtsi.

Out Pyra has a non-upstream device tree where we use
omap5-board-common.dtsi and overwrite it by e.g.

&mmc4 { /* second (u)SD slot (SDIO capable) */
status = "okay";
vmmc-supply = <&ldo2_reg>;
pinctrl-names = "default";
pinctrl-0 = <&mmc4_pins>;
bus-width = <4>;
cd-gpios = <&gpio3 13 GPIO_ACTIVE_LOW>; /* gpio3_77 */
wp-gpios = <&gpio3 15 GPIO_ACTIVE_HIGH>; /* gpio3_79 */
};

But I have tried to remove the cd-gpois and wp-gpois. Or make the
mmc interface being disabled (but I may not have catched everything
in first place).

Then I added some printk to mmc_stop_host() and __mmc_stop_host().

mmc_stop_host() is not called but __mmc_stop_host() is called 4 times.
There are 4 active MMC interfaces in the Pyra - 3 for (µ)SD slots
and one for an SDIO WLAN module.

Now it looks as if 3 of them are properly teared down (two of them
seem to have host->slot.cd_irq >= 0) but on the fourth call
cancel_delayed_work_sync(&host->detect); does not return. This is
likely the location of the stall why we don't see a "reboot: Power down"

Any ideas?

BR and thanks,
Nikolaus

printk hack:

void __mmc_stop_host(struct mmc_host *host)
{
printk("%s 1\n", __func__);
if (host->slot.cd_irq >= 0) {
printk("%s 2\n", __func__);
mmc_gpio_set_cd_wake(host, false);
printk("%s 3\n", __func__);
disable_irq(host->slot.cd_irq);
printk("%s 4\n", __func__);
}

host->rescan_disable = 1;
printk("%s 5\n", __func__);
cancel_delayed_work_sync(&host->detect);
printk("%s 6\n", __func__);
}

resulting log:

[info] Will now halt.
[ 282.780929] __mmc_stop_host 1
[ 282.784276] __mmc_stop_host 2
[ 282.787735] __mmc_stop_host 3
[ 282.791030] __mmc_stop_host 4
[ 282.794235] __mmc_stop_host 5
[ 282.797369] __mmc_stop_host 6
[ 282.800918] __mmc_stop_host 1
[ 282.804269] __mmc_stop_host 5
[ 282.807541] __mmc_stop_host 6
[ 282.810715] __mmc_stop_host 1
[ 282.813842] __mmc_stop_host 2
[ 282.816984] __mmc_stop_host 3
[ 282.820175] __mmc_stop_host 4
[ 282.823302] __mmc_stop_host 5
[ 282.826449] __mmc_stop_host 6
[ 282.830941] __mmc_stop_host 1
[ 282.834076] __mmc_stop_host 5

--- here should be another __mmc_stop_host 6
--- and reboot: Power down


2022-06-06 05:55:20

by H. Nikolaus Schaller

[permalink] [raw]
Subject: Re: BUG in mmc: core: Disable card detect during shutdown

Hi,

> Am 03.06.2022 um 12:46 schrieb Ulf Hansson <[email protected]>:
>
> On Mon, 30 May 2022 at 18:55, H. Nikolaus Schaller <[email protected]> wrote:
>>
>> Hi Ulf,
>> users did report a strange issue that the OMAP5 based Pyra does not
>> shutdown if a kernel 5.10.116 is used.
>>

...

>> mmc_stop_host() is not called but __mmc_stop_host() is called 4 times.
>> There are 4 active MMC interfaces in the Pyra - 3 for (µ)SD slots
>> and one for an SDIO WLAN module.
>>
>> Now it looks as if 3 of them are properly teared down (two of them
>> seem to have host->slot.cd_irq >= 0) but on the fourth call
>> cancel_delayed_work_sync(&host->detect); does not return. This is
>> likely the location of the stall why we don't see a "reboot: Power down"
>>
>> Any ideas?
>
> I guess the call to cancel_delayed_work_sync() in __mmc_stop_host()
> hangs for one of the mmc hosts. This shouldn't happen - and indicates
> that there is something else being wrong.

Yes, you were right...

>
> See more suggestions below.
>
>>
>> BR and thanks,
>> Nikolaus
>>
>> printk hack:
>>
>> void __mmc_stop_host(struct mmc_host *host)
>> {
>> printk("%s 1\n", __func__);
>> if (host->slot.cd_irq >= 0) {
>> printk("%s 2\n", __func__);
>> mmc_gpio_set_cd_wake(host, false);
>> printk("%s 3\n", __func__);
>> disable_irq(host->slot.cd_irq);
>> printk("%s 4\n", __func__);
>> }
>>
>> host->rescan_disable = 1;
>> printk("%s 5\n", __func__);
>
> My guess is that it's the same mmc host that causes the hang. I
> suggest you print the name of the host too, to verify that. Something
> along the lines of the below.
>
> printk("%s: %s 5\n", mmc_hostname(host), __func__);

To my surprise, this did report an mmc6 host port where the OMAP5 only has 4...

Yes, we have a special driver for the txs02612 sdio switch and voltage translator
chip to make two ports out of the single mmc2 port of the OMAP5 SoC.

This driver was begun ca. 7 years ago but never finished...

The idea is to make a mmc port have several subports. For the Pyra handheld hardware
we needed 5 mmc/sdio interfaces but the omap5 only has 4 of them available to us.

So the txs02612 drivers is sitting between the omap5 mmc2 host pins and switches
between an µSD slot and an eMMC.

Therefore, the driver is a mmc client driver (like e.g. the driver of some WiFi chip
connected to some SDIO port) and provides multiple mmc host interfaces.

It should intercept data transfer requests to its multiple mmc hosts, synchronize
(or enqueue) them, control the switch gpio and forward requests to the processor's
mmc host port so that they are processed (after switching).

We never continued to make this work...

What remained is simple code to manually throw the switch through some /sysfs
control file after doing an eject and before a fresh partprobe.

Still, the probe function of the txs02612 driver does two calls to mmc_add_host().
These seem to make

>
>> cancel_delayed_work_sync(&host->detect);

get stuck. Most likely because the initialization is not complete for handling
card detection.

>>
>> --- here should be another __mmc_stop_host 6
>> --- and reboot: Power down
>
> When/if you figured out that it's the same host that hangs, you could
> try to disable that host through the DTS files (add status =
> "disabled" in the device node, for example) - and see if that works.

When not calling mmc_add_host() in our txs02612 driver fragment we can
properly shut down the OMAP5. That is the solution with the least efforts.
The other would be to make the txs02612 properly work...

So in summary there is no bug upstream. It is in our tree.

If you are interested in how our code fragment for the txs02612 looks like:

https://git.goldelico.com/?p=letux-kernel.git;a=shortlog;h=refs/heads/letux/txs02612

Maybe you have some suggestions to make it work?

>
> Kind regards
> Uffe

BR and thanks,
Nikolaus

2022-06-06 06:05:42

by Ulf Hansson

[permalink] [raw]
Subject: Re: BUG in mmc: core: Disable card detect during shutdown

On Mon, 30 May 2022 at 18:55, H. Nikolaus Schaller <[email protected]> wrote:
>
> Hi Ulf,
> users did report a strange issue that the OMAP5 based Pyra does not
> shutdown if a kernel 5.10.116 is used.
>
> Someone did a bisect and found that reverting
>
> 0d66b395210c5084c2b7324945062c1d1f95487a
>
> resp. upstream
>
> 66c915d09b942fb3b2b0cb2f56562180901fba17
>
> solves it.
>
> I could now confirm that it also happens with v5.18.0.
> But interestingly only on the Pyra handheld device and not
> on the omap5evm (which is supported by mainline).
>
> The symptom is:
>
> a) without revert
>
> root@letux:~# poweroff
>
> Broadcast message from root@letux (console) (Sat Jan 1 01:08:25 2000):
>
> The system is going down for system halt NOW!
> INIT: Sending processes the TERM signal
> root@letux:~# [info] Using makefile-style concurrent boot in runlevel 0.
> [....] Stopping cgroup management proxy daemon: cgproxy[....] Stopping cgroup management daemon: cgmanager[....] Stop[ ok bluetooth: /usr/sbin/bluetoothd.
> [FAIL] Stopping ISC DHCP server: dhcpd failed!
> dhcpcd[3055]: sending signal 15 to pid 2976
> dhcpcd[3055]: waiting for pid 2976 to exit
> [ ok ] Shutting down ALSA...done.
> [ ok ] Asking all remaining processes to terminate...done.
> [ ok ] All processes ended within 2 seconds...done.
> [ ok [[c[....] Stopping enhanced syslogd: rsyslogd.
> [ ok ....] Deconfiguring network interfaces...done.
> ^[[c[info] Saving the system clock.
> [info] Hardware Clock updated to Sat Jan 1 01:08:30 UTC 2000.
> [ ok ] Deactivating swap...done.
> ^[[c[ 77.289332] EXT4-fs (mmcblk0p2): re-mounted. Quota mode: none.
> [info] Will now halt.
>
> b) with reverting your patch
>
> root@letux:~# uname -a
> Linux letux 5.18.0-letux-lpae+ #9678 SMP PREEMPT Mon May 30 18:02:28 CEST 2022 armv7l GNU/Linux
> root@letux:~# poweroff
>
> Broadcast message from root@letux (console) (Sat Jan 1 01:39:15 2000):
>
> The system is going down for system halt NOW!
> INIT: Sending processes the TERM signal
> root@letux:~# [info] Using makefile-style concurrent boot in runlevel 0.
> [FAIL] Stopping cgroup management proxy daemon: cgproxy[....] Stopping ISC DHCP server: dhcpd failed!
> [....] Stopping cgroup management daemon: cgmanagerdhcpcd[3100]: sending signal 15 to pid 3013
> dhcpcd[3100]: waiting for pid 3013 to exit
> [ ok ] Stopping bluetooth: /usr/sbin/bluetoothd.
> [ ok ] Shutting down ALSA...done.
> [ ok ] Asking all remaining processes to terminate...done.
> [ ok ] All processes ended within 3 seconds...done.
> [ ok [[c[....] Stopping enhanced syslogd: rsyslogd.
> [ ok ....] Deconfiguring network interfaces...done.
> ^[[c[info] Saving the system clock.
> [info] Hardware Clock updated to Sat Jan 1 01:39:21 UTC 2000.
> [ ok ] Deactivating swap...done.
> ^[[c[ 44.563256] EXT4-fs (mmcblk0p2): re-mounted. Quota mode: none.
> [info] Will now halt.
> [ 46.917534] reboot: Power down
>
>
> What I suspect is that we have multiple mmc interfaces and have
> card detect wired up in the Pyra while it is ignored in the
> EVM. Is it possible that __mmc_stop_host() never returns in
> .shutdown_pre if card detect is set up (and potentially
> shut down earlier)?
>
> Setup of mmc is done in omap5-board-common.dtsi and omap5.dtsi.
>
> Out Pyra has a non-upstream device tree where we use
> omap5-board-common.dtsi and overwrite it by e.g.
>
> &mmc4 { /* second (u)SD slot (SDIO capable) */
> status = "okay";
> vmmc-supply = <&ldo2_reg>;
> pinctrl-names = "default";
> pinctrl-0 = <&mmc4_pins>;
> bus-width = <4>;
> cd-gpios = <&gpio3 13 GPIO_ACTIVE_LOW>; /* gpio3_77 */
> wp-gpios = <&gpio3 15 GPIO_ACTIVE_HIGH>; /* gpio3_79 */
> };
>
> But I have tried to remove the cd-gpois and wp-gpois. Or make the
> mmc interface being disabled (but I may not have catched everything
> in first place).
>
> Then I added some printk to mmc_stop_host() and __mmc_stop_host().
>
> mmc_stop_host() is not called but __mmc_stop_host() is called 4 times.
> There are 4 active MMC interfaces in the Pyra - 3 for (µ)SD slots
> and one for an SDIO WLAN module.
>
> Now it looks as if 3 of them are properly teared down (two of them
> seem to have host->slot.cd_irq >= 0) but on the fourth call
> cancel_delayed_work_sync(&host->detect); does not return. This is
> likely the location of the stall why we don't see a "reboot: Power down"
>
> Any ideas?

I guess the call to cancel_delayed_work_sync() in __mmc_stop_host()
hangs for one of the mmc hosts. This shouldn't happen - and indicates
that there is something else being wrong.

See more suggestions below.

>
> BR and thanks,
> Nikolaus
>
> printk hack:
>
> void __mmc_stop_host(struct mmc_host *host)
> {
> printk("%s 1\n", __func__);
> if (host->slot.cd_irq >= 0) {
> printk("%s 2\n", __func__);
> mmc_gpio_set_cd_wake(host, false);
> printk("%s 3\n", __func__);
> disable_irq(host->slot.cd_irq);
> printk("%s 4\n", __func__);
> }
>
> host->rescan_disable = 1;
> printk("%s 5\n", __func__);

My guess is that it's the same mmc host that causes the hang. I
suggest you print the name of the host too, to verify that. Something
along the lines of the below.

printk("%s: %s 5\n", mmc_hostname(host), __func__);

> cancel_delayed_work_sync(&host->detect);
> printk("%s 6\n", __func__);

Ditto.

> }
>
> resulting log:
>
> [info] Will now halt.
> [ 282.780929] __mmc_stop_host 1
> [ 282.784276] __mmc_stop_host 2
> [ 282.787735] __mmc_stop_host 3
> [ 282.791030] __mmc_stop_host 4
> [ 282.794235] __mmc_stop_host 5
> [ 282.797369] __mmc_stop_host 6
> [ 282.800918] __mmc_stop_host 1
> [ 282.804269] __mmc_stop_host 5
> [ 282.807541] __mmc_stop_host 6
> [ 282.810715] __mmc_stop_host 1
> [ 282.813842] __mmc_stop_host 2
> [ 282.816984] __mmc_stop_host 3
> [ 282.820175] __mmc_stop_host 4
> [ 282.823302] __mmc_stop_host 5
> [ 282.826449] __mmc_stop_host 6
> [ 282.830941] __mmc_stop_host 1
> [ 282.834076] __mmc_stop_host 5
>
> --- here should be another __mmc_stop_host 6
> --- and reboot: Power down

When/if you figured out that it's the same host that hangs, you could
try to disable that host through the DTS files (add status =
"disabled" in the device node, for example) - and see if that works.

Kind regards
Uffe

2022-06-08 05:38:03

by Ulf Hansson

[permalink] [raw]
Subject: Re: BUG in mmc: core: Disable card detect during shutdown

On Sat, 4 Jun 2022 at 12:16, H. Nikolaus Schaller <[email protected]> wrote:
>
> Hi,
>
> > Am 03.06.2022 um 12:46 schrieb Ulf Hansson <[email protected]>:
> >
> > On Mon, 30 May 2022 at 18:55, H. Nikolaus Schaller <[email protected]> wrote:
> >>
> >> Hi Ulf,
> >> users did report a strange issue that the OMAP5 based Pyra does not
> >> shutdown if a kernel 5.10.116 is used.
> >>
>
> ...
>
> >> mmc_stop_host() is not called but __mmc_stop_host() is called 4 times.
> >> There are 4 active MMC interfaces in the Pyra - 3 for (µ)SD slots
> >> and one for an SDIO WLAN module.
> >>
> >> Now it looks as if 3 of them are properly teared down (two of them
> >> seem to have host->slot.cd_irq >= 0) but on the fourth call
> >> cancel_delayed_work_sync(&host->detect); does not return. This is
> >> likely the location of the stall why we don't see a "reboot: Power down"
> >>
> >> Any ideas?
> >
> > I guess the call to cancel_delayed_work_sync() in __mmc_stop_host()
> > hangs for one of the mmc hosts. This shouldn't happen - and indicates
> > that there is something else being wrong.
>
> Yes, you were right...
>
> >
> > See more suggestions below.
> >
> >>
> >> BR and thanks,
> >> Nikolaus
> >>
> >> printk hack:
> >>
> >> void __mmc_stop_host(struct mmc_host *host)
> >> {
> >> printk("%s 1\n", __func__);
> >> if (host->slot.cd_irq >= 0) {
> >> printk("%s 2\n", __func__);
> >> mmc_gpio_set_cd_wake(host, false);
> >> printk("%s 3\n", __func__);
> >> disable_irq(host->slot.cd_irq);
> >> printk("%s 4\n", __func__);
> >> }
> >>
> >> host->rescan_disable = 1;
> >> printk("%s 5\n", __func__);
> >
> > My guess is that it's the same mmc host that causes the hang. I
> > suggest you print the name of the host too, to verify that. Something
> > along the lines of the below.
> >
> > printk("%s: %s 5\n", mmc_hostname(host), __func__);
>
> To my surprise, this did report an mmc6 host port where the OMAP5 only has 4...
>
> Yes, we have a special driver for the txs02612 sdio switch and voltage translator
> chip to make two ports out of the single mmc2 port of the OMAP5 SoC.
>
> This driver was begun ca. 7 years ago but never finished...
>
> The idea is to make a mmc port have several subports. For the Pyra handheld hardware
> we needed 5 mmc/sdio interfaces but the omap5 only has 4 of them available to us.
>
> So the txs02612 drivers is sitting between the omap5 mmc2 host pins and switches
> between an µSD slot and an eMMC.
>
> Therefore, the driver is a mmc client driver (like e.g. the driver of some WiFi chip
> connected to some SDIO port) and provides multiple mmc host interfaces.
>
> It should intercept data transfer requests to its multiple mmc hosts, synchronize
> (or enqueue) them, control the switch gpio and forward requests to the processor's
> mmc host port so that they are processed (after switching).
>
> We never continued to make this work...

Well, I can imagine that it's just very difficult to make this work properly.

Moreover, the mmc core and its block layer code isn't designed to
support this type of configuration. For example, the I/O scheduling
can't work with this setup.

>
> What remained is simple code to manually throw the switch through some /sysfs
> control file after doing an eject and before a fresh partprobe.
>
> Still, the probe function of the txs02612 driver does two calls to mmc_add_host().
> These seem to make
>
> >
> >> cancel_delayed_work_sync(&host->detect);
>
> get stuck. Most likely because the initialization is not complete for handling
> card detection.
>
> >>
> >> --- here should be another __mmc_stop_host 6
> >> --- and reboot: Power down
> >
> > When/if you figured out that it's the same host that hangs, you could
> > try to disable that host through the DTS files (add status =
> > "disabled" in the device node, for example) - and see if that works.
>
> When not calling mmc_add_host() in our txs02612 driver fragment we can
> properly shut down the OMAP5. That is the solution with the least efforts.
> The other would be to make the txs02612 properly work...
>
> So in summary there is no bug upstream. It is in our tree.

Thanks for sharing the details.

>
> If you are interested in how our code fragment for the txs02612 looks like:
>
> https://git.goldelico.com/?p=letux-kernel.git;a=shortlog;h=refs/heads/letux/txs02612
>
> Maybe you have some suggestions to make it work?

Sorry, but I have lots of things to do at this point, maybe some other time.

Kind regards
Uffe