Greetings,
I've encountered a hang on shutdown on octeontx (CN8030 SoC, THUNDERX
architecture) that I bisected to commit 66c915d09b94 ("mmc: core:
Disable card detect during shutdown").
It looks like the OMAP5 Pyra ran into this as well, related to a
malfunctioning driver [1].
In the case of MMC_CAVIUM_THUNDERX the host controller supports
multiple slots each having their own CMD signal but shared clk/data
via the following dt:
mmc@1,4 {
    compatible = "cavium,thunder-8890-mmc";
    reg = <0xc00 0x00 0x00 0x00 0x00>;
    #address-cells = <0x01>;
    #size-cells = <0x00>;
    clocks = <0x0b>;

    /* eMMC */
    mmc-slot@0 {
        compatible = "mmc-slot";
        reg = <0>;
        vmmc-supply = <&mmc_supply_3v3>;
        max-frequency = <35000000>;
        no-1-8-v;
        bus-width = <8>;
        no-sdio;
        no-sd;
        mmc-ddr-3_3v;
        cap-mmc-highspeed;
    };

    /* microSD */
    mmc-slot@1 {
        compatible = "mmc-slot";
        reg = <1>;
        vmmc-supply = <&mmc_supply_3v3>;
        max-frequency = <35000000>;
        no-1-8-v;
        broken-cd;
        bus-width = <4>;
        cap-sd-highspeed;
    };
};
mmc_add_host() is only called once, for mmc0, and I can't see the
output of any printk debugging I add to __mmc_stop_host() (maybe
because the serial console has been disabled by that point?).
It appears that what causes this hang is 'broken-cd', which enables
card-detect polling on mmc1. I can flip the CMD signal routing, making
mmc0 the microSD and mmc1 the eMMC, and when I do that there is no
hang. So I think that when polling is enabled on mmc1 but not on mmc0
(as above), the polling causes a hang after __mmc_stop_host() is
called for mmc0.
Any ideas?
Best Regards,
Tim
[1] https://lore.kernel.org/all/[email protected]/
+ Robert
On Thu, 2 Mar 2023 at 00:32, Tim Harvey <[email protected]> wrote:
>
> Greetings,
>
> I've encountered a hang on shutdown on octeontx (CN8030 SoC, THUNDERX
> architecture) that I bisected to commit 66c915d09b94 ("mmc: core:
> Disable card detect during shutdown").
>
> It looks like the OMP5 Pyra ran into this as well related to a
> malfunctioning driver [1]
>
> In the case of MMC_CAVIUM_THUNDERX the host controller supports
> multiple slots each having their own CMD signal but shared clk/data
> via the following dt:
>
> mmc@1,4 {
> compatible = "cavium,thunder-8890-mmc";
> reg = <0xc00 0x00 0x00 0x00 0x00>;
> #address-cells = <0x01>;
> #size-cells = <0x00>;
> clocks = <0x0b>;
>
> /* eMMC */
> mmc-slot@0 {
> compatible = "mmc-slot";
> reg = <0>;
> vmmc-supply = <&mmc_supply_3v3>;
> max-frequency = <35000000>;
> no-1-8-v;
> bus-width = <8>;
> no-sdio;
> no-sd;
> mmc-ddr-3_3v;
> cap-mmc-highspeed;
> };
>
> /* microSD */
> mmc-slot@1 {
> compatible = "mmc-slot";
> reg = <1>;
> vmmc-supply = <&mmc_supply_3v3>;
> max-frequency = <35000000>;
> no-1-8-v;
> broken-cd;
> bus-width = <4>;
> cap-sd-highspeed;
> };
> };
>
> mmc_add_host is only called once for mmc0 and I can't see any printk
That looks wrong. There needs to be one mmc host registered per slot,
otherwise things will, for sure, not work.
I suggest you have a closer look to see what goes on in thunder_mmc_probe().
> debugging added to __mmc_stop_host (maybe because serial/console has
> been disabled by that point?).
The serial console should work fine at this point, at least on those
systems that I have tested this code with.
Perhaps you added the debug print too late in the function, if the
calls to disable_irq() or cancel_delayed_work_sync() are hanging?
>
> It appears that what causes this hang is the 'broken-cd' which enables
> the detect change polling on mmc1. I have the ability to flip the CMD
> signal routing thus making mmc0 the microSD and mmc1 the eMMC and when
> I do that there isn't an issue so I think what happens is in the case
> where mmc polling is enabled on mmc1 but not mmc0 (as above) the
> polling causes a hang after __mmc_stop_host() is called for mmc0.
The code in __mmc_stop_host() has been tested for both polling and
gpio card detections. That said, it looks to me that there is
something weird going on in the cavium mmc driver.
What makes this even trickier is that it's uncommon and not
recommended to use more than one mmc slot per host instance.
>
> Any ideas?
I hope the above thoughts can point you in a direction to narrow down
this problem.
>
> Best Regards,
>
> Tim
>
> [1] https://lore.kernel.org/all/[email protected]/
Kind regards
Uffe
On Thu, Mar 2, 2023 at 2:37 AM Ulf Hansson <[email protected]> wrote:
>
> + Robert
>
> On Thu, 2 Mar 2023 at 00:32, Tim Harvey <[email protected]> wrote:
> >
> > Greetings,
> >
> > I've encountered a hang on shutdown on octeontx (CN8030 SoC, THUNDERX
> > architecture) that I bisected to commit 66c915d09b94 ("mmc: core:
> > Disable card detect during shutdown").
> >
> > It looks like the OMP5 Pyra ran into this as well related to a
> > malfunctioning driver [1]
> >
> > In the case of MMC_CAVIUM_THUNDERX the host controller supports
> > multiple slots each having their own CMD signal but shared clk/data
> > via the following dt:
> >
> > mmc@1,4 {
> > compatible = "cavium,thunder-8890-mmc";
> > reg = <0xc00 0x00 0x00 0x00 0x00>;
> > #address-cells = <0x01>;
> > #size-cells = <0x00>;
> > clocks = <0x0b>;
> >
> > /* eMMC */
> > mmc-slot@0 {
> > compatible = "mmc-slot";
> > reg = <0>;
> > vmmc-supply = <&mmc_supply_3v3>;
> > max-frequency = <35000000>;
> > no-1-8-v;
> > bus-width = <8>;
> > no-sdio;
> > no-sd;
> > mmc-ddr-3_3v;
> > cap-mmc-highspeed;
> > };
> >
> > /* microSD */
> > mmc-slot@1 {
> > compatible = "mmc-slot";
> > reg = <1>;
> > vmmc-supply = <&mmc_supply_3v3>;
> > max-frequency = <35000000>;
> > no-1-8-v;
> > broken-cd;
> > bus-width = <4>;
> > cap-sd-highspeed;
> > };
> > };
> >
> > mmc_add_host is only called once for mmc0 and I can't see any printk
>
> That looks wrong. There needs to be one mmc host registered per slot,
> otherwise things will, for sure, not work.
>
> I suggest you have a closer look to see what goes on in thunder_mmc_probe().
>
Ulf,
Sorry, I was mistaken. Each slot does get its own mmc host.
I find that with thunderx_mmc I can reproduce this hang on shutdown
even if I just have a single slot with broken-cd defined.
I wonder if it has to do with thunder_mmc_probe getting called
multiple times because it defers due to gpio/regulator not yet being
available:
[ 6.846262] thunderx_mmc 0000:01:01.4: Adding to iommu group 1
[ 6.852143] thunder_mmc_probe
[ 6.855622] thunder_mmc_probe scanning slots
[ 6.860137] mmc_alloc_host: mmc0 init delayed work
[ 6.864938] cvm_mmc_of_slot_probe mmc0
[ 6.868695] cvm_mmc_of_slot_probe mmc0 Failed: EPROBE_DEFER
[ 6.874269] mmc_free_host: mmc0
[ 6.877481] thunder_mmc_probe Failed: EPROBE_DEFER
...
[ 7.737536] gpio_thunderx 0000:00:06.0: Adding to iommu group 16
[ 7.745252] gpio gpiochip0: (gpio_thunderx): not an immutable chip, please consider fixing it!
[ 7.754096] gpio_thunderx 0000:00:06.0: ThunderX GPIO: 48 lines with base 512.
...
[ 7.946636] thunder_mmc_probe
[ 7.950125] thunder_mmc_probe scanning slots
[ 7.954597] mmc_alloc_host: mmc0 init delayed work
[ 7.959399] cvm_mmc_of_slot_probe mmc0
[ 7.963158] cvm_mmc_of_slot_probe mmc0 Failed: EPROBE_DEFER
[ 7.968732] mmc_free_host: mmc0
[ 7.971963] thunder_mmc_probe Failed: EPROBE_DEFER
...
[ 7.998271] reg_fixed_voltage_probe
[ 8.001773] reg-fixed-voltage mmc_supply_3v3: reg_fixed_voltage_probe
[ 8.008360] reg-fixed-voltage mmc_supply_3v3: mmc_supply_3v3 supplying 3300000uV
[ 8.015851] thunder_mmc_probe
[ 8.019318] thunder_mmc_probe scanning slots
[ 8.023794] mmc_alloc_host: mmc0 init delayed work
[ 8.028596] cvm_mmc_of_slot_probe mmc0
[ 8.032488] mmc_add_host: mmc0
[ 8.060655] cvm_mmc_of_slot_probe mmc0 ok
[ 8.064678] thunderx_mmc 0000:01:01.4: probed
[ 8.069041] mmc_rescan: mmc0 irq=-22
> > debugging added to __mmc_stop_host (maybe because serial/console has
> > been disabled by that point?).
>
> The serial console should work fine at this point, at least on those
> systems that I have tested this code with.
>
> Perhaps you added the debug print too late in the function, if the
> calls to disable_irq() or cancel_delayed_work_sync() are hanging?
>
This had something to do with busybox reboot. I switched to using
sysrq (echo o > /proc/sysrq-trigger) to power off and now I can see my
printk output.
> >
> > It appears that what causes this hang is the 'broken-cd' which enables
> > the detect change polling on mmc1. I have the ability to flip the CMD
> > signal routing thus making mmc0 the microSD and mmc1 the eMMC and when
> > I do that there isn't an issue so I think what happens is in the case
> > where mmc polling is enabled on mmc1 but not mmc0 (as above) the
> > polling causes a hang after __mmc_stop_host() is called for mmc0.
>
> The code in __mmc_stop_host() has been tested for both polling and
> gpio card detections. That said, it looks to me that there is
> something weird going on in the cavium mmc driver.
>
> What makes this even tricker, is that it's uncommon and not
> recommended to use more than one mmc slot per host instance.
>
That was my mistake... there is one host instance per slot, and I see
this even if I only have 1 slot, as long as polling is enabled.
Now that I can see my printk output I can confirm it hangs when
__mmc_stop_host() calls cancel_delayed_work_sync():
# echo o > /proc/sysrq-trigger
[ 210.370200] sysrq: Power Off
[ 210.373147] kernel_shutdown_prepare
[ 210.896927] mmc_rescan: mmc0 irq=-22
[ 213.038191] mmc_host_classdev_shutdown mmc0
[ 213.042384] __mmc_stop_host: mmc0 cd_irq=-22
[ 213.046658] __mmc_stop_host: mmc0 calling cancel_delayed_work_sync
^^^ never comes back
If I comment out the call to cancel_delayed_work_sync() in
__mmc_stop_host(), then shutdown does not hang, so I think it has
something to do with mmc_alloc_host() setting up the polling multiple
times.
Best Regards,
Tim
> >
> > Any ideas?
>
> I hope the above thoughts can point you in a direction to narrow down
> this problem.
>
> >
> > Best Regards,
> >
> > Tim
> >
> > [1] https://lore.kernel.org/all/[email protected]/
>
> Kind regards
> Uffe
On Sat, 4 Mar 2023 at 00:38, Tim Harvey <[email protected]> wrote:
>
> On Thu, Mar 2, 2023 at 2:37 AM Ulf Hansson <[email protected]> wrote:
> >
> > + Robert
> >
> > On Thu, 2 Mar 2023 at 00:32, Tim Harvey <[email protected]> wrote:
> > >
> > > Greetings,
> > >
> > > I've encountered a hang on shutdown on octeontx (CN8030 SoC, THUNDERX
> > > architecture) that I bisected to commit 66c915d09b94 ("mmc: core:
> > > Disable card detect during shutdown").
> > >
> > > It looks like the OMP5 Pyra ran into this as well related to a
> > > malfunctioning driver [1]
> > >
> > > In the case of MMC_CAVIUM_THUNDERX the host controller supports
> > > multiple slots each having their own CMD signal but shared clk/data
> > > via the following dt:
> > >
> > > mmc@1,4 {
> > > compatible = "cavium,thunder-8890-mmc";
> > > reg = <0xc00 0x00 0x00 0x00 0x00>;
> > > #address-cells = <0x01>;
> > > #size-cells = <0x00>;
> > > clocks = <0x0b>;
> > >
> > > /* eMMC */
> > > mmc-slot@0 {
> > > compatible = "mmc-slot";
> > > reg = <0>;
> > > vmmc-supply = <&mmc_supply_3v3>;
> > > max-frequency = <35000000>;
> > > no-1-8-v;
> > > bus-width = <8>;
> > > no-sdio;
> > > no-sd;
> > > mmc-ddr-3_3v;
> > > cap-mmc-highspeed;
> > > };
> > >
> > > /* microSD */
> > > mmc-slot@1 {
> > > compatible = "mmc-slot";
> > > reg = <1>;
> > > vmmc-supply = <&mmc_supply_3v3>;
> > > max-frequency = <35000000>;
> > > no-1-8-v;
> > > broken-cd;
> > > bus-width = <4>;
> > > cap-sd-highspeed;
> > > };
> > > };
> > >
> > > mmc_add_host is only called once for mmc0 and I can't see any printk
> >
> > That looks wrong. There needs to be one mmc host registered per slot,
> > otherwise things will, for sure, not work.
> >
> > I suggest you have a closer look to see what goes on in thunder_mmc_probe().
> >
>
> Ulf,
>
> Sorry, I was mistaken. Each slot does get its own mmc host.
>
> I find that with thunderx_mmc I can reproduce this hang on shutdown
> even if I just have a single slot with broken-cd defined.
Okay, that's a step in the right direction to narrow down the problem!
>
> I wonder if it has to do with thunder_mmc_probe getting called
> multiple times because it defers due to gpio/regulator not yet being
> available:
> [ 6.846262] thunderx_mmc 0000:01:01.4: Adding to iommu group 1
> [ 6.852143] thunder_mmc_probe
> [ 6.855622] thunder_mmc_probe scanning slots
> [ 6.860137] mmc_alloc_host: mmc0 init delayed work
> [ 6.864938] cvm_mmc_of_slot_probe mmc0
> [ 6.868695] cvm_mmc_of_slot_probe mmc0 Failed: EPROBE_DEFER
> [ 6.874269] mmc_free_host: mmc0
> [ 6.877481] thunder_mmc_probe Failed: EPROBE_DEFER
> ...
> [ 7.737536] gpio_thunderx 0000:00:06.0: Adding to iommu group 16
> [ 7.745252] gpio gpiochip0: (gpio_thunderx): not an immutable chip,
> please consider fixing it!
> [ 7.754096] gpio_thunderx 0000:00:06.0: ThunderX GPIO: 48 lines
> with base 512.
> ...
> [ 7.946636] thunder_mmc_probe
> [ 7.950125] thunder_mmc_probe scanning slots
> [ 7.954597] mmc_alloc_host: mmc0 init delayed work
> [ 7.959399] cvm_mmc_of_slot_probe mmc0
> [ 7.963158] cvm_mmc_of_slot_probe mmc0 Failed: EPROBE_DEFER
> [ 7.968732] mmc_free_host: mmc0
> [ 7.971963] thunder_mmc_probe Failed: EPROBE_DEFER
> ...
> [ 7.998271] reg_fixed_voltage_probe
> [ 8.001773] reg-fixed-voltage mmc_supply_3v3: reg_fixed_voltage_probe
> [ 8.008360] reg-fixed-voltage mmc_supply_3v3: mmc_supply_3v3
> supplying 3300000uV
> [ 8.015851] thunder_mmc_probe
> [ 8.019318] thunder_mmc_probe scanning slots
> [ 8.023794] mmc_alloc_host: mmc0 init delayed work
> [ 8.028596] cvm_mmc_of_slot_probe mmc0
> [ 8.032488] mmc_add_host: mmc0
> [ 8.060655] cvm_mmc_of_slot_probe mmc0 ok
> [ 8.064678] thunderx_mmc 0000:01:01.4: probed
> [ 8.069041] mmc_rescan: mmc0 irq=-22
I can't really tell from the above log whether the error path in
->probe() is working correctly. I don't see any obvious problem here.
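
One thing the log above does allow is a simple bookkeeping check: every
"mmc_alloc_host" debug line should be matched by either an
"mmc_free_host" or an "mmc_add_host" line. Below is a minimal userspace
sketch of that check, assuming the debug lines exactly as quoted in
this thread. The names probe_tally, tally_probe_log and
probe_log_balanced are hypothetical helpers, not kernel or driver APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts of the three debug messages seen in the probe log. */
struct probe_tally {
    int allocs; /* "mmc_alloc_host:" lines */
    int frees;  /* "mmc_free_host:" lines  */
    int adds;   /* "mmc_add_host:" lines   */
};

/* Hypothetical helper: scan log lines and count the host lifecycle events. */
static struct probe_tally tally_probe_log(const char *const *lines, size_t n)
{
    struct probe_tally t = { 0, 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (strstr(lines[i], "mmc_alloc_host:"))
            t.allocs++;
        else if (strstr(lines[i], "mmc_free_host:"))
            t.frees++;
        else if (strstr(lines[i], "mmc_add_host:"))
            t.adds++;
    }
    return t;
}

/* Each allocated host must end up either freed or registered. */
static int probe_log_balanced(const struct probe_tally *t)
{
    return t->allocs == t->frees + t->adds;
}

/* The mmc-related debug lines from the boot log quoted in this thread. */
static const char *const thread_log[] = {
    "[ 6.860137] mmc_alloc_host: mmc0 init delayed work",
    "[ 6.868695] cvm_mmc_of_slot_probe mmc0 Failed: EPROBE_DEFER",
    "[ 6.874269] mmc_free_host: mmc0",
    "[ 7.954597] mmc_alloc_host: mmc0 init delayed work",
    "[ 7.963158] cvm_mmc_of_slot_probe mmc0 Failed: EPROBE_DEFER",
    "[ 7.968732] mmc_free_host: mmc0",
    "[ 8.023794] mmc_alloc_host: mmc0 init delayed work",
    "[ 8.032488] mmc_add_host: mmc0",
    "[ 8.060655] cvm_mmc_of_slot_probe mmc0 ok",
};
static const size_t thread_log_len = sizeof(thread_log) / sizeof(thread_log[0]);
```

Calling tally_probe_log(thread_log, thread_log_len) and passing the
result to probe_log_balanced() checks the counts. Balanced counts only
show that no host was leaked across the deferred probes; they say
nothing about the state of the delayed work itself, which is why the
log alone is not conclusive.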
>
> > > debugging added to __mmc_stop_host (maybe because serial/console has
> > > been disabled by that point?).
> >
> > The serial console should work fine at this point, at least on those
> > systems that I have tested this code with.
> >
> > Perhaps you added the debug print too late in the function, if the
> > calls to disable_irq() or cancel_delayed_work_sync() are hanging?
> >
>
> This was something to do with busybox reboot. I switched to using
> sysrq (echo o > /proc/sysrq-trigger) to reboot and now I can see my
> printk's
Okay.
>
> > >
> > > It appears that what causes this hang is the 'broken-cd' which enables
> > > the detect change polling on mmc1. I have the ability to flip the CMD
> > > signal routing thus making mmc0 the microSD and mmc1 the eMMC and when
> > > I do that there isn't an issue so I think what happens is in the case
> > > where mmc polling is enabled on mmc1 but not mmc0 (as above) the
> > > polling causes a hang after __mmc_stop_host() is called for mmc0.
> >
> > The code in __mmc_stop_host() has been tested for both polling and
> > gpio card detections. That said, it looks to me that there is
> > something weird going on in the cavium mmc driver.
> >
> > What makes this even tricker, is that it's uncommon and not
> > recommended to use more than one mmc slot per host instance.
> >
>
> that was my mistake... there is one host instance per slot and I see
> this even if I only have 1 slot as long as polling is enabled.
Okay.
>
> now that I can see my printk's I can confirm it hangs when
> _mmc_stop_host calls the cancel_delayed_work_sync:
> # echo o > /proc/sysrq-trigger
> [ 210.370200] sysrq: Power Off
> [ 210.373147] kernel_shutdown_prepare
> [ 210.896927] mmc_rescan: mmc0 irq=-22
> [ 213.038191] mmc_host_classdev_shutdown mmc0
> [ 213.042384] __mmc_stop_host: mmc0 cd_irq=-22
> [ 213.046658] __mmc_stop_host: mmc0 calling cancel_delayed_work_sync
> ^^^ never comes back
Unless I am missing something, that should mean that mmc_rescan() is
hanging somewhere, since cancel_delayed_work_sync() waits for a
running work item to finish. Before the shutdown, did you try to
insert an SD card to verify that it was detected properly?
I suggest you debug mmc_rescan() to try to understand where exactly it hangs.
>
> If I comment out the call to cancel_delayed_work_sync in
> __mmc_stop_host then shutdown does not hang so I think it has
> something to do with mmc_alloc_host setting up the polling multiple
> times.
I am not so sure; the error path in ->probe() doesn't look that broken
to me. At least, it's difficult to say from the logs that you have
provided.
[...]
Kind regards
Uffe