2023-11-29 09:22:50

by Luca Ceresoli

Subject: Linux v6.6 sporadic reboot failures with ath9k on i.MX6Q

Hello,

for several weeks I have been investigating a sporadic reboot failure on a
custom board based on i.MX6Q. There is an ATH9K Wi-Fi card connected
over PCIe, and the main suspect is the ath9k driver.

Is anybody aware of this kind of bug with ath9k?

Some details about my tests follow.

This is on mainline v6.6 Linux, with only the board dts and a defconfig
added. The board dts itself is based on imx6q.dtsi and among others it
adds:

&pcie {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_pcie>;
	reset-gpio = <&gpio2 20 GPIO_ACTIVE_LOW>;
	status = "okay";
};

and:

&iomuxc {
	/* ... */
	imx6qdl-sabresd {
		/* ... */
		pinctrl_pcie: pciegrp {
			fsl,pins = <
				MX6QDL_PAD_EIM_A18__GPIO2_IO20 0x1b0b0
			>;
		};
		/* ... */
	};
};

Reboot usually works fine, but fails randomly in 1-5% of the
cases. The symptom is that the console stops producing any messages
at some random point in the shutdown sequence, even in the middle of a
line.
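
At failure rates this low, a configuration can only be called good after a long run of clean reboots. The following is a minimal sketch of that arithmetic, not the procedure used for the tests reported here. It assumes each reboot fails independently with a fixed probability, and the function name `reboots_needed` and the 95% confidence level are illustrative choices.

```python
import math


def reboots_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Number of consecutive clean reboots after which a configuration that
    really fails with probability `failure_rate` per reboot would have shown
    at least one failure with probability `confidence`.

    Assumption: reboots are independent and the failure rate is constant.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    # P(no failure in n reboots) = (1 - failure_rate) ** n
    # Solve (1 - failure_rate) ** n <= 1 - confidence for the smallest n.
    n = math.log(1.0 - confidence) / math.log(1.0 - failure_rate)
    return math.ceil(n)


if __name__ == "__main__":
    # The two ends of the observed 1-5% failure range.
    for rate in (0.01, 0.05):
        print(f"failure rate {rate:.0%}: {reboots_needed(rate)} clean reboots")
```

Under these assumptions a 1% failure rate needs roughly 300 clean reboots before "never fails" is a reasonably safe statement, while a 5% rate needs about 60.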

After about 7000 reboot attempts with different configurations it is
clear that enabling or disabling CONFIG_ATH9K is what makes the
difference:

1. kernels with CONFIG_ATH9K=n never fail
2. kernels with CONFIG_ATH9K=y do fail

Kernels built with CONFIG_ATH9K=y fail even with all the optional
CONFIG_ATH9K* options disabled (rfkill, pcoem, btcoex and no_eeprom).

Similarly:

1. removing pcie from the device tree makes reboot work
2. leaving pcie in the device tree and removing all the peripherals
not required for booting, reboot still fails

On top of v6.6 I have applied all the potentially related commits from
master that appear as of now (8 in total):

git log --oneline --reverse --format=%H v6.6..origin/master -- \
drivers/net/wireless/ath/*.[ch] drivers/net/wireless/ath/ath9k/ \
| xargs git cherry-pick

and reboot still fails.

I have also tested these mainline kernel versions, with no result:
v6.1.60, v5.15.137, v5.10.199, v5.10.

A first look at the ath9k driver code did not show anything obviously
wrong.

Any clues about how to further investigate would be very welcome.

I am obviously available to provide more info.

Best regards,
Luca

--
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


2023-11-30 05:11:35

by Florian Fainelli

Subject: Re: Linux v6.6 sporadic reboot failures with ath9k on i.MX6Q



On 11/29/2023 1:22 AM, Luca Ceresoli wrote:
> Any clues about how to further investigate would be very welcome.
>
> I am obviously available to provide more info.

Do you have a reboot log with "initcall_debug debug" set on the kernel
command line and if so, does it always point to the PCI bus shutting
down the device drivers, pcie ports and ultimately the root complex?

We have seen something similar before with ath10k_pci and our
pcie-brcmstb driver, which eventually turned out to be the result of
incorrect assumptions made while implementing the
platform_driver::shutdown routine. There was a hard hang in
ath10k_remove(). I do not recall the details, but we were definitely
doing something improper there.

imx6_pcie_shutdown() seems to be much simpler, but my first guess would
be there.

Hope this helps.
--
Florian

2023-12-01 15:26:38

by Luca Ceresoli

Subject: Re: Linux v6.6 sporadic reboot failures with ath9k on i.MX6Q

Hello Florian,

On Wed, 29 Nov 2023 21:10:44 -0800
Florian Fainelli wrote:

> Do you have a reboot log with "initcall_debug debug" set on the kernel
> command line and if so, does it always point to the PCI bus shutting
> down the device drivers, pcie ports and ultimately the root complex?
>
> We have seen something similar before with ath10k_pci and our
> pcie-brcmstb driver, which eventually turned out to be the result of
> incorrect assumptions made while implementing the
> platform_driver::shutdown routine. There was a hard hang in
> ath10k_remove(). I do not recall the details, but we were definitely
> doing something improper there.
>
> imx6_pcie_shutdown() seems to be much simpler, but my first guess would
> be there.

I had attempted using initcall_debug but the hang was happening on a
different line across tests, so it did not reliably point to a specific
place. Perhaps the serial port just stopped working before being able
to flush the last few lines.
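
One way to check whether the stopping point really varies, or only seems to because the timestamps differ between runs, is to tally the last complete line of each captured console log. The sketch below is only an illustration and was not part of the tests described above. It assumes each reboot's console output was saved as text with the usual "[seconds.fraction]" printk timestamp prefix; the function names and the sample strings in the usage part are hypothetical.

```python
import re
from collections import Counter

# Leading printk timestamp such as "[   12.345678] ".
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def last_complete_line(log_text):
    """Return (last fully printed line without timestamp, ended_mid_line).

    A line counts as complete only if it was terminated by a newline;
    whatever follows the final newline is treated as a truncated line.
    """
    pieces = log_text.split("\n")
    # The final piece is empty when the log ended cleanly on a newline.
    ended_mid_line = pieces[-1].strip() != ""
    complete = [p.rstrip("\r") for p in pieces[:-1] if p.strip()]
    last = _TIMESTAMP.sub("", complete[-1]) if complete else ""
    return last, ended_mid_line


def tally_hang_points(logs):
    """Count how often each last complete line occurs across captured logs."""
    return Counter(last_complete_line(text)[0] for text in logs)


if __name__ == "__main__":
    # Hypothetical captures, not real kernel output.
    captures = [
        "[    1.000000] step A\r\n[    1.200000] step B\r\n[    1.3000",
        "[    2.000000] step A\n[    2.500000] step B\n",
        "[    3.100000] step A\nste",
    ]
    for line, count in tally_hang_points(captures).most_common():
        print(count, repr(line))
```

If the tally is spread evenly over many unrelated lines, that would be consistent with the console simply losing its last buffered output rather than with a hang at one specific callback.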

I will have a look at the shutdown code, even though it did not seem to
be problematic.

Thank you for your hints.

Best regards,
Luca

--
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com