2021-05-26 15:11:42

by Dmitry Osipenko

Subject: [BUG] brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout (WiFi dies)

Hello,

After updating to Ubuntu 21.04, I found two problems related to BRCMF_C_GET_ASSOCLIST on an older BCM4329 SDIO WiFi chip.

1. The kernel is spammed with:

ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST unsupported, err=-52
ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST unsupported, err=-52
ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST unsupported, err=-52

This apparently happens because a newer NetworkManager version calls dump_station() periodically. I sent a patch series [1] that fixes this noise.

[1] https://patchwork.kernel.org/project/linux-wireless/list/?series=480715

2. The other, much worse problem is that WiFi now eventually dies with these errors:

...
ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST unsupported, err=-52
brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout
ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST unsupported, err=-110
ieee80211 phy0: brcmf_proto_bcdc_query_dcmd: brcmf_proto_bcdc_msg failed w/status -110

From this point on, all firmware calls fail with err=-110 and WiFi no longer works. The problem is reproducible with 5.13-rc and current -next; I haven't checked older kernel versions. For some reason it is worse with a recent -next: WiFi dies sooner.
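For reference, the two error values in the logs above can be decoded as follows. This is a minimal sketch, assuming the generic Linux errno numbering; the helper name `brcmf_log_errname` and the `LINUX_*` constants are hypothetical and not part of the driver.

```c
#include <stddef.h>

/* Values taken from the generic Linux errno tables (an assumption:
 * some architectures number errno values differently). */
enum {
	LINUX_EBADE     = 52,   /* "Invalid exchange" */
	LINUX_ETIMEDOUT = 110,  /* "Connection timed out" */
};

/* Hypothetical helper: map a negative kernel-style error code from
 * the log to its symbolic name. Returns NULL for any code other than
 * the two discussed in this report. */
static const char *brcmf_log_errname(int err)
{
	switch (-err) {
	case LINUX_EBADE:
		return "EBADE";
	case LINUX_ETIMEDOUT:
		return "ETIMEDOUT";
	default:
		return NULL;
	}
}
```

Under that assumption, err=-52 is a rejected request and err=-110 is a request that received no answer in time, which matches the shift from "unsupported" noise to a dead link.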

What's interesting is that there is always a pending signal in brcmf_sdio_dcmd_resp_wait() when the timeout happens. The timeout seems to occur when the swap partition is accessed, which stalls the system for a second or two, but this correlation is not 100%. Increasing DCMD_RESP_TIMEOUT doesn't help.
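The pending-signal observation suggests separating "interrupted by a signal" from "genuinely timed out" when diagnosing this. The following is a simplified userspace model, not the driver's actual code; every name in it is hypothetical, and it assumes the wait reports three facts when it returns (whether the awaited condition is set, whether a signal is pending, and how much time is left).

```c
#include <stdbool.h>

/* Possible reasons a timed, interruptible wait can end. */
enum wait_outcome {
	WAIT_GOT_RESPONSE,  /* the awaited condition became true */
	WAIT_INTERRUPTED,   /* a signal was pending for the waiter */
	WAIT_TIMED_OUT,     /* the time budget ran out */
	WAIT_SPURIOUS,      /* woke up with none of the above */
};

/* Hypothetical classifier. The condition is checked first so that a
 * response that did arrive is never reported as a failure; a pending
 * signal is checked before the timeout so that an early wake-up is
 * not mistaken for an expired timer. */
static enum wait_outcome classify_wait(bool condition_met,
				       bool signal_pending,
				       long time_left)
{
	if (condition_met)
		return WAIT_GOT_RESPONSE;
	if (signal_pending)
		return WAIT_INTERRUPTED;
	if (time_left == 0)
		return WAIT_TIMED_OUT;
	return WAIT_SPURIOUS;
}
```

With this ordering, the case described above (no response, signal pending) is classified as an interruption rather than a timeout. That would be consistent with a larger DCMD_RESP_TIMEOUT making no difference, although the model alone does not prove it.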

Please let me know if you have any ideas on how to fix this properly or if you need any more info.

Removing BRCMF_C_GET_ASSOCLIST firmware call entirely from the driver fixes the problem.

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index f4405d7861b6..6327cb38d6ec 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -2886,22 +2886,6 @@ brcmf_cfg80211_dump_station(struct wiphy *wiphy, struct net_device *ndev,

 	brcmf_dbg(TRACE, "Enter, idx %d\n", idx);

-	if (idx == 0) {
-		cfg->assoclist.count = cpu_to_le32(BRCMF_MAX_ASSOCLIST);
-		err = brcmf_fil_cmd_data_get(ifp, BRCMF_C_GET_ASSOCLIST,
-					     &cfg->assoclist,
-					     sizeof(cfg->assoclist));
-		if (err) {
-			bphy_err(drvr, "BRCMF_C_GET_ASSOCLIST unsupported, err=%d\n",
-				 err);
-			cfg->assoclist.count = 0;
-			return -EOPNOTSUPP;
-		}
-	}
-	if (idx < le32_to_cpu(cfg->assoclist.count)) {
-		memcpy(mac, cfg->assoclist.mac[idx], ETH_ALEN);
-		return brcmf_cfg80211_get_station(wiphy, ndev, mac, sinfo);
-	}
 	return -ENOENT;
 }
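
The patch above leaves brcmf_cfg80211_dump_station() returning -ENOENT for every index. As a rough illustration of what that means for a caller, here is a userspace model of an index-based dump loop that stops at the first -ENOENT. It is a sketch under the assumption that the caller iterates idx = 0, 1, 2, ... until -ENOENT; the names `dump_station_fn`, `count_stations`, `patched_dump_station` and `MODEL_ENOENT` are hypothetical.

```c
#include <stddef.h>

#define MODEL_ENOENT 2  /* generic Linux value of ENOENT (assumed) */

/* Hypothetical callback type: report entry `idx` and return 0, or
 * return -MODEL_ENOENT when idx is past the last entry. */
typedef int (*dump_station_fn)(int idx, void *ctx);

/* Call `dump` with increasing indices until it fails. Stores the
 * number of entries seen in *count. Returns 0 if the loop ended with
 * -MODEL_ENOENT (normal end of list), otherwise the error code. */
static int count_stations(dump_station_fn dump, void *ctx, int *count)
{
	int idx = 0;
	int err;

	for (;;) {
		err = dump(idx, ctx);
		if (err)
			break;
		idx++;
	}

	*count = idx;
	return err == -MODEL_ENOENT ? 0 : err;
}

/* Models the function after the patch: with the assoclist query
 * removed, every index is reported as past the end of the list. */
static int patched_dump_station(int idx, void *ctx)
{
	(void)idx;
	(void)ctx;
	return -MODEL_ENOENT;
}
```

In this model the patched callback yields an empty station list without ever touching the firmware, which is why the patch is a diagnostic workaround rather than a proper fix.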





2021-05-28 00:17:07

by Dmitry Osipenko

Subject: Re: [BUG] brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout (WiFi dies)

27.05.2021 19:42, Arend van Spriel wrote:
> On 5/26/2021 5:10 PM, Dmitry Osipenko wrote:
>> Hello,
>>
>> After updating to Ubuntu 21.04 I found two problems related to the
>> BRCMF_C_GET_ASSOCLIST using an older BCM4329 SDIO WiFi.
>>
>> 1. The kernel is spammed with:
>>
>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>> unsupported, err=-52
>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>> unsupported, err=-52
>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>> unsupported, err=-52
>>
>> Which happens apparently due to a newer NetworkManager version that
>> pokes dump_station() periodically. I sent [1] that fixes this noise.
>>
>> [1]
>> https://patchwork.kernel.org/project/linux-wireless/list/?series=480715
>
> Right. I noticed this one and did not have anything to add to the
> review/suggestion.

Please feel free to add your r-b to the patches if they look good to you.

>> 2. The other much worse problem is that WiFi eventually dies now with
>> these errors:
>>
>> ...
>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>> unsupported, err=-52
>>   brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout
>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>> unsupported, err=-110
>>   ieee80211 phy0: brcmf_proto_bcdc_query_dcmd: brcmf_proto_bcdc_msg
>> failed w/status -110
>>
>>  From this point all firmware calls start to fail with err=-110 and
>> WiFi doesn't work anymore. This problem is reproducible with 5.13-rc
>> and current -next, I haven't checked older kernel versions. Somehow
>> it's worse using a recent -next, WiFi dies quicker.
>>
>> What's interesting is that I see that there is always a pending signal
>> in brcmf_sdio_dcmd_resp_wait() when timeout happens. It looks like the
>> timeout happens when there is access to a swap partition, which stalls
>> system for a second or two, but this is not 100%. Increasing
>> DCMD_RESP_TIMEOUT doesn't help.
>
> The timeout error (-110) can have two root causes that I am aware of.
> Either the firmware died or the SDIO layer has gone haywire. Not sure if
> that swap partition is on eMMC device, but if so it could be related.
> You could try generating device coredump. If that also gives -110 errors
> we know it is the SDIO layer.

A coredump is a good idea, thank you. The swap partition is on an external
SD card; everything else is on eMMC.
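
The triage suggested in the quoted reply can be written down as a small decision function. This is only a sketch of that reasoning: the first branch restates the suggestion (a coredump attempt that also fails with -110 points at the SDIO layer), while the second branch is an inference from the "two root causes" framing and not something stated in the thread. All names are hypothetical.

```c
/* Which component the evidence points at. */
enum suspect {
	SUSPECT_SDIO_LAYER,
	SUSPECT_FIRMWARE,
	SUSPECT_UNKNOWN,
};

#define MODEL_ETIMEDOUT 110  /* generic Linux value of ETIMEDOUT (assumed) */

/* Hypothetical triage step: `coredump_err` is the result of trying
 * to generate a device coredump after WiFi has died (0 on success,
 * negative error code on failure). */
static enum suspect triage_after_coredump(int coredump_err)
{
	/* The coredump attempt timed out as well: the bus itself is
	 * not delivering, so the SDIO layer is the suspect. */
	if (coredump_err == -MODEL_ETIMEDOUT)
		return SUSPECT_SDIO_LAYER;

	/* Inference, not from the thread: if the dump works, the bus
	 * is still usable, leaving the firmware as the likelier cause. */
	if (coredump_err == 0)
		return SUSPECT_FIRMWARE;

	return SUSPECT_UNKNOWN;
}
```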

>> Please let me know if you have any ideas of how to fix this trouble
>> properly or if you need any more info.
>>
>> Removing BRCMF_C_GET_ASSOCLIST firmware call entirely from the driver
>> fixes the problem.
>
> My guess is that reducing interaction with firmware is what is avoiding
> the issue and not so much this specific firmware command. As always it
> is good to know the conditions in which the issue occurs. What is the
> hardware platform you are running Ubuntu on? Stuff like that.

That's an older Acer A500 NVIDIA Tegra20 tablet device [1]. I may also
try to reproduce the problem on a Tegra30 Nexus 7 with BCM4330.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts

Thank you very much for the suggestions. I will try to collect more info
and come back with a report.

2021-06-18 20:07:13

by Dmitry Osipenko

Subject: Re: [BUG] brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout (WiFi dies)

28.05.2021 01:47, Dmitry Osipenko wrote:
> 27.05.2021 19:42, Arend van Spriel wrote:
>> On 5/26/2021 5:10 PM, Dmitry Osipenko wrote:
>>> Hello,
>>>
>>> After updating to Ubuntu 21.04 I found two problems related to the
>>> BRCMF_C_GET_ASSOCLIST using an older BCM4329 SDIO WiFi.
>>>
>>> 1. The kernel is spammed with:
>>>
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>
>>> Which happens apparently due to a newer NetworkManager version that
>>> pokes dump_station() periodically. I sent [1] that fixes this noise.
>>>
>>> [1]
>>> https://patchwork.kernel.org/project/linux-wireless/list/?series=480715
>>
>> Right. I noticed this one and did not have anything to add to the
>> review/suggestion.
>
> Please feel free to add your r-b to the patches if they look good to you.
>
>>> 2. The other much worse problem is that WiFi eventually dies now with
>>> these errors:
>>>
>>> ...
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>   brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-110
>>>   ieee80211 phy0: brcmf_proto_bcdc_query_dcmd: brcmf_proto_bcdc_msg
>>> failed w/status -110
>>>
>>>  From this point all firmware calls start to fail with err=-110 and
>>> WiFi doesn't work anymore. This problem is reproducible with 5.13-rc
>>> and current -next, I haven't checked older kernel versions. Somehow
>>> it's worse using a recent -next, WiFi dies quicker.
>>>
>>> What's interesting is that I see that there is always a pending signal
>>> in brcmf_sdio_dcmd_resp_wait() when timeout happens. It looks like the
>>> timeout happens when there is access to a swap partition, which stalls
>>> system for a second or two, but this is not 100%. Increasing
>>> DCMD_RESP_TIMEOUT doesn't help.
>>
>> The timeout error (-110) can have two root causes that I am aware of.
>> Either the firmware died or the SDIO layer has gone haywire. Not sure if
>> that swap partition is on eMMC device, but if so it could be related.
>> You could try generating device coredump. If that also gives -110 errors
>> we know it is the SDIO layer.
>
> Coredump is a good idea, thank you. The swap partition is on external SD
> card, everything else is on eMMC.
>
>>> Please let me know if you have any ideas of how to fix this trouble
>>> properly or if you need any more info.
>>>
>>> Removing BRCMF_C_GET_ASSOCLIST firmware call entirely from the driver
>>> fixes the problem.
>>
>> My guess is that reducing interaction with firmware is what is avoiding
>> the issue and not so much this specific firmware command. As always it
>> is good to know the conditions in which the issue occurs. What is the
>> hardware platform you are running Ubuntu on? Stuff like that.
>
> That's an older Acer A500 NVIDIA Tegra20 tablet device [1]. I may also
> try to reproduce problem on Tegra30 Nexus 7 with BCM4330.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts
>
> Thank you very much for the suggestions. I will try to collect more info
> and come back with the report.
>

I have been testing this for the past few weeks and the problem is no
longer reproducible. Apparently something got fixed in linux-next. I
haven't tried to bisect the fix since that would be a bit too painful.

There are still occasional -110 errors when the system stalls while
swapping memory, but WiFi keeps working now.