Hello folks,
I've just encountered weird bug when booting Linux v6.2.7
config: attached
dmesg: attached
I'm not sure exactly how to trigger this issue yet because it's not
stably reproducible. (just have encountered randomly when logging in)
At quick look it seems to be related to rtw89 wireless driver or network subsystem.
On 3/22/23 10:54, Hyeonggon Yoo wrote:
>
> Hello folks,
> I've just encountered weird bug when booting Linux v6.2.7
>
> config: attached
> dmesg: attached
>
> I'm not sure exactly how to trigger this issue yet because it's not
> stably reproducible. (just have encountered randomly when logging in)
>
> At quick look it seems to be related to rtw89 wireless driver or network subsystem.
Your bug is weird indeed, and it does come from rtw89_8852be. My distro has not
yet released kernel 6.2.7, but I have not seen this problem with mainline
kernels throughout the 6.2 or 6.3 development series.
One thing that would have been helpful is to include any rtw89 entries in the
dmesg log that appeared BEFORE the BUG. For instance, I have no idea what
firmware level you are running, etc.
If you are willing, I would like you to 'git clone
https://github.com/lwfinger/rtw89.git'. You would likely need to install the
packages needed to build an out-of-kernel driver as shown in README.md in the
repo. You would also need to blacklist rtw89_8852be, which would interfere with
rtw_8852be coming from the new repo. The code in the repo matches what will be
in kernel 6.3+, and it would be instructive if there is something in your system
that triggers this problem. I have never seen anything like this.
Larry
On Wed, 22 Mar 2023 at 18:03, Larry Finger <[email protected]> wrote:
>
> On 3/22/23 10:54, Hyeonggon Yoo wrote:
> >
> > Hello folks,
> > I've just encountered weird bug when booting Linux v6.2.7
> >
> > config: attached
> > dmesg: attached
> >
> > I'm not sure exactly how to trigger this issue yet because it's not
> > stably reproducible. (just have encountered randomly when logging in)
> >
> > At quick look it seems to be related to rtw89 wireless driver or network subsystem.
>
> Your bug is weird indeed, and it does come from rtw89_8852be. My distro has not
> yet released kernel 6.2.7, but I have not seen this problem with mainline
> kernels throughout the 6.2 or 6.3 development series.
Looking at the rtw89 driver's probe function, the bug is probably a
simple race condition:
int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
...
ret = rtw89_core_register(rtwdev); <- calls ieee80211_register_hw();
...
rtw89_core_napi_init(rtwdev);
...
}
so it registers the wifi device first, making it visible to userspace,
and then initializes napi.
So there is a window where a fast userspace may already try to
interact with the device before the driver got around to initializing
the napi parts, and then it explodes. At least that is my theory for
the issue.
Switching the order of these two functions should avoid it in theory,
as long as rtw89_core_napi_init() doesn't depend on anything
rtw89_core_register() does.
FWIW, registering the irq handler only after registering the device
also seems suspect, and should probably also happen before that.
Regards
Jonas
> -----Original Message-----
> From: Jonas Gorski <[email protected]>
> Sent: Thursday, March 23, 2023 4:52 AM
> To: Larry Finger <[email protected]>
> Cc: Hyeonggon Yoo <[email protected]>; [email protected]; [email protected]; Ping-Ke
> Shih <[email protected]>
> Subject: Re: [BUG v6.2.7] Hitting BUG_ON() on rtw89 wireless driver startup
>
> On Wed, 22 Mar 2023 at 18:03, Larry Finger <[email protected]> wrote:
> >
> > On 3/22/23 10:54, Hyeonggon Yoo wrote:
> > >
> > > Hello folks,
> > > I've just encountered weird bug when booting Linux v6.2.7
> > >
> > > config: attached
> > > dmesg: attached
> > >
> > > I'm not sure exactly how to trigger this issue yet because it's not
> > > stably reproducible. (just have encountered randomly when logging in)
> > >
> > > At quick look it seems to be related to rtw89 wireless driver or network subsystem.
> >
> > Your bug is weird indeed, and it does come from rtw89_8852be. My distro has not
> > yet released kernel 6.2.7, but I have not seen this problem with mainline
> > kernels throughout the 6.2 or 6.3 development series.
>
> Looking at the rtw89 driver's probe function, the bug is probably a
> simple race condition:
>
> int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> ...
> ret = rtw89_core_register(rtwdev); <- calls ieee80211_register_hw();
> ...
> rtw89_core_napi_init(rtwdev);
> ...
> }
>
> so it registers the wifi device first, making it visible to userspace,
> and then initializes napi.
>
> So there is a window where a fast userspace may already try to
> interact with the device before the driver got around to initializing
> the napi parts, and then it explodes. At least that is my theory for
> the issue.
>
> Switching the order of these two functions should avoid it in theory,
> as long as rtw89_core_napi_init() doesn't depend on anything
> rtw89_core_register() does.
>
> FWIW, registering the irq handler only after registering the device
> also seems suspect, and should probably also happen before that.
Adding a 10 seconds sleep between rtw89_core_register() and
rtw89_core_napi_init(), and I can reproduce this issue:
int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
...
ret = rtw89_core_register(rtwdev);
...
msleep(10 * 100);
...
rtw89_core_napi_init(rtwdev);
...
}
And, as your suggestion, I move the rtw89_core_register() to the last step
of PCI probe. Then, it looks more reasonable that we prepare NAPI and
interrupt handlers before registering netdev. Could you give it a try with
below fix?
diff --git a/pci.c b/pci.c
index fe6c0efc..087de2e0 100644
--- a/pci.c
+++ b/pci.c
@@ -3879,25 +3879,26 @@ int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
rtw89_pci_link_cfg(rtwdev);
rtw89_pci_l1ss_cfg(rtwdev);
- ret = rtw89_core_register(rtwdev);
- if (ret) {
- rtw89_err(rtwdev, "failed to register core\n");
- goto err_clear_resource;
- }
-
rtw89_core_napi_init(rtwdev);
ret = rtw89_pci_request_irq(rtwdev, pdev);
if (ret) {
rtw89_err(rtwdev, "failed to request pci irq\n");
- goto err_unregister;
+ goto err_deinit_napi;
+ }
+
+ ret = rtw89_core_register(rtwdev);
+ if (ret) {
+ rtw89_err(rtwdev, "failed to register core\n");
+ goto err_free_irq;
}
return 0;
-err_unregister:
+err_free_irq:
+ rtw89_pci_free_irq(rtwdev, pdev);
+err_deinit_napi:
rtw89_core_napi_deinit(rtwdev);
- rtw89_core_unregister(rtwdev);
err_clear_resource:
rtw89_pci_clear_resource(rtwdev, pdev);
err_declaim_pci:
--
Ping-Ke
On 3/22/23 19:59, Ping-Ke Shih wrote:
> diff --git a/pci.c b/pci.c
> index fe6c0efc..087de2e0 100644
> --- a/pci.c
> +++ b/pci.c
> @@ -3879,25 +3879,26 @@ int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> rtw89_pci_link_cfg(rtwdev);
> rtw89_pci_l1ss_cfg(rtwdev);
>
> - ret = rtw89_core_register(rtwdev);
> - if (ret) {
> - rtw89_err(rtwdev, "failed to register core\n");
> - goto err_clear_resource;
> - }
> -
> rtw89_core_napi_init(rtwdev);
>
> ret = rtw89_pci_request_irq(rtwdev, pdev);
> if (ret) {
> rtw89_err(rtwdev, "failed to request pci irq\n");
> - goto err_unregister;
> + goto err_deinit_napi;
> + }
> +
> + ret = rtw89_core_register(rtwdev);
> + if (ret) {
> + rtw89_err(rtwdev, "failed to register core\n");
> + goto err_free_irq;
> }
>
> return 0;
>
> -err_unregister:
> +err_free_irq:
> + rtw89_pci_free_irq(rtwdev, pdev);
> +err_deinit_napi:
> rtw89_core_napi_deinit(rtwdev);
> - rtw89_core_unregister(rtwdev);
> err_clear_resource:
> rtw89_pci_clear_resource(rtwdev, pdev);
> err_declaim_pci:
Hyeonggon,
I have tested the above patch and added it to my GitHub repo that I mentioned
earlier. Using the repo, you will be able to get the new code without patching
and regenerating an entire new kernel.
Larry
On 3/22/23 19:59, Ping-Ke Shih wrote:
>
>
>> -----Original Message-----
>> From: Jonas Gorski <[email protected]>
>> Sent: Thursday, March 23, 2023 4:52 AM
>> To: Larry Finger <[email protected]>
>> Cc: Hyeonggon Yoo <[email protected]>; [email protected]; [email protected]; Ping-Ke
>> Shih <[email protected]>
>> Subject: Re: [BUG v6.2.7] Hitting BUG_ON() on rtw89 wireless driver startup
>>
>> On Wed, 22 Mar 2023 at 18:03, Larry Finger <[email protected]> wrote:
>>>
>>> On 3/22/23 10:54, Hyeonggon Yoo wrote:
>>>>
>>>> Hello folks,
>>>> I've just encountered weird bug when booting Linux v6.2.7
>>>>
>>>> config: attached
>>>> dmesg: attached
>>>>
>>>> I'm not sure exactly how to trigger this issue yet because it's not
>>>> stably reproducible. (just have encountered randomly when logging in)
>>>>
>>>> At quick look it seems to be related to rtw89 wireless driver or network subsystem.
>>>
>>> Your bug is weird indeed, and it does come from rtw89_8852be. My distro has not
>>> yet released kernel 6.2.7, but I have not seen this problem with mainline
>>> kernels throughout the 6.2 or 6.3 development series.
>>
>> Looking at the rtw89 driver's probe function, the bug is probably a
>> simple race condition:
>>
>> int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>> {
>> ...
>> ret = rtw89_core_register(rtwdev); <- calls ieee80211_register_hw();
>> ...
>> rtw89_core_napi_init(rtwdev);
>> ...
>> }
>>
>> so it registers the wifi device first, making it visible to userspace,
>> and then initializes napi.
>>
>> So there is a window where a fast userspace may already try to
>> interact with the device before the driver got around to initializing
>> the napi parts, and then it explodes. At least that is my theory for
>> the issue.
>>
>> Switching the order of these two functions should avoid it in theory,
>> as long as rtw89_core_napi_init() doesn't depend on anything
>> rtw89_core_register() does.
>>
>> FWIW, registering the irq handler only after registering the device
>> also seems suspect, and should probably also happen before that.
>
> Adding a 10 seconds sleep between rtw89_core_register() and
> rtw89_core_napi_init(), and I can reproduce this issue:
>
> int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> ...
> ret = rtw89_core_register(rtwdev);
> ...
> msleep(10 * 100);
> ...
> rtw89_core_napi_init(rtwdev);
> ...
> }
>
> And, as your suggestion, I move the rtw89_core_register() to the last step
> of PCI probe. Then, it looks more reasonable that we prepare NAPI and
> interrupt handlers before registering netdev. Could you give it a try with
> below fix?
>
> diff --git a/pci.c b/pci.c
> index fe6c0efc..087de2e0 100644
> --- a/pci.c
> +++ b/pci.c
> @@ -3879,25 +3879,26 @@ int rtw89_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> rtw89_pci_link_cfg(rtwdev);
> rtw89_pci_l1ss_cfg(rtwdev);
>
> - ret = rtw89_core_register(rtwdev);
> - if (ret) {
> - rtw89_err(rtwdev, "failed to register core\n");
> - goto err_clear_resource;
> - }
> -
> rtw89_core_napi_init(rtwdev);
>
> ret = rtw89_pci_request_irq(rtwdev, pdev);
> if (ret) {
> rtw89_err(rtwdev, "failed to request pci irq\n");
> - goto err_unregister;
> + goto err_deinit_napi;
> + }
> +
> + ret = rtw89_core_register(rtwdev);
> + if (ret) {
> + rtw89_err(rtwdev, "failed to register core\n");
> + goto err_free_irq;
> }
>
> return 0;
>
> -err_unregister:
> +err_free_irq:
> + rtw89_pci_free_irq(rtwdev, pdev);
> +err_deinit_napi:
> rtw89_core_napi_deinit(rtwdev);
> - rtw89_core_unregister(rtwdev);
> err_clear_resource:
> rtw89_pci_clear_resource(rtwdev, pdev);
> err_declaim_pci:
Ping-Ke,
The patch works fine here, but I did not have the problem.
When you submit it, add a Tested-by: Larry Finger<[email protected]> and
a Reviewed-by for the same address.
@Jonas: Good catch.
Larry
> -----Original Message-----
> From: Larry Finger <[email protected]> On Behalf Of Larry Finger
> Sent: Thursday, March 23, 2023 9:41 AM
> To: Ping-Ke Shih <[email protected]>; Jonas Gorski <[email protected]>
> Cc: Hyeonggon Yoo <[email protected]>; [email protected]; [email protected]
> Subject: Re: [BUG v6.2.7] Hitting BUG_ON() on rtw89 wireless driver startup
>
> Ping-Ke,
>
> The patch works fine here, but I did not have the problem.
>
> When you submit it, add a Tested-by: Larry Finger<[email protected]> and
> a Reviewed-by for the same address.
>
Hi Larry,
Thanks for your test. I have submitted patch [1] with your Tested-by and Reviewed-by.
But, the patch includes additional error handling addressed during internal
review, so there is a little different from the patch I posted here. If you
don't agree current version, please NACK the patch [1].
Thank you.
[1] https://lore.kernel.org/linux-wireless/[email protected]/T/#u
Ping-Ke