2023-04-04 01:23:39

by Ajay Singh

Subject: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

Fix for kernel crash observed with following test procedure [1]:
  while true;
    do ifconfig wlan0 up;
    iw dev wlan0 scan &
    ifconfig wlan0 down;
  done

During the above test procedure, the scan results received from the firmware
for the 'iw scan' command get queued even while the interface is going down.
This caused a kernel oops when the freed pointers were dereferenced.

For synchronization, 'mac_close()' calls flush_workqueue() to block its
execution until all pending work is completed. The 'wilc->close' flag, which
is set before the flush_workqueue() call, should then prevent new work from
being added. Add a 'wilc->close' check in wilc_handle_isr(), which is common
to the SPI and SDIO buses, to ignore firmware interrupts that would in turn
add work while the interface is being closed.

1. https://lore.kernel.org/linux-wireless/[email protected]/

Reported-by: Michael Walle <[email protected]>
Signed-off-by: Ajay Singh <[email protected]>
---
 drivers/net/wireless/microchip/wilc1000/netdev.c | 9 +++------
 drivers/net/wireless/microchip/wilc1000/wlan.c   | 3 +++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/microchip/wilc1000/netdev.c b/drivers/net/wireless/microchip/wilc1000/netdev.c
index e9f59de31b0b..40edee10a81f 100644
--- a/drivers/net/wireless/microchip/wilc1000/netdev.c
+++ b/drivers/net/wireless/microchip/wilc1000/netdev.c
@@ -38,11 +38,6 @@ static irqreturn_t isr_bh_routine(int irq, void *userdata)
 {
 	struct wilc *wilc = userdata;

-	if (wilc->close) {
-		pr_err("Can't handle BH interrupt\n");
-		return IRQ_HANDLED;
-	}
-
 	wilc_handle_isr(wilc);

 	return IRQ_HANDLED;
@@ -781,13 +776,15 @@ static int wilc_mac_close(struct net_device *ndev)
 	if (vif->ndev) {
 		netif_stop_queue(vif->ndev);

+		if (wl->open_ifcs == 0)
+			wl->close = 1;
+
 		wilc_handle_disconnect(vif);
 		wilc_deinit_host_int(vif->ndev);
 	}

 	if (wl->open_ifcs == 0) {
 		netdev_dbg(ndev, "Deinitializing wilc1000\n");
-		wl->close = 1;
 		wilc_wlan_deinitialize(ndev);
 	}

diff --git a/drivers/net/wireless/microchip/wilc1000/wlan.c b/drivers/net/wireless/microchip/wilc1000/wlan.c
index 58bbf50081e4..700cb657be00 100644
--- a/drivers/net/wireless/microchip/wilc1000/wlan.c
+++ b/drivers/net/wireless/microchip/wilc1000/wlan.c
@@ -1066,6 +1066,9 @@ void wilc_handle_isr(struct wilc *wilc)
 {
 	u32 int_status;

+	if (wilc->close)
+		return;
+
 	acquire_bus(wilc, WILC_BUS_ACQUIRE_AND_WAKEUP);
 	wilc->hif_func->hif_read_int(wilc, &int_status);

--
2.34.1


2023-04-05 11:43:14

by Michael Walle

Subject: Re: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

Hi,

[+ wireless and cfg80211 maintainers because I'm not familiar with
cfg80211 ]

> Fix for kernel crash observed with following test procedure [1]:
> while true;
> do ifconfig wlan0 up;
> iw dev wlan0 scan &
> ifconfig wlan0 down;
> done
>
> During the above test procedure, the scan results are received from firmware
> for 'iw scan' command gets queued even when the interface is going down. It
> was causing the kernel oops when dereferencing the freed pointers.
>
> For synchronization, 'mac_close()' calls flush_workqueue() to block its
> execution till all pending work is completed. Afterwards 'wilc->close' flag
> which is set before the flush_workqueue() should avoid adding new work.
> Added 'wilc->close' check in wilc_handle_isr() which is common for
> SPI/SDIO bus to ignore the interrupts from firmware that inturns adds the
> work since the interface is getting closed.

With this patch I'm now getting
wilc1000_sdio mmc0:0001:1 wlan0: Failed to send setup multicast

when you close the interface.

>
> 1. https://lore.kernel.org/linux-wireless/[email protected]/

should be Link:

> Reported-by: Michael Walle <[email protected]>
> Signed-off-by: Ajay Singh <[email protected]>

Missing Fixes: tag. In this regard, most of the previous wilc fix patches
lack a proper Fixes tag, which makes the wilc1000 pretty unusable on stable
kernels IMHO :/

> ---
> drivers/net/wireless/microchip/wilc1000/netdev.c | 9 +++------
> drivers/net/wireless/microchip/wilc1000/wlan.c | 3 +++
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/wireless/microchip/wilc1000/netdev.c b/drivers/net/wireless/microchip/wilc1000/netdev.c
> index e9f59de31b0b..40edee10a81f 100644
> --- a/drivers/net/wireless/microchip/wilc1000/netdev.c
> +++ b/drivers/net/wireless/microchip/wilc1000/netdev.c
> @@ -38,11 +38,6 @@ static irqreturn_t isr_bh_routine(int irq, void *userdata)
>  {
>  	struct wilc *wilc = userdata;
>
> -	if (wilc->close) {
> -		pr_err("Can't handle BH interrupt\n");
> -		return IRQ_HANDLED;
> -	}
> -

This check is still in the top half of the interrupt processing.
Shouldn't it be removed there, too? That way you can get rid of
the top half entirely and just let the irq subsys use the default
top half implementation.

>  	wilc_handle_isr(wilc);
>
>  	return IRQ_HANDLED;
> @@ -781,13 +776,15 @@ static int wilc_mac_close(struct net_device *ndev)
>  	if (vif->ndev) {
>  		netif_stop_queue(vif->ndev);
>
> +		if (wl->open_ifcs == 0)
> +			wl->close = 1;

Ignoring the fact that this isn't protected somehow and that
there is no write barrier (maybe I'm overthinking this and
it isn't really needed for an 'int' field), this and your
reasoning with the flush_workqueue() sounds legit.

But I'm still not convinced a lock is not required.
wilc_user_scan_req::scan_result is at least updated in
wilc_disconnect() and wilc_deinit().

wilc_disconnect() is called from the cfg80211_ops::disconnect
callback. wilc_deinit() is called from net_device_ops::ndo_stop.
Is there any lock which prevents both functions from being called in
parallel? wl->close is checked in the .disconnect op, but as
mentioned above, it is not protected by any lock.
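
To illustrate the concern, here is a minimal userspace sketch (plain C
with pthreads, not driver code) of two teardown paths that both want to
complete a pending scan. The names scan_req_model, scan_req_init(),
scan_req_finish() and count_result() are hypothetical and only model
the situation; the assumption is that both paths clear the scan_result
callback. Taking the pointer under a lock means that only one caller
can ever observe it as non-NULL:

```c
#include <pthread.h>
#include <stddef.h>

typedef void (*scan_result_fn)(void *priv);

/* Hypothetical model of the scan request state; not the driver's layout. */
struct scan_req_model {
	pthread_mutex_t lock;
	scan_result_fn scan_result;
	void *priv;
};

/* Example callback: counts how often the scan was completed. */
void count_result(void *priv)
{
	++*(int *)priv;
}

void scan_req_init(struct scan_req_model *req, scan_result_fn fn, void *priv)
{
	pthread_mutex_init(&req->lock, NULL);
	req->scan_result = fn;
	req->priv = priv;
}

/*
 * Called from either teardown path (modelling wilc_disconnect() and
 * wilc_deinit()). The callback pointer is fetched and cleared under the
 * lock, so a second caller finds NULL and does nothing.
 * Returns 1 if this caller ran the callback, 0 otherwise.
 */
int scan_req_finish(struct scan_req_model *req)
{
	scan_result_fn fn;
	void *priv;

	pthread_mutex_lock(&req->lock);
	fn = req->scan_result;
	priv = req->priv;
	req->scan_result = NULL;
	pthread_mutex_unlock(&req->lock);

	if (!fn)
		return 0;
	fn(priv);
	return 1;
}
```

Whichever path calls scan_req_finish() first runs the callback; the
other one sees NULL and returns 0. Without the lock, both paths could
read the same non-NULL pointer and complete the scan twice.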

-michael

> +
>  		wilc_handle_disconnect(vif);
>  		wilc_deinit_host_int(vif->ndev);
>  	}
>
>  	if (wl->open_ifcs == 0) {
>  		netdev_dbg(ndev, "Deinitializing wilc1000\n");
> -		wl->close = 1;
>  		wilc_wlan_deinitialize(ndev);
>  	}
>
> diff --git a/drivers/net/wireless/microchip/wilc1000/wlan.c b/drivers/net/wireless/microchip/wilc1000/wlan.c
> index 58bbf50081e4..700cb657be00 100644
> --- a/drivers/net/wireless/microchip/wilc1000/wlan.c
> +++ b/drivers/net/wireless/microchip/wilc1000/wlan.c
> @@ -1066,6 +1066,9 @@ void wilc_handle_isr(struct wilc *wilc)
>  {
>  	u32 int_status;
>
> +	if (wilc->close)
> +		return;
> +
>  	acquire_bus(wilc, WILC_BUS_ACQUIRE_AND_WAKEUP);
>  	wilc->hif_func->hif_read_int(wilc, &int_status);
>
> --
> 2.34.1

2023-04-11 11:33:49

by Johannes Berg

Subject: Re: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

On Wed, 2023-04-05 at 13:40 +0200, Michael Walle wrote:
>
> wilc_disconnect() is called from the cfg80211_ops::disconnect
> callback. wilc_deinit() is called from net_device_ops::ndo_stop.
> Is there any lock which prevents both functions be called in
> parallel?

I don't _think_ there's any common lock, ndo_stop() holds the RTNL, but
cfg80211 for a normal nl80211 disconnect command will only briefly hold
the RTNL and drop it again before calling into the driver.

The internal flags here don't indicate requiring RTNL and that wouldn't
make much sense either:

	{
		.cmd = NL80211_CMD_DISCONNECT,
		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
		.doit = nl80211_disconnect,
		.flags = GENL_UNS_ADMIN_PERM,
		.internal_flags = IFLAGS(NL80211_FLAG_NEED_NETDEV_UP),
	},


See commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling
the driver").


johannes

2023-04-12 00:15:54

by Ajay Singh

Subject: Re: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

Hi Michael,

On 4/5/23 04:40, Michael Walle wrote:
>
> Hi,
>
> [+ wireless and cfg80211 maintainers because I'm not familiar with
> cfg80211 ]
>
>> Fix for kernel crash observed with following test procedure [1]:
>>   while true;
>>     do ifconfig wlan0 up;
>>     iw dev wlan0 scan &
>>     ifconfig wlan0 down;
>>   done
>>
>> During the above test procedure, the scan results are received from firmware
>> for 'iw scan' command gets queued even when the interface is going down. It
>> was causing the kernel oops when dereferencing the freed pointers.
>>
>> For synchronization, 'mac_close()' calls flush_workqueue() to block its
>> execution till all pending work is completed. Afterwards 'wilc->close' flag
>> which is set before the flush_workqueue() should avoid adding new work.
>> Added 'wilc->close' check in wilc_handle_isr() which is common for
>> SPI/SDIO bus to ignore the interrupts from firmware that inturns adds the
>> work since the interface is getting closed.
>
> With this patch I'm now getting
>    wilc1000_sdio mmc0:0001:1 wlan0: Failed to send setup multicast
>
> when you close the interface.
>

This is a false alarm. I will modify the patch to suppress this debug
message while mac_close() is in progress.

>>
>> 1. https://lore.kernel.org/linux-wireless/[email protected]/
>
> should be Link:

Okay

>
>> Reported-by: Michael Walle <[email protected]>
>> Signed-off-by: Ajay Singh <[email protected]>
>
> Missing Fixes: tag. In this regard, most of the previous wilc fixes patches
> miss a proper Fixes tag which makes the wilc1000 pretty unusable on stable
> kernels IMHO :/
>

Sure. I will include the Fixes tag in the updated version.

>> ---
>>  drivers/net/wireless/microchip/wilc1000/netdev.c | 9 +++------
>>  drivers/net/wireless/microchip/wilc1000/wlan.c   | 3 +++
>>  2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/wireless/microchip/wilc1000/netdev.c b/drivers/net/wireless/microchip/wilc1000/netdev.c
>> index e9f59de31b0b..40edee10a81f 100644
>> --- a/drivers/net/wireless/microchip/wilc1000/netdev.c
>> +++ b/drivers/net/wireless/microchip/wilc1000/netdev.c
>> @@ -38,11 +38,6 @@ static irqreturn_t isr_bh_routine(int irq, void *userdata)
>>  {
>>       struct wilc *wilc = userdata;
>>
>> -     if (wilc->close) {
>> -             pr_err("Can't handle BH interrupt\n");
>> -             return IRQ_HANDLED;
>> -     }
>> -
>
> This check is still in the top half of the interrupt processing.
> Shouldn't it be removed there, too? That way you can get rid of
> the top half entirely and just let the irq subsys use the default
> top half implementation.
>

Yeah, it makes sense. I will include this change in the updated version.

>>       wilc_handle_isr(wilc);
>>
>>       return IRQ_HANDLED;
>> @@ -781,13 +776,15 @@ static int wilc_mac_close(struct net_device *ndev)
>>       if (vif->ndev) {
>>               netif_stop_queue(vif->ndev);
>>
>> +             if (wl->open_ifcs == 0)
>> +                     wl->close = 1;
>
> Ignoring the fact that this isn't protected somehow and that
> there is no write barrier (maybe I'm overthinking this and
> it isn't really needed for an 'int' field), this and your
> reasoning with the flush_workqueue() sounds legit.
>
> But I'm still not convinced a lock is not required.
> wilc_user_scan_req::scan_result is at least updated in
> wilc_disconnect() and wilc_deinit().
>
> wilc_disconnect() is called from the cfg80211_ops::disconnect
> callback. wilc_deinit() is called from net_device_ops::ndo_stop.
> Is there any lock which prevents both functions be called in
> parallel? wl->close is checked in the .disconnect op, but as
> mentioned above, it is not protected by any lock.

Sure, I will prepare a separate patch to handle this.

>
> -michael
>
>> +
>>               wilc_handle_disconnect(vif);
>>               wilc_deinit_host_int(vif->ndev);
>>       }
>>
>>       if (wl->open_ifcs == 0) {
>>               netdev_dbg(ndev, "Deinitializing wilc1000\n");
>> -             wl->close = 1;
>>               wilc_wlan_deinitialize(ndev);
>>       }
>>
>> diff --git a/drivers/net/wireless/microchip/wilc1000/wlan.c b/drivers/net/wireless/microchip/wilc1000/wlan.c
>> index 58bbf50081e4..700cb657be00 100644
>> --- a/drivers/net/wireless/microchip/wilc1000/wlan.c
>> +++ b/drivers/net/wireless/microchip/wilc1000/wlan.c
>> @@ -1066,6 +1066,9 @@ void wilc_handle_isr(struct wilc *wilc)
>>  {
>>       u32 int_status;
>>
>> +     if (wilc->close)
>> +             return;
>> +
>>       acquire_bus(wilc, WILC_BUS_ACQUIRE_AND_WAKEUP);
>>       wilc->hif_func->hif_read_int(wilc, &int_status);
>>
>> --
>> 2.34.1

2023-05-05 15:52:43

by Kalle Valo

Subject: Re: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

<[email protected]> writes:

> Fix for kernel crash observed with following test procedure [1]:
> while true;
> do ifconfig wlan0 up;
> iw dev wlan0 scan &
> ifconfig wlan0 down;
> done
>
> During the above test procedure, the scan results are received from firmware
> for 'iw scan' command gets queued even when the interface is going down. It
> was causing the kernel oops when dereferencing the freed pointers.
>
> For synchronization, 'mac_close()' calls flush_workqueue() to block its
> execution till all pending work is completed. Afterwards 'wilc->close' flag
> which is set before the flush_workqueue() should avoid adding new work.
> Added 'wilc->close' check in wilc_handle_isr() which is common for
> SPI/SDIO bus to ignore the interrupts from firmware that inturns adds the
> work since the interface is getting closed.
>
> 1. https://lore.kernel.org/linux-wireless/[email protected]/
>
> Reported-by: Michael Walle <[email protected]>
> Signed-off-by: Ajay Singh <[email protected]>

[...]

> @@ -781,13 +776,15 @@ static int wilc_mac_close(struct net_device *ndev)
>  	if (vif->ndev) {
>  		netif_stop_queue(vif->ndev);
>
> +		if (wl->open_ifcs == 0)
> +			wl->close = 1;
> +

wl->close is an int; I wonder if it's racy to use an int as a flag like
this? In cases like this I usually use set_bit() & co because those
guarantee atomicity, though I don't know if that's overkill.
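
As a rough illustration of the idea, the following userspace sketch
uses C11 atomics to mimic what the kernel's set_bit(), clear_bit() and
test_bit() provide for a flags word. It is only an analogue and not the
kernel API: the model_* helpers, struct wilc_flags_model and the bit
name WILC_MODEL_FLAG_CLOSE are hypothetical names made up for this
example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number; the driver does not define this name. */
enum { WILC_MODEL_FLAG_CLOSE = 0 };

/* Userspace stand-in for a driver-wide flags word. */
struct wilc_flags_model {
	atomic_ulong flags;
};

/* Analogue of set_bit(): atomically OR the bit into the word. */
void model_set_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}

/* Analogue of clear_bit(): atomically clear the bit. */
void model_clear_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

/* Analogue of test_bit(): atomically load the word and test the bit. */
bool model_test_bit(int nr, atomic_ulong *addr)
{
	return (atomic_load(addr) >> nr) & 1UL;
}
```

The close path would call model_set_bit(WILC_MODEL_FLAG_CLOSE, ...) and
the interrupt path would check model_test_bit() before doing any work.
Each access is a single atomic operation on the word, so other bits in
the same word can be updated concurrently without being lost.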

--
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2023-05-05 20:53:36

by Ajay Singh

Subject: Re: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

Hi Kalle,

On 5/5/23 08:47, Kalle Valo wrote:
>
> <[email protected]> writes:
>
>> Fix for kernel crash observed with following test procedure [1]:
>> while true;
>> do ifconfig wlan0 up;
>> iw dev wlan0 scan &
>> ifconfig wlan0 down;
>> done
>>
>> During the above test procedure, the scan results are received from firmware
>> for 'iw scan' command gets queued even when the interface is going down. It
>> was causing the kernel oops when dereferencing the freed pointers.
>>
>> For synchronization, 'mac_close()' calls flush_workqueue() to block its
>> execution till all pending work is completed. Afterwards 'wilc->close' flag
>> which is set before the flush_workqueue() should avoid adding new work.
>> Added 'wilc->close' check in wilc_handle_isr() which is common for
>> SPI/SDIO bus to ignore the interrupts from firmware that inturns adds the
>> work since the interface is getting closed.
>>
>> 1. https://lore.kernel.org/linux-wireless/[email protected]/
>>
>> Reported-by: Michael Walle <[email protected]>
>> Signed-off-by: Ajay Singh <[email protected]>
>
> [...]
>
>> @@ -781,13 +776,15 @@ static int wilc_mac_close(struct net_device *ndev)
>>  	if (vif->ndev) {
>>  		netif_stop_queue(vif->ndev);
>>
>> +		if (wl->open_ifcs == 0)
>> +			wl->close = 1;
>> +
>
> wl-close is an int, I wonder if it's racy to int as a flag like this? In
> cases like this I usually use set_bit() & co because those guarantee
> atomicity, though don't know if that's overkill.
>

I think it's a good idea to use an atomic operation but I am not sure if
using atomic for 'wl->close' will have much impact. For instance, if any
new work gets added to the workqueue before the 'wl->close=1' is fully
completed, then that work would get executed as normal.
However, I feel it's safe to define 'wl->close' as atomic_t type. I will
prepare the conversion patch and will try to include it along with the
updated version of this patch.
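
The ordering argument above can be sketched with a small single-threaded
userspace model (not driver code). All names here are hypothetical:
struct close_model, model_handle_isr(), model_flush() and
model_mac_close() only stand in for the flag check in
wilc_handle_isr(), for flush_workqueue() and for the close path. The
sketch assumes that every accepted interrupt queues exactly one work
item.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
struct close_model {
	atomic_bool close;	/* stands in for wl->close */
	int pending;		/* work items queued but not yet run */
	int executed;		/* work items that have run */
};

/* Models the interrupt path: drop the interrupt once close is set. */
bool model_handle_isr(struct close_model *wl)
{
	if (atomic_load(&wl->close))
		return false;
	wl->pending++;
	return true;
}

/* Models flush_workqueue(): run everything queued so far. */
void model_flush(struct close_model *wl)
{
	wl->executed += wl->pending;
	wl->pending = 0;
}

/* Models the close path: set the flag first, then flush. */
void model_mac_close(struct close_model *wl)
{
	atomic_store(&wl->close, true);
	model_flush(wl);
}
```

Work that was queued before the flag became visible is still executed
by the flush, and any interrupt arriving afterwards is dropped, so
nothing remains queued once the close path returns. The model does not
capture the window between the flag check and the queueing in a real
concurrent interrupt handler.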

Regards,
Ajay

2023-05-06 05:52:23

by Kalle Valo

Subject: Re: [PATCH] wifi: wilc1000: fix kernel oops during interface down during background scan

<[email protected]> writes:

>>> @@ -781,13 +776,15 @@ static int wilc_mac_close(struct net_device *ndev)
>>>  	if (vif->ndev) {
>>>  		netif_stop_queue(vif->ndev);
>>>
>>> +		if (wl->open_ifcs == 0)
>>> +			wl->close = 1;
>>> +
>>
>> wl-close is an int, I wonder if it's racy to int as a flag like this? In
>> cases like this I usually use set_bit() & co because those guarantee
>> atomicity, though don't know if that's overkill.
>>
>
> I think it's a good idea to use an atomic operation but I am not sure if
> using atomic for 'wl->close' will have much impact. For instance, if any
> new work gets added to the workqueue before the 'wl->close=1' is fully
> completed, then that work would get executed as normal.

Sure, this is most likely a small race condition. But still a race.

> However, I feel it's safe to define 'wl->close' as atomic_t type. I will
> prepare the conversion patch and will try to include it along with the
> updated version of this patch.

Why atomic_t? You only use values 0 and 1 so test_bit() and set_bit()
sound more appropriate to me.

