2022-10-24 16:44:57

by Michael Walle

[permalink] [raw]
Subject: wilc1000 kernel crash

Hi,

I'm using the WILC1000 wifi chip in SDIO mode and with NetworkManager
which seems to be probing the network in the background. What I am
seeing is a kernel oops while processing the workqueue.

This is on a kernel 5.15.74, but it also happens with the latest next,
but not that often - I guess due to a different timing.

My reduced steps to reproduce are the following:
$ while true; do ifconfig wlan0 up; iw dev wlan0 scan & \
ifconfig wlan0 down; done

After a while I'll get the following splash:

[ 487.955326] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 487.955363] Mem abort info:
[ 487.955366] ESR = 0x96000004
[ 487.955370] EC = 0x25: DABT (current EL), IL = 32 bits
[ 487.965939] FW not responding
[ 487.971033] SET = 0, FnV = 0
[ 487.971039] EA = 0, S1PTW = 0
[ 487.971043] FSC = 0x04: level 0 translation fault
[ 487.971047] Data abort info:
[ 487.971050] ISV = 0, ISS = 0x00000004
[ 487.971053] CM = 0, WnR = 0
[ 487.971059] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000497b0000
[ 487.971066] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 487.971085] Internal error: Oops: 96000004 [#1] SMP
[ 487.971094] Modules linked in:
[ 487.971104] CPU: 1 PID: 9 Comm: kworker/u8:0 Not tainted 5.15.74-00013-g2d5897cb12ef #130
[ 487.971113] Hardware name: NXP i.MX8MNano DDR3L EVK board (DT)
[ 487.971122] Workqueue: WILC_wq handle_rcvd_ntwrk_info
[ 488.035377] wilc1000_sdio mmc1:0001:1: chipid (001003a0)
[ 487.971085] Internal error: Oops:
[ 488.041180] 96000004 [#1] SMP
[ 488.041186] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 488.041196] pc : handle_rcvd_ntwrk_info+0x7c/0xc4
[ 488.041208] lr : handle_rcvd_ntwrk_info+0x70/0xc4
[ 488.049128] wilc1000_sdio mmc1:0001:1: has_thrpt_enh3 = 1...
[ 488.057273] sp : ffff80000a20bd70
[ 488.057277] x29: ffff80000a20bd70 x28: 0000000000000000 x27: 0000000000000000
[ 488.057289] x26: ffff000000118470 x25: ffff000005059d05 x24: ffff00000de94d30
[ 488.057299] x23: 0000000000000000
[ 488.062670] wilc1000_sdio mmc1:0001:1 wlan0: ChipID [1003a0] loading firmware [atmel/wilc1000_wifi_firmware-1.bin]
[ 488.070418] x22: ffff000005059d00 x21: 0000000000000000
[ 488.070428] x20: ffff00000de94d00 x19: ffff00000de94d28 x18: 0000000000000000
[ 488.070440] x17: 0000000000000000 x16: 0000000000000000 x15: a4270000a4030001
[ 488.070450] x14: 010102f2500018dd x13: 0018dd0000010002 x12: 0546000000000000
[ 488.070461] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800008ad92a0
[ 488.150644] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000063e88c0
[ 488.157799] x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000000
[ 488.164947] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000001
[ 488.172095] Call trace:
[ 488.174548] handle_rcvd_ntwrk_info+0x7c/0xc4
[ 488.175400] FW not responding
[ 488.178927] process_one_work+0x1ec/0x48c
[ 488.178941] worker_thread+0x170/0x564
[ 488.178948] kthread+0x128/0x13c
[ 488.178959] ret_from_fork+0x10/0x20
[ 488.185280] FW not responding
[ 488.185958] Code: 9415ea8e b4000060 39400401 35000201 (f94002a3)
[ 488.192042] FW not responding
[ 488.192957] ---[ end trace fa915dc840cf0355 ]---
[ 488.199700] FW not responding
[ 488.205601] Kernel panic - not syncing: Oops: Fatal exception
[ 488.205608] SMP: stopping secondary CPUs
[ 488.205630] Kernel Offset: disabled
[ 488.205634] CPU features: 0x00002001,20000846
[ 488.205642] Memory Limit: none

In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
although it doesn't contain NULL. Thus the driver is calling into a
bogus function pointer. There seems to be no locking between the
asynchronous calls within the workqueue (wilc_enqueue_work()) and when
the interface is disabled (wilc_deinit()). wilc_deinit() will free the
host_if_drv object which might still be used within the workqueue
context.

BTW, ignore the "FW not repsonding" for now, that seems to be a
different problem.

-michael


2022-10-25 20:40:06

by Ajay Singh

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

Hi Michael,

On 24/10/22 19:24, Michael Walle wrote:
> Hi,
>
> I'm using the WILC1000 wifi chip in SDIO mode and with NetworkManager
> which seems to be probing the network in the background. What I am
> seeing is a kernel oops while processing the workqueue.
>
> This is on a kernel 5.15.74, but it also happens with the latest next,
> but not that often - I guess due to a different timing.
>
> My reduced steps to reproduce are the following:
> $ while true; do ifconfig wlan0 up; iw dev wlan0 scan & \
> ifconfig wlan0 down; done

It looks like a timing issue. During the execution of above steps, I
observed that most of the time, the interface down command is executed
before the scan command. A few times, the scan command is executed
between interface up and down.

>
> After a while I'll get the following splash:
>
> [ 487.955326] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [ 487.955363] Mem abort info:
> [ 487.955366] ESR = 0x96000004
> [ 487.955370] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 487.965939] FW not responding
> [ 487.971033] SET = 0, FnV = 0
> [ 487.971039] EA = 0, S1PTW = 0
> [ 487.971043] FSC = 0x04: level 0 translation fault
> [ 487.971047] Data abort info:
> [ 487.971050] ISV = 0, ISS = 0x00000004
> [ 487.971053] CM = 0, WnR = 0
> [ 487.971059] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000497b0000
> [ 487.971066] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
> [ 487.971085] Internal error: Oops: 96000004 [#1] SMP
> [ 487.971094] Modules linked in:
> [ 487.971104] CPU: 1 PID: 9 Comm: kworker/u8:0 Not tainted 5.15.74-00013-g2d5897cb12ef #130
> [ 487.971113] Hardware name: NXP i.MX8MNano DDR3L EVK board (DT)
> [ 487.971122] Workqueue: WILC_wq handle_rcvd_ntwrk_info
> [ 488.035377] wilc1000_sdio mmc1:0001:1: chipid (001003a0)
> [ 487.971085] Internal error: Oops:
> [ 488.041180] 96000004 [#1] SMP
> [ 488.041186] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 488.041196] pc : handle_rcvd_ntwrk_info+0x7c/0xc4
> [ 488.041208] lr : handle_rcvd_ntwrk_info+0x70/0xc4
> [ 488.049128] wilc1000_sdio mmc1:0001:1: has_thrpt_enh3 = 1...
> [ 488.057273] sp : ffff80000a20bd70
> [ 488.057277] x29: ffff80000a20bd70 x28: 0000000000000000 x27: 0000000000000000
> [ 488.057289] x26: ffff000000118470 x25: ffff000005059d05 x24: ffff00000de94d30
> [ 488.057299] x23: 0000000000000000
> [ 488.062670] wilc1000_sdio mmc1:0001:1 wlan0: ChipID [1003a0] loading firmware [atmel/wilc1000_wifi_firmware-1.bin]
> [ 488.070418] x22: ffff000005059d00 x21: 0000000000000000
> [ 488.070428] x20: ffff00000de94d00 x19: ffff00000de94d28 x18: 0000000000000000
> [ 488.070440] x17: 0000000000000000 x16: 0000000000000000 x15: a4270000a4030001
> [ 488.070450] x14: 010102f2500018dd x13: 0018dd0000010002 x12: 0546000000000000
> [ 488.070461] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800008ad92a0
> [ 488.150644] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000063e88c0
> [ 488.157799] x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000000
> [ 488.164947] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000001
> [ 488.172095] Call trace:
> [ 488.174548] handle_rcvd_ntwrk_info+0x7c/0xc4
> [ 488.175400] FW not responding
> [ 488.178927] process_one_work+0x1ec/0x48c
> [ 488.178941] worker_thread+0x170/0x564
> [ 488.178948] kthread+0x128/0x13c
> [ 488.178959] ret_from_fork+0x10/0x20
> [ 488.185280] FW not responding
> [ 488.185958] Code: 9415ea8e b4000060 39400401 35000201 (f94002a3)
> [ 488.192042] FW not responding
> [ 488.192957] ---[ end trace fa915dc840cf0355 ]---
> [ 488.199700] FW not responding
> [ 488.205601] Kernel panic - not syncing: Oops: Fatal exception
> [ 488.205608] SMP: stopping secondary CPUs
> [ 488.205630] Kernel Offset: disabled
> [ 488.205634] CPU features: 0x00002001,20000846
> [ 488.205642] Memory Limit: none
>
> In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
> although it doesn't contain NULL. Thus the driver is calling into a
> bogus function pointer. There seems to be no locking between the
> asynchronous calls within the workqueue (wilc_enqueue_work()) and when
> the interface is disabled (wilc_deinit()). wilc_deinit() will free the
> host_if_drv object which might still be used within the workqueue
> context.


Please try the below code changes with your test setup environment.


--- a/drivers/net/wireless/microchip/wilc1000/hif.c
+++ b/drivers/net/wireless/microchip/wilc1000/hif.c
@@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct
work_struct *work)
 {
        struct host_if_msg *msg = container_of(work, struct
host_if_msg, work);
        struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
-       struct wilc_user_scan_req *scan_req =
&msg->vif->hif_drv->usr_scan_req;
+       struct host_if_drv *hif_drv = msg->vif->hif_drv;
+       struct wilc_user_scan_req *scan_req;
        const u8 *ch_elm;
        u8 *ies;
        int ies_len;
        size_t offset;

+       if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
+               goto done;
+
+       scan_req = &hif_drv->usr_scan_req;
+
        if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
                offset = offsetof(struct ieee80211_mgmt,
u.probe_resp.variable);
        else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
@@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc,
u8 *buffer, u32 length)
                return;
        }

+       if (!hif_drv->usr_scan_req.scan_result)
+               return;
+
        msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
        if (IS_ERR(msg))
                return;

The above changes should avoid the kernel crash exception.


> BTW, ignore the "FW not repsonding" for now, that seems to be a
> different problem.

"FW not responding" log indicates the chip sleep command failure from
Host to the FW. It's a temporary failure log for specific command.
During the de-init process, this logs is often observed. IIRC, there was
a change in the latest driver that reduced its frequency but I am unable
to recall the exact change.

Regards,
Ajay

2022-10-26 09:05:21

by Michael Walle

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

Hi Ajay,

On 22/10/25 08:26, [email protected] wrote:
> > In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
> > although it doesn't contain NULL. Thus the driver is calling into a
> > bogus function pointer. There seems to be no locking between the
> > asynchronous calls within the workqueue (wilc_enqueue_work()) and when
> > the interface is disabled (wilc_deinit()). wilc_deinit() will free the
> > host_if_drv object which might still be used within the workqueue
> > context.
>
>
> Please try the below code changes with your test setup environment.

The workqueue handling and wilc_deinit() run in parallel, correct? ..

> --- a/drivers/net/wireless/microchip/wilc1000/hif.c
> +++ b/drivers/net/wireless/microchip/wilc1000/hif.c
> @@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct
> work_struct *work)
> ?{
> ??????? struct host_if_msg *msg = container_of(work, struct
> host_if_msg, work);
> ??????? struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
> -?????? struct wilc_user_scan_req *scan_req =
> &msg->vif->hif_drv->usr_scan_req;
> +?????? struct host_if_drv *hif_drv = msg->vif->hif_drv;
> +?????? struct wilc_user_scan_req *scan_req;
> ??????? const u8 *ch_elm;
> ??????? u8 *ies;
> ??????? int ies_len;
> ??????? size_t offset;
>
> +?????? if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
> +?????????????? goto done;

.. So what if hif_drv will be set to NULL right after this check?

I don't think you'll get around using proper locking here.

> +
> +?????? scan_req = &hif_drv->usr_scan_req;
> +
> ??????? if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
> ??????????????? offset = offsetof(struct ieee80211_mgmt,
> u.probe_resp.variable);
> ??????? else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
> @@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc,
> u8 *buffer, u32 length)
> ??????????????? return;
> ??????? }
>
> +?????? if (!hif_drv->usr_scan_req.scan_result)
> +?????????????? return;
> +

This is also racy. What if scan_result is cleared right after this
check? Then the work item will still get added to the work queue.

Here is the call tree:

isr_bh_routine() [interrupt thread ctx]
wilc_handle_isr()
mutex_lock(hif_cs)
wilc_wlan_handle_isr_ext()
wilc_wlan_handle_rxq()
wilc_wlan_handle_rx_buff()
wilc_wlan_cfg_indicate_rx()
wilc_network_info_received()
wilc_enqueue_work(handle_rcvd_ntwrk_info)
mutex_unlock(hif_cs)

handle_rcvd_ntwrk_info [hif_workqueue ctx]
if (scan_result)
scan_result()

wilc_mac_close() [ioctl ctx?]
wilc_deinit_host_int()
wilc_deinit()
mutex_lock(deinit_lock)
if (scan_result)
scan_result()
scan_result = NULL
kfree(hif_drv)
hif_drv = NULL
mutex_unlock(deinit_lock)

I don't see any synchronization mechanisms, between these threads.

> ??????? msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
> ??????? if (IS_ERR(msg))
> ??????????????? return;
>
> The above changes should avoid the kernel crash exception.

As mentioned above, I think this will just decrease the chance that
it is happening. Nonetheless, I've tried your changes but it doesn't
fix the crash.

> > BTW, ignore the "FW not repsonding" for now, that seems to be a
> > different problem.
>
> "FW not responding" log indicates the chip sleep command failure from
> Host to the FW. It's a temporary failure log for specific command.
> During the de-init process, this logs is often observed. IIRC, there was
> a change in the latest driver that reduced its frequency but I am unable
> to recall the exact change.

So what is a "temporary failure"? It is a pr_warn() and customers
(rightfully) complains about them. Apart from that, it only happens on
the second "ifconfig wlan0 up". There are no messages during the first
one. So I suspect there is still something fishy.

-michael

2022-12-09 12:12:18

by Michael Walle

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

On 22/10/26 10:54, Michael Walle wrote:
> Hi Ajay,
>
> On 22/10/25 08:26, [email protected] wrote:
> > > In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
> > > although it doesn't contain NULL. Thus the driver is calling into a
> > > bogus function pointer. There seems to be no locking between the
> > > asynchronous calls within the workqueue (wilc_enqueue_work()) and when
> > > the interface is disabled (wilc_deinit()). wilc_deinit() will free the
> > > host_if_drv object which might still be used within the workqueue
> > > context.
> >
> >
> > Please try the below code changes with your test setup environment.
>
> The workqueue handling and wilc_deinit() run in parallel, correct? ..
>
> > --- a/drivers/net/wireless/microchip/wilc1000/hif.c
> > +++ b/drivers/net/wireless/microchip/wilc1000/hif.c
> > @@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct
> > work_struct *work)
> > ?{
> > ??????? struct host_if_msg *msg = container_of(work, struct
> > host_if_msg, work);
> > ??????? struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
> > -?????? struct wilc_user_scan_req *scan_req =
> > &msg->vif->hif_drv->usr_scan_req;
> > +?????? struct host_if_drv *hif_drv = msg->vif->hif_drv;
> > +?????? struct wilc_user_scan_req *scan_req;
> > ??????? const u8 *ch_elm;
> > ??????? u8 *ies;
> > ??????? int ies_len;
> > ??????? size_t offset;
> >
> > +?????? if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
> > +?????????????? goto done;
>
> .. So what if hif_drv will be set to NULL right after this check?
>
> I don't think you'll get around using proper locking here.
>
> > +
> > +?????? scan_req = &hif_drv->usr_scan_req;
> > +
> > ??????? if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
> > ??????????????? offset = offsetof(struct ieee80211_mgmt,
> > u.probe_resp.variable);
> > ??????? else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
> > @@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc,
> > u8 *buffer, u32 length)
> > ??????????????? return;
> > ??????? }
> >
> > +?????? if (!hif_drv->usr_scan_req.scan_result)
> > +?????????????? return;
> > +
>
> This is also racy. What if scan_result is cleared right after this
> check? Then the work item will still get added to the work queue.
>
> Here is the call tree:
>
> isr_bh_routine() [interrupt thread ctx]
> wilc_handle_isr()
> mutex_lock(hif_cs)
> wilc_wlan_handle_isr_ext()
> wilc_wlan_handle_rxq()
> wilc_wlan_handle_rx_buff()
> wilc_wlan_cfg_indicate_rx()
> wilc_network_info_received()
> wilc_enqueue_work(handle_rcvd_ntwrk_info)
> mutex_unlock(hif_cs)
>
> handle_rcvd_ntwrk_info [hif_workqueue ctx]
> if (scan_result)
> scan_result()
>
> wilc_mac_close() [ioctl ctx?]
> wilc_deinit_host_int()
> wilc_deinit()
> mutex_lock(deinit_lock)
> if (scan_result)
> scan_result()
> scan_result = NULL
> kfree(hif_drv)
> hif_drv = NULL
> mutex_unlock(deinit_lock)
>
> I don't see any synchronization mechanisms, between these threads.
>
> > ??????? msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
> > ??????? if (IS_ERR(msg))
> > ??????????????? return;
> >
> > The above changes should avoid the kernel crash exception.
>
> As mentioned above, I think this will just decrease the chance that
> it is happening. Nonetheless, I've tried your changes but it doesn't
> fix the crash.

Any news here?

Thanks,
-michael

> > > BTW, ignore the "FW not repsonding" for now, that seems to be a
> > > different problem.
> >
> > "FW not responding" log indicates the chip sleep command failure from
> > Host to the FW. It's a temporary failure log for specific command.
> > During the de-init process, this logs is often observed. IIRC, there was
> > a change in the latest driver that reduced its frequency but I am unable
> > to recall the exact change.
>
> So what is a "temporary failure"? It is a pr_warn() and customers
> (rightfully) complains about them. Apart from that, it only happens on
> the second "ifconfig wlan0 up". There are no messages during the first
> one. So I suspect there is still something fishy.
>
> -michael

2022-12-09 14:19:11

by Ajay Singh

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

On 09/12/22 17:33, Michael Walle wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>
> On 22/10/26 10:54, Michael Walle wrote:
>> Hi Ajay,
>>
>> On 22/10/25 08:26, [email protected] wrote:
>>>> In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
>>>> although it doesn't contain NULL. Thus the driver is calling into a
>>>> bogus function pointer. There seems to be no locking between the
>>>> asynchronous calls within the workqueue (wilc_enqueue_work()) and when
>>>> the interface is disabled (wilc_deinit()). wilc_deinit() will free the
>>>> host_if_drv object which might still be used within the workqueue
>>>> context.
>>>
>>> Please try the below code changes with your test setup environment.
>> The workqueue handling and wilc_deinit() run in parallel, correct? ..
>>
>>> --- a/drivers/net/wireless/microchip/wilc1000/hif.c
>>> +++ b/drivers/net/wireless/microchip/wilc1000/hif.c
>>> @@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct
>>> work_struct *work)
>>> {
>>> struct host_if_msg *msg = container_of(work, struct
>>> host_if_msg, work);
>>> struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
>>> - struct wilc_user_scan_req *scan_req =
>>> &msg->vif->hif_drv->usr_scan_req;
>>> + struct host_if_drv *hif_drv = msg->vif->hif_drv;
>>> + struct wilc_user_scan_req *scan_req;
>>> const u8 *ch_elm;
>>> u8 *ies;
>>> int ies_len;
>>> size_t offset;
>>>
>>> + if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
>>> + goto done;
>> .. So what if hif_drv will be set to NULL right after this check?
>>
>> I don't think you'll get around using proper locking here.
>>
>>> +
>>> + scan_req = &hif_drv->usr_scan_req;
>>> +
>>> if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
>>> offset = offsetof(struct ieee80211_mgmt,
>>> u.probe_resp.variable);
>>> else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
>>> @@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc,
>>> u8 *buffer, u32 length)
>>> return;
>>> }
>>>
>>> + if (!hif_drv->usr_scan_req.scan_result)
>>> + return;
>>> +
>> This is also racy. What if scan_result is cleared right after this
>> check? Then the work item will still get added to the work queue.
>>
>> Here is the call tree:
>>
>> isr_bh_routine() [interrupt thread ctx]
>> wilc_handle_isr()
>> mutex_lock(hif_cs)
>> wilc_wlan_handle_isr_ext()
>> wilc_wlan_handle_rxq()
>> wilc_wlan_handle_rx_buff()
>> wilc_wlan_cfg_indicate_rx()
>> wilc_network_info_received()
>> wilc_enqueue_work(handle_rcvd_ntwrk_info)
>> mutex_unlock(hif_cs)
>>
>> handle_rcvd_ntwrk_info [hif_workqueue ctx]
>> if (scan_result)
>> scan_result()
>>
>> wilc_mac_close() [ioctl ctx?]
>> wilc_deinit_host_int()
>> wilc_deinit()
>> mutex_lock(deinit_lock)
>> if (scan_result)
>> scan_result()
>> scan_result = NULL
>> kfree(hif_drv)
>> hif_drv = NULL
>> mutex_unlock(deinit_lock)
>>
>> I don't see any synchronization mechanisms, between these threads.
>>
>>> msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
>>> if (IS_ERR(msg))
>>> return;
>>>
>>> The above changes should avoid the kernel crash exception.
>> As mentioned above, I think this will just decrease the chance that
>> it is happening. Nonetheless, I've tried your changes but it doesn't
>> fix the crash.
> Any news here?


Hi Michael,


No progress yet. I tried to simulate the condition a few times but was
unable to see the exact failure in my setup so I need to try more. For
the other "FW not responding" continuous logs, I got some clue.
Probably, will try to send that patch first.

Regards,
Ajay

2022-12-16 10:25:19

by Michael Walle

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

Hi,

On 22/12/09 02:14, [email protected] wrote:
> No progress yet. I tried to simulate the condition a few times but was
> unable to see the exact failure in my setup so I need to try more.

Shouldn't it also be possible to see the issue by code reading? I've
provided the call tree in my previous mail and my concerns regarding
the locking. Either I'm missing something there or there is no
locking between these threads which could cause this issue.

> For the other "FW not responding" continuous logs, I got some clue.
> Probably, will try to send that patch first.

Ok, let me know if you have some patches, I'm happy to test them.

-michael

2023-04-03 14:24:58

by Kirill Buksha

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

On 16.12.22. 11:18, Michael Walle wrote:
> Hi,
>
> On 22/12/09 02:14, [email protected] wrote:
>> No progress yet. I tried to simulate the condition a few times but was
>> unable to see the exact failure in my setup so I need to try more.
> Shouldn't it also be possible to see the issue by code reading? I've
> provided the call tree in my previous mail and my concerns regarding
> the locking. Either I'm missing something there or there is no
> locking between these threads which could cause this issue.
>
>> For the other "FW not responding" continuous logs, I got some clue.
>> Probably, will try to send that patch first.
> Ok, let me know if you have some patches, I'm happy to test them.
>
> -michael
>
>

Hello,

I faced the same kernel oops issue. After analyzing my logs and brief
debugging, I agree with Mikhail: the problem seems to be accessing the
scan_result pointer after it has been nulled.

Regarding the solution: if there is a race between two threads (as
Michael described earlier), then I think that the locking mechanism will
be the most reliable solution. We ran into problems during
deinitialization, but driver contains two more places
(handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
where scan_result is set to NULL.

I use NetworkManager to manage networks and I have experienced the same
failure multiple times when switching from one WiFi network to another.
Keep in mind that switching between networks calls wilc_disconnect() and
wilc_deinit() functions and it is not yet clear which one is causing a
core dump. I think it's worth at least taking a look at these areas of
the code. What do you think?

Best regards,
Kirill Buksha.

2023-04-04 01:52:59

by Ajay Singh

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

On 4/3/23 07:24, Kirill Buksha wrote:
> [Some people who received this message don't often get email from [email protected]. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>
> On 16.12.22. 11:18, Michael Walle wrote:
>> Hi,
>>
>> On 22/12/09 02:14, [email protected] wrote:
>>> No progress yet. I tried to simulate the condition a few times but was
>>> unable to see the exact failure in my setup so I need to try more.
>> Shouldn't it also be possible to see the issue by code reading? I've
>> provided the call tree in my previous mail and my concerns regarding
>> the locking. Either I'm missing something there or there is no
>> locking between these threads which could cause this issue.
>>
>>> For the other "FW not responding" continuous logs, I got some clue.
>>> Probably, will try to send that patch first.
>> Ok, let me know if you have some patches, I'm happy to test them.
>>
>> -michael
>>
>>
>
> Hello,
>
> I faced the same kernel oops issue. After analyzing my logs and brief
> debugging, I agree with Mikhail: the problem seems to be accessing the
> scan_result pointer after it has been nulled.

I have submitted a patch [1] which has fix for scan_result NULL pointer
exception issue. The submitted patch handles the synchronization between
mac_close() and asynchronous interrupts from firmware. Basically, it
takes care of blocking the execution of mac_close() till all pending
works are completed and afterward no new work addition is allowed since
the close is in progress. It is worth to try with that patch once and
check it's behavior.

1.
https://lore.kernel.org/linux-wireless/[email protected]/T/#u

>
> Regarding the solution: if there is a race between two threads (as
> Michael described earlier), then I think that the locking mechanism will
> be the most reliable solution. We ran into problems during
> deinitialization, but driver contains two more places
> (handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
> where scan_result is set to NULL.
>
> I use NetworkManager to manage networks and I have experienced the same
> failure multiple times when switching from one WiFi network to another.
> Keep in mind that switching between networks calls wilc_disconnect() and
> wilc_deinit() functions and it is not yet clear which one is causing a
> core dump. I think it's worth at least taking a look at these areas of
> the code. What do you think?

If possible, please share the sequence(commands) for Wifi network
switching scenario. It looks like both functions(mac_close & disconnect)
are getting called from user context. mac_close() is a netdevice
callback whereas wilc_disconnect() is a cfg80211 callback. Generally,
wilc_disconnect() should be enough to disconnect from current Wifi
network without bringing the complete interface down. Is NetworkManager
closing the interface(mac_close()) before switching the WiFi network.


Regards,
Ajay

2023-04-04 16:32:02

by Kirill Buksha

[permalink] [raw]
Subject: Re: wilc1000 kernel crash

On 4.4.23. 03:30, [email protected] wrote:
> On 4/3/23 07:24, Kirill Buksha wrote:
>> [Some people who received this message don't often get email from [email protected]. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On 16.12.22. 11:18, Michael Walle wrote:
>>> Hi,
>>>
>>> On 22/12/09 02:14, [email protected] wrote:
>>>> No progress yet. I tried to simulate the condition a few times but was
>>>> unable to see the exact failure in my setup so I need to try more.
>>> Shouldn't it also be possible to see the issue by code reading? I've
>>> provided the call tree in my previous mail and my concerns regarding
>>> the locking. Either I'm missing something there or there is no
>>> locking between these threads which could cause this issue.
>>>
>>>> For the other "FW not responding" continuous logs, I got some clue.
>>>> Probably, will try to send that patch first.
>>> Ok, let me know if you have some patches, I'm happy to test them.
>>>
>>> -michael
>>>
>>>
>> Hello,
>>
>> I faced the same kernel oops issue. After analyzing my logs and brief
>> debugging, I agree with Mikhail: the problem seems to be accessing the
>> scan_result pointer after it has been nulled.
> I have submitted a patch [1] which has fix for scan_result NULL pointer
> exception issue. The submitted patch handles the synchronization between
> mac_close() and asynchronous interrupts from firmware. Basically, it
> takes care of blocking the execution of mac_close() till all pending
> works are completed and afterward no new work addition is allowed since
> the close is in progress. It is worth to try with that patch once and
> check it's behavior.
>
> 1.
> https://lore.kernel.org/linux-wireless/[email protected]/T/#u

Thank you for the patch. I will take a look/test it when I have time.

>> Regarding the solution: if there is a race between two threads (as
>> Michael described earlier), then I think that the locking mechanism will
>> be the most reliable solution. We ran into problems during
>> deinitialization, but driver contains two more places
>> (handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
>> where scan_result is set to NULL.
>>
>> I use NetworkManager to manage networks and I have experienced the same
>> failure multiple times when switching from one WiFi network to another.
>> Keep in mind that switching between networks calls wilc_disconnect() and
>> wilc_deinit() functions and it is not yet clear which one is causing a
>> core dump. I think it's worth at least taking a look at these areas of
>> the code. What do you think?
> If possible, please share the sequence(commands) for Wifi network
> switching scenario. It looks like both functions(mac_close & disconnect)
> are getting called from user context. mac_close() is a netdevice
> callback whereas wilc_disconnect() is a cfg80211 callback. Generally,
> wilc_disconnect() should be enough to disconnect from current Wifi
> network without bringing the complete interface down. Is NetworkManager
> closing the interface(mac_close()) before switching the WiFi network.
>
>
> Regards,
> Ajay

The commands are as follows:
while true; do nmcli c up wlan0-client; nmcli c up wlan0-client-2; done

It takes about 5 minutes until I see the core dump.
I see following message after every command:
...
wilc1000_sdio mmc0:0001:1 wlan0: Deinitializing wilc1000...
...
Message above comes from wilc_wlan_deinitialize() function which is called from wilc_mac_close(). It seems that interface is closed between connections.

Best regards,
Kirill Buksha.