Subject: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

From: Mohammed Shafi Shajakhan <[email protected]>

This fixes the crash below. When the ath10k firmware probe fails,
NAPI polling tries to access an rx ring resource which was never
allocated. Fix this by disabling NAPI right away once the firmware
probe fails, by calling 'ath10k_hif_stop'. It's good to note
that the error is never propagated to 'ath10k_pci_probe' when
ath10k_core_register fails, so calling 'ath10k_hif_stop' to clean up
PCI related things seems to be ok.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
__ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]

Call Trace:

[<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
[ath10k_core]
[<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
[ath10k_core]
[<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
[<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
[<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
[<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
[<ffffffff817863af>] net_rx_action+0x20f/0x370

Reported-by: Ben Greear <[email protected]>
Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
---
drivers/net/wireless/ath/ath10k/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index f7ea4de..15bccc9 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
 	ath10k_core_free_firmware_files(ar);
 
 err_power_down:
+	ath10k_hif_stop(ar);
 	ath10k_hif_power_down(ar);
 
 	return ret;
--
1.9.1


2017-01-25 13:46:33

by Kalle Valo

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Kalle Valo <[email protected]> writes:

> Mohammed Shafi Shajakhan <[email protected]> writes:
>
>> From: Mohammed Shafi Shajakhan <[email protected]>
>>
>> This fixes the below crash when ath10k probe firmware fails,
>> NAPI polling tries to access a rx ring resource which was never
>> allocated, fix this by disabling NAPI right away once the probe
>> firmware fails by calling 'ath10k_hif_stop'. Its good to note
>> that the error is never propogated to 'ath10k_pci_probe' when
>> ath10k_core_register fails, so calling 'ath10k_hif_stop' to cleanup
>> PCI related things seems to be ok
>>
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>> __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>>
>> Call Trace:
>>
>> [<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
>> [ath10k_core]
>> [<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
>> [ath10k_core]
>> [<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
>> [<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
>> [<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
>> [<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
>> [<ffffffff817863af>] net_rx_action+0x20f/0x370
>>
>> Reported-by: Ben Greear <[email protected]>
>> Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
>> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
>
> Is there an easy way to reproduce this bug? I don't see it on my x86
> laptop with qca988x and I call rmmod all the time. I would like to test
> this myself.
>
>> --- a/drivers/net/wireless/ath/ath10k/core.c
>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>> @@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
>> ath10k_core_free_firmware_files(ar);
>>
>> err_power_down:
>> + ath10k_hif_stop(ar);
>> ath10k_hif_power_down(ar);
>>
>> return ret;
>
> This breaks the symmetry, we should not be calling ath10k_hif_stop() if
> we haven't called ath10k_hif_start() from the same function. This can
> just create a bigger mess later, for example with other bus support like
> sdio or usb. In theory it should enough that we call
> ath10k_hif_power_down() and pci.c does the rest correctly "behind the
> scenes".
>
> I investigated this a bit and I think the real cause is that we call
> napi_enable() from ath10k_pci_hif_power_up() and napi_disable() from
> ath10k_pci_hif_stop(). Does anyone remember why?
>
> I was expecting that we would call napi_enable()/napi_disable() either
> in ath10k_hif_power_up/down() or ath10k_hif_start()/stop(), but not
> mixed like it's currently.

So below is something I was thinking of, now napi_enable() is called
from ath10k_hif_start() and napi_disable() from ath10k_hif_stop(). Would
that work?

--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1648,6 +1648,8 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
 
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
+	napi_enable(&ar->napi);
+
 	ath10k_pci_irq_enable(ar);
 	ath10k_pci_rx_post(ar);
 
@@ -2532,7 +2534,6 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar)
 		ath10k_err(ar, "could not wake up target CPU: %d\n", ret);
 		goto err_ce;
 	}
-	napi_enable(&ar->napi);
 
 	return 0;
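
A minimal user-space sketch of why the placement of napi_enable() matters.
This is not ath10k code: struct model_dev and the model_* helpers are
hypothetical names, and the model assumes that power-down leaves NAPI
untouched and that the rx ring only exists after a full start. It only
tracks which hook enables NAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, not the real struct ath10k. */
struct model_dev {
	bool napi_enabled;  /* stands in for napi_enable()/napi_disable() */
	bool rx_ring_ready; /* assumed to be set up only by a full start */
};

static void model_power_up(struct model_dev *d, bool napi_in_power_up)
{
	if (napi_in_power_up)
		d->napi_enabled = true;  /* old placement */
}

static void model_start(struct model_dev *d, bool napi_in_power_up)
{
	/* start must not run twice without a stop in between */
	assert(!d->rx_ring_ready);
	d->rx_ring_ready = true;
	if (!napi_in_power_up)
		d->napi_enabled = true;  /* proposed placement */
}

static void model_stop(struct model_dev *d)
{
	d->napi_enabled = false;         /* disable lives in stop in both variants */
	d->rx_ring_ready = false;
}

/* Failed firmware probe: power_up, then straight to power_down. */
bool unsafe_after_failed_probe(bool napi_in_power_up)
{
	struct model_dev d = { false, false };

	model_power_up(&d, napi_in_power_up);
	/* probe fails here: start/stop never run, and power_down
	 * (by assumption) does not touch NAPI */
	return d.napi_enabled && !d.rx_ring_ready;
}

/* Normal lifecycle: power_up, start, stop, power_down. */
bool balanced_after_start_stop(bool napi_in_power_up)
{
	struct model_dev d = { false, false };

	model_power_up(&d, napi_in_power_up);
	model_start(&d, napi_in_power_up);
	model_stop(&d);
	return !d.napi_enabled;
}
```

In this model, unsafe_after_failed_probe(true) reports the old placement
as leaving NAPI enabled with no rx ring, while the proposed placement does
not; both placements end up balanced on the normal start/stop path.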

-- 
Kalle Valo

2017-01-25 13:29:47

by Kalle Valo

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Mohammed Shafi Shajakhan <[email protected]> writes:

> From: Mohammed Shafi Shajakhan <[email protected]>
>
> This fixes the below crash when ath10k probe firmware fails,
> NAPI polling tries to access a rx ring resource which was never
> allocated, fix this by disabling NAPI right away once the probe
> firmware fails by calling 'ath10k_hif_stop'. Its good to note
> that the error is never propogated to 'ath10k_pci_probe' when
> ath10k_core_register fails, so calling 'ath10k_hif_stop' to cleanup
> PCI related things seems to be ok
>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
> __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>
> Call Trace:
>
> [<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
> [ath10k_core]
> [<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
> [ath10k_core]
> [<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
> [<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
> [<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
> [<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
> [<ffffffff817863af>] net_rx_action+0x20f/0x370
>
> Reported-by: Ben Greear <[email protected]>
> Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>

Is there an easy way to reproduce this bug? I don't see it on my x86
laptop with qca988x and I call rmmod all the time. I would like to test
this myself.

> --- a/drivers/net/wireless/ath/ath10k/core.c
> +++ b/drivers/net/wireless/ath/ath10k/core.c
> @@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
> ath10k_core_free_firmware_files(ar);
>
> err_power_down:
> + ath10k_hif_stop(ar);
> ath10k_hif_power_down(ar);
>
> return ret;

This breaks the symmetry: we should not be calling ath10k_hif_stop() if
we haven't called ath10k_hif_start() from the same function. This can
just create a bigger mess later, for example with other bus support like
sdio or usb. In theory it should be enough that we call
ath10k_hif_power_down() and pci.c does the rest correctly "behind the
scenes".

I investigated this a bit and I think the real cause is that we call
napi_enable() from ath10k_pci_hif_power_up() and napi_disable() from
ath10k_pci_hif_stop(). Does anyone remember why?

I was expecting that we would call napi_enable()/napi_disable() either
in ath10k_hif_power_up/down() or ath10k_hif_start()/stop(), but not
mixed like it is currently.

-- 
Kalle Valo

2017-02-06 12:26:17

by Michael Ney

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Symmetry is still broken on firmware crash (at least with 6174).
ath10k_pci_hif_stop gets called twice, once from the driver restart
(warm restart) and once from ieee80211 start (cold restart), resulting
in napi_synchronize/napi_disable getting called twice and sticking the
driver in an infinite wait loop (napi_synchronize waits until
NAPI_STATE_SCHED is cleared, while napi_disable leaves NAPI_STATE_SCHED
set when it returns).
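
The hang described above can be illustrated with a small single-threaded
model. This is a sketch under assumptions, not kernel code: struct
model_napi and the model_* functions are hypothetical stand-ins, a single
boolean stands in for NAPI_STATE_SCHED, and a bounded spin replaces the
kernel's unbounded wait so that "would hang" shows up as a false return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI state; false means "enabled, idle". */
struct model_napi {
	bool sched; /* models NAPI_STATE_SCHED */
};

/* Bounded stand-in for a wait loop. Nothing else in this single-threaded
 * model can clear the bit, so a set bit means the real loop never exits. */
static bool wait_until_sched_clear(const struct model_napi *n, int max_spins)
{
	for (int i = 0; i < max_spins; i++)
		if (!n->sched)
			return true;
	return false;
}

/* Waits for the bit to clear; returns false if it would wait forever. */
bool model_napi_synchronize(const struct model_napi *n)
{
	return wait_until_sched_clear(n, 1000);
}

/* Waits for the bit to clear, then takes it and keeps it set. */
bool model_napi_disable(struct model_napi *n)
{
	if (!wait_until_sched_clear(n, 1000))
		return false;
	n->sched = true; /* stays set until the matching enable */
	return true;
}

/* Only valid after a disable; clears the bit again. */
void model_napi_enable(struct model_napi *n)
{
	assert(n->sched);
	n->sched = false;
}

/* Models a stop hook that calls synchronize and then disable. */
bool model_hif_stop(struct model_napi *n)
{
	return model_napi_synchronize(n) && model_napi_disable(n);
}
```

In this model a first model_hif_stop() succeeds, a second one without an
intervening model_napi_enable() reports that it would block, and a stop
after re-enabling succeeds again.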


> On Feb 6, 2017, at 5:04 AM, Mohammed Shafi Shajakhan <[email protected]> wrote:
>
> Hi Kalle,
>
> the change suggested by you helps, and the device probe, scan
> is successful as well. Still good to have this change part of your
> basic sanity and regression testing !
>
> regards,
> shafi
>
> On Wed, Jan 25, 2017 at 01:46:28PM +0000, Valo, Kalle wrote:
>> Kalle Valo <[email protected]> writes:
>>
>>> Mohammed Shafi Shajakhan <[email protected]> writes:
>>>
>>>> From: Mohammed Shafi Shajakhan <[email protected]>
>>>>
>>>> This fixes the below crash when ath10k probe firmware fails,
>>>> NAPI polling tries to access a rx ring resource which was never
>>>> allocated, fix this by disabling NAPI right away once the probe
>>>> firmware fails by calling 'ath10k_hif_stop'. Its good to note
>>>> that the error is never propogated to 'ath10k_pci_probe' when
>>>> ath10k_core_register fails, so calling 'ath10k_hif_stop' to cleanup
>>>> PCI related things seems to be ok
>>>>
>>>> BUG: unable to handle kernel NULL pointer dereference at (null)
>>>> IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>>>> __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>>>>
>>>> Call Trace:
>>>>
>>>> [<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
>>>> [ath10k_core]
>>>> [<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
>>>> [ath10k_core]
>>>> [<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
>>>> [<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
>>>> [<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
>>>> [<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
>>>> [<ffffffff817863af>] net_rx_action+0x20f/0x370
>>>>
>>>> Reported-by: Ben Greear <[email protected]>
>>>> Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
>>>> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
>>>
>>> Is there an easy way to reproduce this bug? I don't see it on my x86
>>> laptop with qca988x and I call rmmod all the time. I would like to test
>>> this myself.
>>>
>>>> --- a/drivers/net/wireless/ath/ath10k/core.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>>>> @@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
>>>> ath10k_core_free_firmware_files(ar);
>>>>
>>>> err_power_down:
>>>> + ath10k_hif_stop(ar);
>>>> ath10k_hif_power_down(ar);
>>>>
>>>> return ret;
>>>
>>> This breaks the symmetry, we should not be calling ath10k_hif_stop() if
>>> we haven't called ath10k_hif_start() from the same function. This can
>>> just create a bigger mess later, for example with other bus support like
>>> sdio or usb. In theory it should enough that we call
>>> ath10k_hif_power_down() and pci.c does the rest correctly "behind the
>>> scenes".
>>>
>>> I investigated this a bit and I think the real cause is that we call
>>> napi_enable() from ath10k_pci_hif_power_up() and napi_disable() from
>>> ath10k_pci_hif_stop(). Does anyone remember why?
>>>
>>> I was expecting that we would call napi_enable()/napi_disable() either
>>> in ath10k_hif_power_up/down() or ath10k_hif_start()/stop(), but not
>>> mixed like it's currently.
>>
>> So below is something I was thinking of, now napi_enable() is called
>> from ath10k_hif_start() and napi_disable() from ath10k_hif_stop(). Would
>> that work?
>>
>> --- a/drivers/net/wireless/ath/ath10k/pci.c
>> +++ b/drivers/net/wireless/ath/ath10k/pci.c
>> @@ -1648,6 +1648,8 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
>>
>> ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
>>
>> + napi_enable(&ar->napi);
>> +
>> ath10k_pci_irq_enable(ar);
>> ath10k_pci_rx_post(ar);
>>
>> @@ -2532,7 +2534,6 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar)
>> ath10k_err(ar, "could not wake up target CPU: %d\n", ret);
>> goto err_ce;
>> }
>> - napi_enable(&ar->napi);
>>
>> return 0;
>>
>> --
>> Kalle Valo
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k

2017-02-10 11:37:28

by Kalle Valo

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Kalle Valo <[email protected]> writes:

> BTW, just curious but why do you have "during rmmod" in the title? I
> think I was able to reproduce this crash by removing all firmware files
> and didn't use rmmod at all.

Never mind, I was blind again. It was my script which calls rmmod and I
failed to realise that. Sorry for the noise.

-- 
Kalle Valo

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Hi,

even with the below patch applied ?
https://patchwork.kernel.org/patch/9452265/

regards
shafi
________________________________________
From: Michael Ney <[email protected]>
Sent: 06 February 2017 17:46
To: Mohammed Shafi Shajakhan
Cc: Valo, Kalle; [email protected]; [email protected]; Shajakhan, Mohammed Shafi (Mohammed Shafi)
Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Symmetry is still broken on firmware crash (at least with 6174). ath10k_pci_hif_stop gets called twice, once from the driver restart (warm restart) and once from ieee80211 start (cold restart), resulting in napi_synchrionize/napi_disable getting called twice and sticking the driver in an infinite wait loop (napi_synchronize waits until NAPI_STATE_SCHED is off, while napi_disable leaves NAPI_STATE_SCHED to on when leaving).


> On Feb 6, 2017, at 5:04 AM, Mohammed Shafi Shajakhan <[email protected]> wrote:
>
> Hi Kalle,
>
> the change suggested by you helps, and the device probe, scan
> is successful as well. Still good to have this change part of your
> basic sanity and regression testing !
>
> regards,
> shafi
>
> On Wed, Jan 25, 2017 at 01:46:28PM +0000, Valo, Kalle wrote:
>> Kalle Valo <[email protected]> writes:
>>
>>> Mohammed Shafi Shajakhan <[email protected]> writes:
>>>
>>>> From: Mohammed Shafi Shajakhan <[email protected]>
>>>>
>>>> This fixes the below crash when ath10k probe firmware fails,
>>>> NAPI polling tries to access a rx ring resource which was never
>>>> allocated, fix this by disabling NAPI right away once the probe
>>>> firmware fails by calling 'ath10k_hif_stop'. Its good to note
>>>> that the error is never propogated to 'ath10k_pci_probe' when
>>>> ath10k_core_register fails, so calling 'ath10k_hif_stop' to cleanup
>>>> PCI related things seems to be ok
>>>>
>>>> BUG: unable to handle kernel NULL pointer dereference at (null)
>>>> IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>>>> __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
>>>>
>>>> Call Trace:
>>>>
>>>> [<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
>>>> [ath10k_core]
>>>> [<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
>>>> [ath10k_core]
>>>> [<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
>>>> [<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
>>>> [<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
>>>> [<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
>>>> [<ffffffff817863af>] net_rx_action+0x20f/0x370
>>>>
>>>> Reported-by: Ben Greear <[email protected]>
>>>> Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
>>>> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
>>>
>>> Is there an easy way to reproduce this bug? I don't see it on my x86
>>> laptop with qca988x and I call rmmod all the time. I would like to test
>>> this myself.
>>>
>>>> --- a/drivers/net/wireless/ath/ath10k/core.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>>>> @@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
>>>> ath10k_core_free_firmware_files(ar);
>>>>
>>>> err_power_down:
>>>> + ath10k_hif_stop(ar);
>>>> ath10k_hif_power_down(ar);
>>>>
>>>> return ret;
>>>
>>> This breaks the symmetry, we should not be calling ath10k_hif_stop() if
>>> we haven't called ath10k_hif_start() from the same function. This can
>>> just create a bigger mess later, for example with other bus support like
>>> sdio or usb. In theory it should enough that we call
>>> ath10k_hif_power_down() and pci.c does the rest correctly "behind the
>>> scenes".
>>>
>>> I investigated this a bit and I think the real cause is that we call
>>> napi_enable() from ath10k_pci_hif_power_up() and napi_disable() from
>>> ath10k_pci_hif_stop(). Does anyone remember why?
>>>
>>> I was expecting that we would call napi_enable()/napi_disable() either
>>> in ath10k_hif_power_up/down() or ath10k_hif_start()/stop(), but not
>>> mixed like it's currently.
>>
>> So below is something I was thinking of, now napi_enable() is called
>> from ath10k_hif_start() and napi_disable() from ath10k_hif_stop(). Would
>> that work?
>>
>> --- a/drivers/net/wireless/ath/ath10k/pci.c
>> +++ b/drivers/net/wireless/ath/ath10k/pci.c
>> @@ -1648,6 +1648,8 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
>>
>> ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
>>
>> + napi_enable(&ar->napi);
>> +
>> ath10k_pci_irq_enable(ar);
>> ath10k_pci_rx_post(ar);
>>
>> @@ -2532,7 +2534,6 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar)
>> ath10k_err(ar, "could not wake up target CPU: %d\n", ret);
>> goto err_ce;
>> }
>> - napi_enable(&ar->napi);
>>
>> return 0;
>>
>> --
>> Kalle Valo
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k

2017-02-06 06:03:00

by Mohammed Shafi Shajakhan

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Hi Kalle,

sorry for the delay

On Wed, Jan 25, 2017 at 01:46:28PM +0000, Valo, Kalle wrote:
> Kalle Valo <[email protected]> writes:
>
> > Mohammed Shafi Shajakhan <[email protected]> writes:
> >
> >> From: Mohammed Shafi Shajakhan <[email protected]>
> >>
> >> This fixes the below crash when ath10k probe firmware fails,
> >> NAPI polling tries to access a rx ring resource which was never
> >> allocated, fix this by disabling NAPI right away once the probe
> >> firmware fails by calling 'ath10k_hif_stop'. Its good to note
> >> that the error is never propogated to 'ath10k_pci_probe' when
> >> ath10k_core_register fails, so calling 'ath10k_hif_stop' to cleanup
> >> PCI related things seems to be ok
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at (null)
> >> IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
> >> __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
> >>
> >> Call Trace:
> >>
> >> [<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
> >> [ath10k_core]
> >> [<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
> >> [ath10k_core]
> >> [<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
> >> [<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
> >> [<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
> >> [<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
> >> [<ffffffff817863af>] net_rx_action+0x20f/0x370
> >>
> >> Reported-by: Ben Greear <[email protected]>
> >> Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
> >> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
> >
> > Is there an easy way to reproduce this bug? I don't see it on my x86
> > laptop with qca988x and I call rmmod all the time. I would like to test
> > this myself.
> >
> >> --- a/drivers/net/wireless/ath/ath10k/core.c
> >> +++ b/drivers/net/wireless/ath/ath10k/core.c
> >> @@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
> >> ath10k_core_free_firmware_files(ar);
> >>
> >> err_power_down:
> >> + ath10k_hif_stop(ar);
> >> ath10k_hif_power_down(ar);
> >>
> >> return ret;
> >
> > This breaks the symmetry, we should not be calling ath10k_hif_stop() if
> > we haven't called ath10k_hif_start() from the same function. This can
> > just create a bigger mess later, for example with other bus support like
> > sdio or usb. In theory it should enough that we call
> > ath10k_hif_power_down() and pci.c does the rest correctly "behind the
> > scenes".
> >
> > I investigated this a bit and I think the real cause is that we call
> > napi_enable() from ath10k_pci_hif_power_up() and napi_disable() from
> > ath10k_pci_hif_stop(). Does anyone remember why?
> >
> > I was expecting that we would call napi_enable()/napi_disable() either
> > in ath10k_hif_power_up/down() or ath10k_hif_start()/stop(), but not
> > mixed like it's currently.
>
> So below is something I was thinking of, now napi_enable() is called
> from ath10k_hif_start() and napi_disable() from ath10k_hif_stop(). Would
> that work?
>
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1648,6 +1648,8 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
>
> ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
>
> + napi_enable(&ar->napi);
> +
> ath10k_pci_irq_enable(ar);
> ath10k_pci_rx_post(ar);
>
> @@ -2532,7 +2534,6 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar)
> ath10k_err(ar, "could not wake up target CPU: %d\n", ret);
> goto err_ce;
> }
> - napi_enable(&ar->napi);
>
> return 0;
>

[shafi] I think I tried this change some time back, but it caused a
regression during device start-up. Let me check this once more and get
back to you.

regards,
shafi

2017-02-06 10:05:10

by Mohammed Shafi Shajakhan

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Hi Kalle,

The change you suggested helps, and device probe and scan are
successful as well. It would still be good to have this change as part
of your basic sanity and regression testing!

regards,
shafi

On Wed, Jan 25, 2017 at 01:46:28PM +0000, Valo, Kalle wrote:
> Kalle Valo <[email protected]> writes:
>
> > Mohammed Shafi Shajakhan <[email protected]> writes:
> >
> >> From: Mohammed Shafi Shajakhan <[email protected]>
> >>
> >> This fixes the below crash when ath10k probe firmware fails,
> >> NAPI polling tries to access a rx ring resource which was never
> >> allocated, fix this by disabling NAPI right away once the probe
> >> firmware fails by calling 'ath10k_hif_stop'. Its good to note
> >> that the error is never propogated to 'ath10k_pci_probe' when
> >> ath10k_core_register fails, so calling 'ath10k_hif_stop' to cleanup
> >> PCI related things seems to be ok
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at (null)
> >> IP: __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
> >> __ath10k_htt_rx_ring_fill_n+0x19/0x230 [ath10k_core]
> >>
> >> Call Trace:
> >>
> >> [<ffffffffa113ec62>] ath10k_htt_rx_msdu_buff_replenish+0x42/0x90
> >> [ath10k_core]
> >> [<ffffffffa113f393>] ath10k_htt_txrx_compl_task+0x433/0x17d0
> >> [ath10k_core]
> >> [<ffffffff8114406d>] ? __wake_up_common+0x4d/0x80
> >> [<ffffffff811349ec>] ? cpu_load_update+0xdc/0x150
> >> [<ffffffffa119301d>] ? ath10k_pci_read32+0xd/0x10 [ath10k_pci]
> >> [<ffffffffa1195b17>] ath10k_pci_napi_poll+0x47/0x110 [ath10k_pci]
> >> [<ffffffff817863af>] net_rx_action+0x20f/0x370
> >>
> >> Reported-by: Ben Greear <[email protected]>
> >> Fixes: 3c97f5de1f28 ("ath10k: implement NAPI support")
> >> Signed-off-by: Mohammed Shafi Shajakhan <[email protected]>
> >
> > Is there an easy way to reproduce this bug? I don't see it on my x86
> > laptop with qca988x and I call rmmod all the time. I would like to test
> > this myself.
> >
> >> --- a/drivers/net/wireless/ath/ath10k/core.c
> >> +++ b/drivers/net/wireless/ath/ath10k/core.c
> >> @@ -2164,6 +2164,7 @@ static int ath10k_core_probe_fw(struct ath10k *ar)
> >> ath10k_core_free_firmware_files(ar);
> >>
> >> err_power_down:
> >> + ath10k_hif_stop(ar);
> >> ath10k_hif_power_down(ar);
> >>
> >> return ret;
> >
> > This breaks the symmetry, we should not be calling ath10k_hif_stop() if
> > we haven't called ath10k_hif_start() from the same function. This can
> > just create a bigger mess later, for example with other bus support like
> > sdio or usb. In theory it should enough that we call
> > ath10k_hif_power_down() and pci.c does the rest correctly "behind the
> > scenes".
> >
> > I investigated this a bit and I think the real cause is that we call
> > napi_enable() from ath10k_pci_hif_power_up() and napi_disable() from
> > ath10k_pci_hif_stop(). Does anyone remember why?
> >
> > I was expecting that we would call napi_enable()/napi_disable() either
> > in ath10k_hif_power_up/down() or ath10k_hif_start()/stop(), but not
> > mixed like it's currently.
>
> So below is something I was thinking of, now napi_enable() is called
> from ath10k_hif_start() and napi_disable() from ath10k_hif_stop(). Would
> that work?
>
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1648,6 +1648,8 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
>
> ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
>
> + napi_enable(&ar->napi);
> +
> ath10k_pci_irq_enable(ar);
> ath10k_pci_rx_post(ar);
>
> @@ -2532,7 +2534,6 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar)
> ath10k_err(ar, "could not wake up target CPU: %d\n", ret);
> goto err_ce;
> }
> - napi_enable(&ar->napi);
>
> return 0;
>
> --
> Kalle Valo

2017-02-10 09:48:00

by Kalle Valo

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Mohammed Shafi Shajakhan <[email protected]> writes:

> the change suggested by you helps, and the device probe, scan
> is successful as well. Still good to have this change part of your
> basic sanity and regression testing !

Sure. I was sort of expecting that you would send v4 but I haven't seen
one, so I guess you assumed I would send it? :) I'll submit v4 then.

BTW, just curious but why do you have "during rmmod" in the title? I
think I was able to reproduce this crash by removing all firmware files
and didn't use rmmod at all.

-- 
Kalle Valo

Subject: Re: [PATCH v3] ath10k: Fix crash during rmmod when probe firmware fails

Mohammed Shafi Shajakhan <[email protected]> writes:

> the change suggested by you helps, and the device probe, scan
> is successful as well. Still good to have this change part of your
> basic sanity and regression testing !

Sure. I was sort of expecting that you would send v4 but I haven't seen
one so I guess you assumed I send that? :) I'll then submit v4.

[shafi] thanks Kalle, just saw your patch.

regards,
shafi