2014-08-18 04:01:33

by Linus Torvalds

[permalink] [raw]
Subject: More wireless problems..

So there's still something seriously wrong with the wireless changes
in the current merge window. Maybe it's not new, and is triggered by
something with the hotel wireless here at the kernel summit, but I
just got a NULL pointer dereference:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]

this may not be iwl-specific at all, but the second oops (which is
likely just a result of the first one) does end up having iwl-related
functions on the stack , so I'm cc'ing both generic wireless people
and intel wireless people. It's the same machine that showed the intel
wireless scanning microcode problem, but the oops looks very
different.

The end result is a dead machine (when kworkers die, things tend to go
downhill fast), so it would be good to have people give this a good
look.

Attached is the more complete oops details. I don't know what
triggered it, the machine was largely idle.

Ideas?

Linus


Attachments:
oops (6.07 kB)

2014-08-18 13:40:47

by Luca Coelho

[permalink] [raw]
Subject: Re: [PATCH] mac80211: fix channel switch for chanctx-based drivers

On Mon, 2014-08-18 at 13:19 +0200, Michal Kazior wrote:
> The new_ctx pointer is set only for non-chanctx
> drivers. This yielded a crash for chanctx-based
> drivers during channel switch finalization:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]
>
> Use an adequate chanctx pointer to fix this.
>
> Reported-by: Linus Torvalds <[email protected]>
> Signed-off-by: Michal Kazior <[email protected]>
> ---
> Note: This is based on mac80211-next/master albeit
> it should apply cleanly on wireless-next/master
> and v3.17-rc1.
>
> I've verified this fix with iwlmvm & 7260.

Cool! I've also tested this (with P2P client) and it works fine. You
can add my:

Tested-by: Luciano Coelho <[email protected]>

The reason I haven't seen this before is because I've been using 2
channels support with iwlmvm, so we never get an in-place channel
switch. :( The normal case is to have single channel support...

--
Cheers,
Luca.


2014-08-18 11:31:16

by Michal Kazior

[permalink] [raw]
Subject: [PATCH] mac80211: fix channel switch for chanctx-based drivers

The new_ctx pointer is set only for non-chanctx
drivers. This yielded a crash for chanctx-based
drivers during channel switch finalization:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]

Use an adequate chanctx pointer to fix this.

Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Michal Kazior <[email protected]>
---
Note: This is based on mac80211-next/master albeit
it should apply cleanly on wireless-next/master
and v3.17-rc1.

I've verified this fix with iwlmvm & 7260.


net/mac80211/chan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/mac80211/chan.c b/net/mac80211/chan.c
index f3317fa..7367e66 100644
--- a/net/mac80211/chan.c
+++ b/net/mac80211/chan.c
@@ -1353,7 +1353,7 @@ static int ieee80211_vif_use_reserved_switch(struct ieee80211_local *local)

list_del(&sdata->reserved_chanctx_list);
list_move(&sdata->assigned_chanctx_list,
- &new_ctx->assigned_vifs);
+ &ctx->assigned_vifs);
sdata->reserved_chanctx = NULL;

ieee80211_vif_chanctx_reservation_complete(sdata);
--
1.8.5.3


2014-08-18 04:37:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: More wireless problems..

On Sun, Aug 17, 2014 at 11:01 PM, Linus Torvalds
<[email protected]> wrote:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]

Looking at the Code: line and the code generation for that function,
this *looks* to be this code:

list_del(&sdata->reserved_chanctx_list);
list_move(&sdata->assigned_chanctx_list,
&new_ctx->assigned_vifs);
sdata->reserved_chanctx = NULL;

in ieee80211_vif_use_reserved_switch(), where "new_ctx" is NULL, so
the "list_move()" ends up oopsing. But maybe I screwed up the
analysis, I don't know the code.

Looks like that is all-new code introduced by commit 5bcae31d9cb1
("mac80211: implement multi-vif in-place reservations")

And doesn't look at all IWL-specific. Adding Michal Kazior to the list
of people.

Linus

2014-08-18 13:53:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mac80211: fix channel switch for chanctx-based drivers

On Mon, Aug 18, 2014 at 6:19 AM, Michal Kazior <[email protected]> wrote:
>
> I've verified this fix with iwlmvm & 7260.

So I'm running a kernel with this manually applied, and so far so
good. But I don't know what actually triggered the problem, and it
definitely didn't happen all the time, so my testing of this is
dubious. But the patch certainly seems to match the symptoms. Thanks,

Linus

2014-08-18 13:59:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mac80211: fix channel switch for chanctx-based drivers

On Mon, Aug 18, 2014 at 6:19 AM, Michal Kazior <[email protected]> wrote:
>
> I've verified this fix with iwlmvm & 7260.

So I'm running a kernel with this manually applied, and so far so
good. But I don't know what actually triggered the problem, and it
definitely didn't happen all the time, so my testing of this is
dubious. But the patch certainly seems to match the symptoms. Thanks,

Linus

2014-08-18 09:18:27

by Luca Coelho

[permalink] [raw]
Subject: Re: More wireless problems..

Hi Linus,

On Sun, 2014-08-17 at 23:37 -0500, Linus Torvalds wrote:
> On Sun, Aug 17, 2014 at 11:01 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > IP: ieee80211_vif_use_reserved_switch+0x71c/0xb00 [mac80211]
>
> Looking at the Code: line and the code generation for that function,
> this *looks* to be this code:
>
> list_del(&sdata->reserved_chanctx_list);
> list_move(&sdata->assigned_chanctx_list,
> &new_ctx->assigned_vifs);
> sdata->reserved_chanctx = NULL;
>
> in ieee80211_vif_use_reserved_switch(), where "new_ctx" is NULL, so
> the "list_move()" ends up oopsing. But maybe I screwed up the
> analysis, I don't know the code.
>
> Looks like that is all-new code introduced by commit 5bcae31d9cb1
> ("mac80211: implement multi-vif in-place reservations")
>
> And doesn't look at all IWL-specific. Adding Michal Kazior to the list
> of people.

Unfortunately I don't think channel switch on the client side has been
tested very thoroughly yet, from the iwlwifi point of view, we're still
fine-tuning things in the firmware configuration. So I'd rather disable
channel-switch entirely for iwlwifi.

Probably better to disable it in mac80211, since this oops will probably
happen with other drivers too, since doesn't seem to be iwlwifi related.

I'll sent a patch soon. Meanwhile, you can apply this directly, to make
the problem go away:

diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index e0ab432..4b8eee8 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -758,6 +758,12 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
netdev_features_t feature_whitelist;
struct cfg80211_chan_def dflt_chandef = {};

+ /* The CSA client code is not very stable yet and seems to be
+ * causing trouble. Disable it for all drivers until
+ * everything has been fixed and tested properly.
+ */
+ hw->flags &= ~IEEE80211_HW_CHANCTX_STA_CSA;
+
if (hw->flags & IEEE80211_HW_QUEUE_CONTROL &&
(local->hw.offchannel_tx_hw_queue ==
IEEE80211_INVAL_HW_QUEUE ||
local->hw.offchannel_tx_hw_queue >= local->hw.queues))

--
Cheers,
Luca.


2014-08-18 13:59:40

by Luca Coelho

[permalink] [raw]
Subject: Re: [PATCH] mac80211: fix channel switch for chanctx-based drivers

Hi Linus,

On Mon, 2014-08-18 at 08:53 -0500, Linus Torvalds wrote:
> On Mon, Aug 18, 2014 at 6:19 AM, Michal Kazior <[email protected]> wrote:
> >
> > I've verified this fix with iwlmvm & 7260.
>
> So I'm running a kernel with this manually applied, and so far so
> good. But I don't know what actually triggered the problem, and it
> definitely didn't happen all the time, so my testing of this is
> dubious. But the patch certainly seems to match the symptoms. Thanks,

What triggers this is a "Channel Switch Announcement" on which the
access point tells the clients to move to another channel at a specified
time. This is not very common, but some enterprise APs use it to
improve the operating radio conditions, for instance.

Previously, as a client, we would simply disconnect from the current
channel and reconnect on the new channel after the time specified by the
AP. Now we implemented a more advanced switch where we don't lose
connectivity, but "simply" switch channels.

Hope this clarifies a bit.

--
Cheers,
Luca.