2015-06-01 01:37:42

by Marty Faltesek

Subject: system hang with backports-20150511/20150525

Starting with backports-20150511, and continuing with
backports-20150525, we see frequent system hangs. backports-20150424
had no issue.

After the freeze, both the console and the network stack are
unresponsive (ssh and ping do not work). Using sysrq, I can see log
messages still arriving from ath10k_pci after the freeze, along with
messages from some other threads.

mac80211/ath10k/cfg80211 are the only modules in use from backports,
so a deadlock in mac80211 or ath10k seems the most likely cause.

LOCKDEP didn't reveal anything.

Using a 3.2.26 kernel on ARM. AP mode. No encryption.

I've collected ftrace events for sched, mac80211, net, napi, cfg80211
and workqueue. They are included in the dmesg, which I've uploaded
here because of its size:

http://tinyurl.com/dmesg-ftrace
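
For anyone trying to reproduce a similar capture, here is a minimal sketch of enabling those event subsystems through the tracing `set_event` file. It assumes debugfs is mounted at /sys/kernel/debug; the function name and the TRACING_DIR override are hypothetical, and the thread does not say how the trace buffer was actually dumped into dmesg.

```shell
#!/bin/sh
# Sketch only: enable_trace_events and TRACING_DIR are illustrative names.
# Enables every event of each named subsystem, e.g. "sched:*".
enable_trace_events() {
    tracing=${TRACING_DIR:-/sys/kernel/debug/tracing}
    # Truncating set_event clears any previously enabled events.
    : > "$tracing/set_event" || return 1
    for sys in "$@"; do
        # Appending adds to the selection instead of replacing it.
        echo "$sys:*" >> "$tracing/set_event" || return 1
    done
}
```

As root this would be called as `enable_trace_events sched mac80211 net napi cfg80211 workqueue`.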

In the logs, the last timestamp that my test script wrote is:

[ 1021.291495] hbeat0352

I've captured ftrace events before and after 1021.291495.

Thanks,
Marty Faltesek
Google Fiber


2015-06-02 05:20:06

by Michal Kazior

Subject: Re: system hang with backports-20150511/20150525

On 1 June 2015 at 21:42, Marty Faltesek <[email protected]> wrote:
> I disabled IEEE80211_HW_SUPPORT_FAST_XMIT before, and still saw the
> hang. I repeated today to confirm.

Thanks for checking.


> I added the extra ath10k debug flags you requested, and it causes a
> system reset without any messages, very soon after the last hbeat
> timestamp. I've uploaded log "crash.6.1.15.13.46" to
> http://tinyurl.com/dmesg-ftrace.

I guess the serial console gave up, which isn't really surprising :(
Thanks for checking anyway.


> Any advice on how to bisect when using backports?

Sure. Generally you'll need to do the bisect on your linux-next tree
which you use to generate backports:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/backports/backports.git
git clone git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
cd linux-next
git bisect start
git bisect good a3da0fb6
git bisect bad f17107c
# repeat:
cd ../backports
./gentree.py --clean --git-revision HEAD ../linux-next ../backports-output/
cd ../backports-output
# configure, make, test
cd ../linux-next
git bisect <good|bad> # "good" if problem doesn't reproduce, "bad" if it does
# goto repeat
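
The loop above could be wrapped in a few helpers, sketched here under assumptions: the function names are made up, the three directories are assumed to sit side by side as in the commands above, and the 0/125/other status convention is borrowed from `git bisect run` rather than required by anything in this thread. The build and the on-target test remain manual.

```shell
#!/bin/sh
# Sketch only: helper names and the directory layout are assumptions.

# Regenerate the backports tree from whatever commit git bisect has
# currently checked out in linux-next. $1 is the directory that holds
# backports/, linux-next/ and backports-output/ side by side.
regen_backports() {
    top=$(cd "${1:-.}" && pwd) || return 1
    ( cd "$top/backports" &&
      ./gentree.py --clean --git-revision HEAD \
          "$top/linux-next" "$top/backports-output" )
}

# Map a test result to the word git bisect expects:
# 0 = hang did not reproduce, 125 = could not build or boot, else = hang.
verdict_for_status() {
    case "$1" in
        0)   echo good ;;
        125) echo skip ;;
        *)   echo bad ;;
    esac
}

# Record the verdict for the current step in the linux-next checkout.
mark_step() {
    top=$(cd "${1:-.}" && pwd) || return 1
    ( cd "$top/linux-next" && git bisect "$(verdict_for_status "$2")" )
}
```

A step would then be `regen_backports ~/work`, followed by the manual build and test, followed by `mark_step ~/work 1` if the hang showed up or `mark_step ~/work 0` if it did not.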


Michał




>
> On Mon, Jun 1, 2015 at 4:27 AM, Michal Kazior <[email protected]> wrote:
>> On 1 June 2015 at 09:13, Kalle Valo <[email protected]> wrote:
>>> Michal Kazior <[email protected]> writes:
>>>
>>>> +ath10k list
>>>>
>>>> On 1 June 2015 at 03:37, Marty Faltesek <[email protected]> wrote:
>>>>> Starting with backports-20150511, and continuing with
>>>>> backports-20150525, we see frequent system hangs. backports-20150424
>>>>> had no issue.
>>>>
>>>> I don't see such binary releases on
>>>> https://backports.wiki.kernel.org/index.php/Main_Page
>>>> Hence I don't know what kernel you've backported the drivers from and
>>>> I can't compare anything.
>>>>
>>>> Can you provide more details, please?
>>>
>>> I suspect it's from here:
>>>
>>> https://www.kernel.org/pub/linux/kernel/projects/backports/2015/05/25/
>>>
>>> The backports project pages are a bit confusing and that location is
>>> hard to find.
>>
>> Oh, thanks!
>>
>> Hmm.. There was a ton of changes between 20150424 and 20150511. For
>> one, ath10k started to use chanctx API and FAST_XMIT. But it's not a
>> given these two are to blame.
>>
>> The latter can be easily disabled by removing
>> IEEE80211_HW_SUPPORT_FAST_XMIT from ar->hw->flags in ath10k's mac.c.
>> The former.. not so easy. It'd be awesome if you could do a git bisect.
>> The commit ids are a3da0fb6(good) f17107c(bad) (you need linux-next
>> git repo including its tags for these ids to be resolvable).
>>
>>
>> Michał

2015-06-01 19:42:50

by Marty Faltesek

Subject: Re: system hang with backports-20150511/20150525

I disabled IEEE80211_HW_SUPPORT_FAST_XMIT before, and still saw the
hang. I repeated today to confirm.

I added the extra ath10k debug flags you requested, and it causes a
system reset without any messages, very soon after the last hbeat
timestamp. I've uploaded log "crash.6.1.15.13.46" to
http://tinyurl.com/dmesg-ftrace.

Any advice on how to bisect when using backports?

On Mon, Jun 1, 2015 at 4:27 AM, Michal Kazior <[email protected]> wrote:
> On 1 June 2015 at 09:13, Kalle Valo <[email protected]> wrote:
>> Michal Kazior <[email protected]> writes:
>>
>>> +ath10k list
>>>
>>> On 1 June 2015 at 03:37, Marty Faltesek <[email protected]> wrote:
>>>> Starting with backports-20150511, and continuing with
>>>> backports-20150525, we see frequent system hangs. backports-20150424
>>>> had no issue.
>>>
>>> I don't see such binary releases on
>>> https://backports.wiki.kernel.org/index.php/Main_Page
>>> Hence I don't know what kernel you've backported the drivers from and
>>> I can't compare anything.
>>>
>>> Can you provide more details, please?
>>
>> I suspect it's from here:
>>
>> https://www.kernel.org/pub/linux/kernel/projects/backports/2015/05/25/
>>
>> The backports project pages are a bit confusing and that location is
>> hard to find.
>
> Oh, thanks!
>
> Hmm.. There was a ton of changes between 20150424 and 20150511. For
> one, ath10k started to use chanctx API and FAST_XMIT. But it's not a
> given these two are to blame.
>
> The latter can be easily disabled by removing
> IEEE80211_HW_SUPPORT_FAST_XMIT from ar->hw->flags in ath10k's mac.c.
> The former.. not so easy. It'd be awesome if you could do a git bisect.
> The commit ids are a3da0fb6(good) f17107c(bad) (you need linux-next
> git repo including its tags for these ids to be resolvable).
>
>
> Michał

2015-06-01 06:36:08

by Michal Kazior

Subject: Re: system hang with backports-20150511/20150525

+ath10k list

On 1 June 2015 at 03:37, Marty Faltesek <[email protected]> wrote:
> Starting with backports-20150511, and continuing with
> backports-20150525, we see frequent system hangs. backports-20150424
> had no issue.

I don't see such binary releases on
https://backports.wiki.kernel.org/index.php/Main_Page
Hence I don't know what kernel you've backported the drivers from and
I can't compare anything.

Can you provide more details, please?


> After the freeze, the console is non-responsive, as well as the
> network stack (ssh/ping does not work). Using sysrq, I can see log
> messages continuing from ath10k_pci after the freeze, along with some
> other threads as well.

You're probably referring to:

[ 1026.951643] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
[ 1026.951674] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
[ 1026.951698] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon

What puzzles me are these timestamps. SWBA events are generated by
firmware (and sent to the host) every beacon interval, which is ~100ms
in most cases. In your case, however, I can see a burst of at least 10
SWBA events within 1ms. Either the top half (irq) or the bottom half
(tasklet) got stuck for some time.
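
One way to check the spacing is to print the gap between consecutive SWBA lines. This is a rough sketch: the function name is made up, and it assumes the usual "[ seconds.microseconds]" dmesg prefix with each message on a single line, so wrapped lines would need to be re-joined first.

```shell
#!/bin/sh
# Sketch only: swba_gaps is an illustrative name.
# Prints the gap in milliseconds between consecutive "SWBA overrun" lines
# read from the given files, or from stdin when no file is given.
swba_gaps() {
    awk '/SWBA overrun/ {
        ts = $0
        sub(/^\[ */, "", ts)    # drop the leading "[" and its padding
        sub(/\].*$/, "", ts)    # drop everything after the timestamp
        if (seen) printf "%.3f\n", (ts - prev) * 1000
        prev = ts
        seen = 1
    }' "$@"
}
```

Running `swba_gaps dmesg.txt` on a saved log and seeing gaps far below the ~100ms beacon interval would point to events being processed in a batch after a stall rather than as they arrived.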

It would be useful if you could enable ath10k debugging with
debug_mask=0xffffff3f (this can generate a lot of messages if you're
running traffic through ath10k).
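
To see what that mask leaves switched off, the cleared bits can be listed with a small helper. This is only a sketch with a made-up function name; which debug category each bit corresponds to should be checked against debug.h in the tree being built.

```shell
#!/bin/sh
# Sketch only: cleared_bits is an illustrative name.
# Prints every bit that is NOT set in the low 32 bits of the given mask.
cleared_bits() {
    mask=$(( $1 ))
    bit=0
    while [ "$bit" -lt 32 ]; do
        if [ $(( (mask >> bit) & 1 )) -eq 0 ]; then
            printf '0x%08x\n' $(( 1 << bit ))
        fi
        bit=$(( bit + 1 ))
    done
}
```

For 0xffffff3f this lists only 0x40 and 0x80, so everything else is enabled. The mask itself is normally passed as a module parameter when loading ath10k_core, for example `modprobe ath10k_core debug_mask=0xffffff3f`, assuming the driver was built with debugging support.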


> mac80211/ath10k/cfg80211 are the only modules in use from backports,
> so it seems like a deadlock could possibly be with mac80211 or
> ath10k.
>
> LOCKDEP didn't reveal anything.

You might want to try tuning /proc/sys/kernel/hung_task_timeout_secs
down (e.g. to 5 or 10 seconds) and see what gets reported when you hit
the problem.
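
A minimal sketch of doing that from a script follows. The function name and its optional second argument are hypothetical; the second argument exists only so the function can be pointed at an ordinary file for a dry run. Note that the hung-task detector reports tasks stuck in uninterruptible sleep, so it may stay silent if the stall is in interrupt context.

```shell
#!/bin/sh
# Sketch only: set_hung_task_timeout is an illustrative name.
# $1 = timeout in seconds, $2 = optional alternative file for a dry run.
set_hung_task_timeout() {
    secs=$1
    knob=${2:-/proc/sys/kernel/hung_task_timeout_secs}
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (not root, or hung-task detection not built in?)" >&2
        return 1
    fi
    echo "$secs" > "$knob"
}
```

As root, `set_hung_task_timeout 10` lowers the timeout to 10 seconds until the next reboot.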


> Using a 3.2.26 kernel on ARM. AP mode. No encryption.
>
> I've collected ftrace events for sched mac80211 net napi cfg80211
> workqueue, which are included in the dmesg you can find here because
> of its size:
>
> http://tinyurl.com/dmesg-ftrace
>
> In the logs, the last timestamp that my test script wrote is:
>
> [ 1021.291495] hbeat0352
>
> I've captured ftrace events before and after 1021.291495.

Your dmesg looks really messy, so I'm not sure whether the SWBA events
really came in a burst or not.


Michał

2015-06-01 07:13:26

by Kalle Valo

Subject: Re: system hang with backports-20150511/20150525

Michal Kazior <[email protected]> writes:

> +ath10k list
>
> On 1 June 2015 at 03:37, Marty Faltesek <[email protected]> wrote:
>> Starting with backports-20150511, and continuing with
>> backports-20150525, we see frequent system hangs. backports-20150424
>> had no issue.
>
> I don't see such binary releases on
> https://backports.wiki.kernel.org/index.php/Main_Page
> Hence I don't know what kernel you've backported the drivers from and
> I can't compare anything.
>
> Can you provide more details, please?

I suspect it's from here:

https://www.kernel.org/pub/linux/kernel/projects/backports/2015/05/25/

The backports project pages are a bit confusing and that location is
hard to find.

--
Kalle Valo

2015-06-01 08:27:50

by Michal Kazior

Subject: Re: system hang with backports-20150511/20150525

On 1 June 2015 at 09:13, Kalle Valo <[email protected]> wrote:
> Michal Kazior <[email protected]> writes:
>
>> +ath10k list
>>
>> On 1 June 2015 at 03:37, Marty Faltesek <[email protected]> wrote:
>>> Starting with backports-20150511, and continuing with
>>> backports-20150525, we see frequent system hangs. backports-20150424
>>> had no issue.
>>
>> I don't see such binary releases on
>> https://backports.wiki.kernel.org/index.php/Main_Page
>> Hence I don't know what kernel you've backported the drivers from and
>> I can't compare anything.
>>
>> Can you provide more details, please?
>
> I suspect it's from here:
>
> https://www.kernel.org/pub/linux/kernel/projects/backports/2015/05/25/
>
> The backports project pages are a bit confusing and that location is
> hard to find.

Oh, thanks!

Hmm.. There were a ton of changes between 20150424 and 20150511. For
one, ath10k started to use the chanctx API and FAST_XMIT. But it's not
a given that these two are to blame.

The latter can be easily disabled by removing
IEEE80211_HW_SUPPORT_FAST_XMIT from ar->hw->flags in ath10k's mac.c.
The former.. not so easy. It'd be awesome if you could do a git bisect.
The commit ids are a3da0fb6 (good) and f17107c (bad); you need the
linux-next git repo, including its tags, for these ids to be resolvable.


Michał