2010-05-31 20:12:56

by Nils Radtke

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock


Hi Reinette,

First off:
Linux mypole 2.6.33.4 #23 SMP PREEMPT Sat May 15 20:27:33 CEST 2010 i686 GNU/Linux

On Thu 2010-05-20 @ 11-33-28AM -0700, reinette chatre wrote:
# On Thu, 2010-05-20 at 05:15 -0700, Nils Radtke wrote:
# > #
# > # To address (1), could you please run with attached debug patch and also
# > # enable rate scaling debugging. That will be "modprobe iwlagn
# > # debug=0x143fff).
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c: In function ‘rs_collect_tx_data’:
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: ‘priv’ undeclared (first use in this function)
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: (Each undeclared identifier is reported only once
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: for each function it appears in.)
# >
# > This happens when compiling w/ the patch applied cleanly against .33.3
# > I'll try to fix it, then conduct the field test.
#
# Sorry ... and thanks.
#
# > For the latter, do
# > you need the same kind of log as for the previous one?
# The goal of this patch is to find the reason behind the error
# "expected_tpt should have been calculated by now". From what I
# understand you only encountered that in one of your tests, not all. Any
# test you can run to reproduce that error will be welcome.
Yep. This expected_tpt stuff happens IRC (see mails for certainty, though)
exclusively on site B.

Ok, so that's the goal. What could I do to advance us an additional step
at a time avoiding pushing hundreds of kilobytes of logs uplink?

To reproduce it, I have to be on site B and just start surfing. It's a matter
of _short_ time until the driver hits the wall.

I'm suspecting still getting those expected_tpt thingies (see below). Though today
it seems I've been lucky, it (maybe) only happened once (or never today).

# Thinking about your question more ... I believe your previous debug logs
# were created with debug flag 0x43fff. For this iteration, please use
# debug flag 0x143fff.

If you insist on a test using .33.3, I will do so but that will have to wait.

Meanwhile I used this patch for .34 to fix the build err from your dbg patch:

drivers/net/wireless/iwlwifi/iwl-agn-rs.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
===================================================================
--- linux.orig/drivers/net/wireless/iwlwifi/iwl-agn-rs.c 2010-05-31 13:21:48.000000000 +0200
+++ linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c 2010-05-31 13:25:52.000000000 +0200
@@ -365,7 +365,8 @@
* packets.
*/
static int rs_collect_tx_data(struct iwl_scale_tbl_info *tbl,
- int scale_index, int attempts, int successes)
+ int scale_index, int attempts, int successes,
+ struct iwl_priv *priv)
{
struct iwl_rate_scale_data *window = NULL;
static const u64 mask = (((u64)1) << (IWL_RATE_MAX_WINDOW - 1));
@@ -868,7 +869,7 @@
&rs_index);
rs_collect_tx_data(curr_tbl, rs_index,
info->status.ampdu_ack_len,
- info->status.ampdu_ack_map);
+ info->status.ampdu_ack_map, priv);

/* Update success/fail counts if not searching for new mode */
if (lq_sta->stay_in_tbl) {
@@ -902,7 +903,7 @@
else
continue;
rs_collect_tx_data(tmp_tbl, rs_index, 1,
- i < retries ? 0 : legacy_success);
+ i < retries ? 0 : legacy_success, priv);
}

/* Update success/fail counts if not searching for new mode */


Test conducted using debug flag 0x143fff.

Test on Site B resulted immediately in a hard crash upon resume, that is I got to X,
activated wireless, echoed 0x143fff to the sysfile and that was it.
But I didn't count on that one so I had no console to watch for it. So right now
I got no clue what caused the crash, there's nothing in the logs, of course..

Next thing was to redo test on site B, but this time I switched to console beforehand.
Sure enough, this time nothing happened..

Log appended as bz2.

Hopefully lines like the following don't indicate essential info getting dropped:
May 31 17:23:17 localhost kernel: [91800.091565] net_ratelimit: 70 callbacks suppressed

This line indicates the first timestamp _after_ the crash:
May 31 17:35:19 localhost kernel: [ 69.488456]

The crash happened after site A and on site B. Just arrived, opened lid and *crash*.

I noticed in iwl-agn-rs.c:2080:
BUG_ON(window->average_tpt != ((window->success_ratio *
tbl->expected_tpt[index] + 64) / 128));
Could that be again the point that hit me today when the machine crashed once?
Would you mind changing this into a milder WARN? That way I wouldn't hit the wall
that hard. And I would notice it anyway while skimming the logs as we still are on the
hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
update..

Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)

The dbg log contains all types of the errs happening here:

- "deauthenticated" msgs w/ "reason 2"

- "request scan called when driver not ready"

- iwl_tx_agg_start

- of course the "expected_tpt should have been[..]" don't show up anymore, the source
has no more WARN regarding this..

And the rest of the 9Mb dbg log..

Could you tell me a bit about your idea of how to track those down? Maybe we can
speed things a little up. The logging and testing stuff takes a lot of time and
most of it I have no clue why that might help or what the goal is..
Are you online in some #chan?

# > # Regarding (2): This is a common issue in busy environments where AP
# > # decides to deathenticate station after it does not receive an ack for
# > # data sent after a few retries. Was this test done in busy environment?
# > Both. This happens in busy environment as well as in an idle one. Can't tell
# > yet whether there're more of those msgs in busy env. I start to feel against
# > Cisco APs..
#
# I don't know ... perhaps these APs have been set up to be strict wrt
# delays.
Sure. May well be.. For sure I'm no fan of this config/policy..
But wait. I noticed no such delays or disassociations using another notebook.
Well, seems I got to investigate in this, too.

# > # Regarding (3): Seems like driver is getting a request to scan after a
# > # request to remove interface. I am still inquiring about this.
# > Probably due to me switching of via RF_KILLSWITCH. But anyway I assume this
# > msg should not happen..
#
# Absolutely. What are the exact steps you run when you encounter this
# issue?
Nothing particular. I.e. after the tests conducted toggle the hw kill switch. Then
a script gets called via acpi callback that in turn 1) kills the respective dhclient3, 2)
terminates wpa_supplicant and 3) removes the wifi modules (for power saving).

Cheers,

Nils


Attachments:
(No filename) (7.04 kB)
2010-05-31_iwlwifi_dbg_filter.bz2 (356.56 kB)
Download all attachments

2010-06-02 17:51:28

by Reinette Chatre

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

On Mon, 2010-05-31 at 13:12 -0700, Nils Radtke wrote:

> This line indicates the first timestamp _after_ the crash:
> May 31 17:35:19 localhost kernel: [ 69.488456]
>
> The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
>
> I noticed in iwl-agn-rs.c:2080:
> BUG_ON(window->average_tpt != ((window->success_ratio *
> tbl->expected_tpt[index] + 64) / 128));
> Could that be again the point that hit me today when the machine crashed once?
> Would you mind changing this into a milder WARN? That way I wouldn't hit the wall
> that hard. And I would notice it anyway while skimming the logs as we still are on the
> hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
> update..
>
> Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)

Seems like you performed the testing without the patch that we used to
address the hang issue from the beginning of this thread. Please see
http://marc.info/?l=linux-wireless&m=127290931304496&w=2 - that thread
also explains why the patch is not in 2.6.34.

I think it is time to move this discussion to a bug report so that it
can be tracked better. Please open a new bug at
http://bugzilla.intellinuxwireless.org/

Reinette




2010-06-10 14:22:57

by Nils Radtke

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock


Hi Reinette,

Thanks for your message.

Yes, you're right about the multiple bugs one thread thing.

Just today I got registered w/ the wireless ml because the
system just did not send me a registration message.

For the bug reports to be created it will take me some time.
I'll firstly report the main issue, the 2 other ones afterwards.
Would it be ok cross referencing i.e. to the log and such
between the reports?

Should I paste all the mail messages in separate report messages
(belonging to one bug report, of course) or should I paste some
links to the thread?

Cheers,

Nils

@John: Yes, you're right but the 2.6.33.4 tree which for me still
has the bug_on in.


On Tue 2010-06-08 @ 10-46-29AM -0700, reinette chatre wrote:
# On Fri, 2010-06-04 at 09:57 -0700, Nils Radtke wrote:
# > I haven't yet received a comment of yours regarding my many other questions in
# > my previous message. I am willing to help investigate more, assist in other ways
# > than testing only (always only doing testing isn't a way to keep up fun..)
#
# Your messages contain references to many issues and it is becoming
# increasingly hard to keep track of them all in a single email thread.
# Since the system crash is clearly the big issue I would like to focus on
# that and get that resolved. This is why I proposed that you create bug
# reports to help track your various issues better.
#
# Reinette
#
#

--

2010-06-08 17:46:31

by Reinette Chatre

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

On Fri, 2010-06-04 at 09:57 -0700, Nils Radtke wrote:
> I haven't yet received a comment of yours regarding my many other questions in
> my previous message. I am willing to help investigate more, assist in other ways
> than testing only (always only doing testing isn't a way to keep up fun..)

Your messages contain references to many issues and it is becoming
increasingly hard to keep track of them all in a single email thread.
Since the system crash is clearly the big issue I would like to focus on
that and get that resolved. This is why I proposed that you create bug
reports to help track your various issues better.

Reinette



2010-06-10 16:19:32

by Reinette Chatre

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

On Thu, 2010-06-10 at 07:22 -0700, Nils Radtke wrote:
> For the bug reports to be created it will take me some time.
> I'll firstly report the main issue, the 2 other ones afterwards.

Sounds great. Thanks

> Would it be ok cross referencing i.e. to the log and such
> between the reports?
> Should I paste all the mail messages in separate report messages
> (belonging to one bug report, of course) or should I paste some
> links to the thread?

I find it most convenient if all information related to the bug is
contained in the bug report. Links can be used.

Reinette





2010-06-04 16:57:38

by Nils Radtke

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

Hi Reinette,

BTW, this:
Jun 3 12:05:43 localhost kernel: [174170.391756] iwlagn 0000:03:00.0:
TX Power requested while scanning!
happened even w/o toggling radio switch, so this seems not uniquely
related to toggling the radio switch.

On mer 2010-06-02 @ 10-51-25 -0700, reinette chatre wrote:
# On Mon, 2010-05-31 at 13:12 -0700, Nils Radtke wrote:
#
# > This line indicates the first timestamp _after_ the crash:
# > May 31 17:35:19 localhost kernel: [ 69.488456]
# >
# > The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
# >
# > I noticed in iwl-agn-rs.c:2080:
# > BUG_ON(window->average_tpt != ((window->success_ratio *
# > tbl->expected_tpt[index] + 64) / 128));
# > Could that be again the point that hit me today when the machine crashed once?
# > Would you mind changing this into a milder WARN? That way I wouldn't hit the wall
# > that hard. And I would notice it anyway while skimming the logs as we still are on the
# > hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
# > update..
# >
# > Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)
#
# Seems like you performed the testing without the patch that we used to
# address the hang issue from the beginning of this thread. Please see
Indeed, that's what it feels like. It is just so annoying, that one..
You can't work w/ the kernel drivers. That's a shame.
BTW, iff the patch for the BUG_ON is in kernel src since 2.6.28, that might
explain a lot of crashes before where I haven't never been able to track it down.
Even more, those days I hadn't a chance to do more on this. Unlike now.

# http://marc.info/?l=linux-wireless&m=127290931304496&w=2 - that thread
# also explains why the patch is not in 2.6.34.
It should definitely and absolutely be merged (change the BUG_ON into WARNING).
Even if, like hypothesized, the bug is hidden elsewhere, a BUG_ON doesn't get
me far, it's killing every chance to advance to a solution. How am I supposed
to investigate w/ the kernel crashing? BTW, I don't like working w/ a Linux
kernel that kills my work regularly, I think that's understandable. If I needed
a break from work, I'd set an alarm.

I've seen a bugreport on this issue on the redhat bts referencing my word about
this BUG_ON only getting hit w/ cisco APs. There's a wide range of AP manufacturers
out there in the city. But only cisco APs are crashing this driver. Admittedly, only
on one single location, but anyway it's a cisco. Always the same MAC, unless they
use to reassign MAC addresses, though..

I think it's a tough one, if an AP is able to crash the driver.

I haven't yet received a comment of yours regarding my many other questions in
my previous message. I am willing to help investigate more, assist in other ways
than testing only (always only doing testing isn't a way to keep up fun..)

# I think it is time to move this discussion to a bug report so that it
# can be tracked better. Please open a new bug at
# http://bugzilla.intellinuxwireless.org/
As you wish. It's probably a good idea. But I still miss the registration mail
from bz, did register yesterday.

So, please see to it, that the patch rendering the BUG_ON into a
WARNING finds it's way back in.

Thank you very much,


Nils Radtke