2010-05-03 19:30:11

by John W. Linville

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

On Mon, May 03, 2010 at 09:17:56PM +0200, [email protected] wrote:

> Strangely, as stated in the previous message, this bug does only happen in conjunction at a specific
> geographical position, so far it only happened there that is. How does this correlate with the designated
> code at line 2076 in iwl-agn-rs.c?:
>
> /* Sanity-check TPT calculations */
> BUG_ON(window->average_tpt != ((window->success_ratio *
> tbl->expected_tpt[index] + 64) / 128));

Interestingly enough, we have been discussing this line of code today. Could you try the patch here?

http://marc.info/?l=linux-wireless&m=127290931304496&w=2

Does it address the issue you are experiencing?

John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.


2010-05-10 23:32:06

by Reinette Chatre

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

On Mon, 2010-05-10 at 11:36 -0700, Nils Radtke wrote:
> Today weather was fine again, finally. So testing with .33.3 w/ the patch applied:
>
> http://marc.info/?l=linux-wireless&m=127290931304496&w=2
>
> The kernel kernel .32 was still running before it crashed immediately on wireless activation.
> The crash log showed again at least two messages, the last was as already described in my first
> message, bug from 2010-04-30: I think even the 0x2030 was the same:
>
> EIP rs_tx_status +x8f/x2030

You report an issue on 2.6.32 ...

>
> W/ .33.3 and the above patch applied:

... but then test the patch with 2.6.33.

Which kernel are you focused on?


> Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux
>
> May 10 19:14:11 [ 80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:17 [ 626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:30 [ 638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:32 [ 641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:54 [ 663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:58 [ 666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:24:02 [ 671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Can you see any impact on your connection speed that can be connected to
these messages?

> Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)
>
> May 10 19:24:16 [ 685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:24:22 [ 691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:28:02 [ 911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
>
> The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
> dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
> got it's IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
> with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now"
> problem).

It could be that some of the queues get stuck. Can you try with the
patches in
http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c113 ? They
are based on 2.6.33.

Reinette




2010-05-06 09:14:38

by Christian Borntraeger

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

Am Montag 03 Mai 2010 21:22:19 schrieb John W. Linville:
> > /* Sanity-check TPT calculations */
> > BUG_ON(window->average_tpt != ((window->success_ratio *
> > tbl->expected_tpt[index] + 64) / 128));
>
> Interestingly enough, we have been discussing this line of code today. Could you try the patch here?
>
> http://marc.info/?l=linux-wireless&m=127290931304496&w=2

I also see a hard lockup some time after connection to my companies
wireless network. My private network does not seem to trigger that bug.
Unfortunately the kernel is not able to switch back graphics, so I
cannot tell if I see the same BUG - even if the problem description is
the same.
For reference, the patch above does not help on my T61p.

It started soon after 2.6.34-rc4. Before and with rc4 I had to
apply this http://patchwork.ozlabs.org/patch/49850/mbox/ patch to avoid
the other crash. With only this patch on top of rc4 everything seemed to
work fine, so the lockup seems to be triggered by one of the other patches.
Sometimes it takes some minutes to crash, which makes it hard to bisect
the problem.

Christian

2010-05-10 18:36:51

by Nils Radtke

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

Hi John,

Today weather was fine again, finally. So testing with .33.3 w/ the patch applied:

http://marc.info/?l=linux-wireless&m=127290931304496&w=2

The kernel kernel .32 was still running before it crashed immediately on wireless activation.
The crash log showed again at least two messages, the last was as already described in my first
message, bug from 2010-04-30: I think even the 0x2030 was the same:

EIP rs_tx_status +x8f/x2030




W/ .33.3 and the above patch applied:

Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux

May 10 19:14:11 [ 80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:17 [ 626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:30 [ 638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:32 [ 641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:54 [ 663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:58 [ 666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:24:02 [ 671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now


Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)

May 10 19:24:16 [ 685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:24:22 [ 691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:28:02 [ 911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0

The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
got it's IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now"
problem).



Another type occurred: (probably .33.2 or something)

May 3 17:58:11 [ 3946.608743] iwlagn 0000:03:00.0: request scan called when driver not ready.
May 3 17:58:12 [ 3948.082684] iwlagn 0000:03:00.0: TX Power requested while scanning!
May 3 18:01:00 [ 4115.282852] iwlagn 0000:03:00.0: RF_KILL bit toggled to disable radio.

Is this "TX Power requested while scanning" because of RF_KILL set to off?



Thank you,

Nils


2010-05-11 17:21:34

by Reinette Chatre

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

On Tue, 2010-05-11 at 08:50 -0700, Christian Borntraeger wrote:
> Am Donnerstag 06 Mai 2010 18:28:48 schrieb reinette chatre:
> > Below seven iwlwifi patches were added after rc4. If you are unable to
> > bisect ... perhaps you can run a while by reverting more and more from
> > this list?
> >
> > f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> > 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> > ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> > de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> > 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> > dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings
>
> Just to give you some feedback. Sometimes it takes a while until it crashes.
> My hand made bisect is currently at two remaining patches:
>
> de0f60ea94e132c858caa64a44b2012e1e8580b0
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
>
> reverting both solves my hard lockup. I will try to isolate the "bad" patch
> but this takes some more days since I wont be in the "hazardous environment"
> this week.

Thank you for digging into this. It will be very helpful it you can get
us a trace of the crash - any chance that netconsole may work?

Reinette



2010-05-11 15:50:19

by Christian Borntraeger

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

Am Donnerstag 06 Mai 2010 18:28:48 schrieb reinette chatre:
> Below seven iwlwifi patches were added after rc4. If you are unable to
> bisect ... perhaps you can run a while by reverting more and more from
> this list?
>
> f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings

Just to give you some feedback. Sometimes it takes a while until it crashes.
My hand made bisect is currently at two remaining patches:

de0f60ea94e132c858caa64a44b2012e1e8580b0
8b9fce77737ae9983f61ec56cd53f52fb738b2c7

reverting both solves my hard lockup. I will try to isolate the "bad" patch
but this takes some more days since I wont be in the "hazardous environment"
this week.

Christian

2010-05-06 16:28:49

by Reinette Chatre

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

Hi Christian,

On Thu, 2010-05-06 at 02:14 -0700, Christian Borntraeger wrote:
> Am Montag 03 Mai 2010 21:22:19 schrieb John W. Linville:
> > > /* Sanity-check TPT calculations */
> > > BUG_ON(window->average_tpt != ((window->success_ratio *
> > > tbl->expected_tpt[index] + 64) / 128));
> >
> > Interestingly enough, we have been discussing this line of code today. Could you try the patch here?
> >
> > http://marc.info/?l=linux-wireless&m=127290931304496&w=2
>
> I also see a hard lockup some time after connection to my companies
> wireless network. My private network does not seem to trigger that bug.
> Unfortunately the kernel is not able to switch back graphics, so I
> cannot tell if I see the same BUG - even if the problem description is
> the same.

It will be hard to debug this without some logs. Can you perhaps run
with netconsole for a while? Is it possible to trigger this when not in
X to be able to get some information about where issue is?

> For reference, the patch above does not help on my T61p.
>
> It started soon after 2.6.34-rc4. Before and with rc4 I had to
> apply this http://patchwork.ozlabs.org/patch/49850/mbox/ patch to avoid
> the other crash. With only this patch on top of rc4 everything seemed to
> work fine, so the lockup seems to be triggered by one of the other patches.
> Sometimes it takes some minutes to crash, which makes it hard to bisect
> the problem.

Below seven iwlwifi patches were added after rc4. If you are unable to
bisect ... perhaps you can run a while by reverting more and more from
this list?

f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings

Reinette



2010-05-12 15:18:44

by Christian Borntraeger

[permalink] [raw]
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

Am Dienstag 11 Mai 2010 19:21:33 schrieb reinette chatre:
> > de0f60ea94e132c858caa64a44b2012e1e8580b0
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
> >
> > reverting both solves my hard lockup. I will try to isolate the "bad" patch
> > but this takes some more days since I wont be in the "hazardous environment"
> > this week.

Drat! Now I got a lockup with these two patches reverted. Dont know if that was
the same bug.

> Thank you for digging into this. It will be very helpful it you can get
> us a trace of the crash - any chance that netconsole may work?

No success so far.