2009-07-13 08:35:53

by Chris Clayton

Subject: 2.6.31-rc2: Possible regression in rt61pci driver

Hi,

Please cc me on any reply because I'm not subscribed.

I've been testing 2.6.31 development kernels on my laptop and find
that I can induce a complete lock-up more or less at will. To do so,
all I have to do is generate some network traffic on my wireless LAN
(I've been using wget to transfer a file from another box on my LAN)
and then wait. If I run netstat repeatedly while waiting, I see a TCP
connection to port 21 on another box on my LAN in a TIME_WAIT state.
It seems that when that connection disappears, the laptop locks up
hard and I can only recover by powering off and on again. I think the
problem is related to the rt61pci driver because I haven't been able
to induce the lock-up when using a wireless card that's supported by
the ath5k driver. I started bisecting, but a couple of times I arrived
at points where although the kernel builds OK, I have no network
connectivity. I guessed at good, but the bisection process finished at
a change that can't be the culprit (because it's for a different
architecture).

I attach the best diagnostics I can think of at this point in time
(but am more than happy to provide any others that are requested). They
include the output from dmesg from a boot that locked up and the
syslog journal from that boot; a description of the wireless card from
lspci -v and the output from netstat that shows the connection I think
is involved. As I say, feel free to ask for any other diagnostics that
will help track the problem down.

I have confirmed that the problem is still present in a kernel built
after a 'git pull' this morning, although it was somewhere around the
time that -rc2 was released that I first came across it. I cannot
induce the problem with 2.6.30.1.

Thanks

Chris


--
No, Sir; there is nothing which has yet been contrived by man, by
which so much happiness is produced as by a good tavern or inn -
Doctor Samuel Johnson


Attachments:
dmesg.txt (29.63 kB)
syslog (85.67 kB)
lspci.txt (345.00 B)
netstat.txt (1.51 kB)

2009-07-14 11:05:02

by Chris Clayton

Subject: Re: 2.6.31-rc2: Possible regression in rt61pci driver

Hi again,


2009/7/13 Chris Clayton <[email protected]>:
> Hi,
>
> Please cc me on any reply because I'm not subscribed.
>
> I've been testing 2.6.31 development kernels on my laptop and find
> that I can induce a complete lock-up more or less at will. To do so,
> all I have to do is generate some network traffic on my wireless LAN
> (I've been using wget to transfer a file from another box on my LAN)
> and then wait. If I run netstat repeatedly while waiting, I see a TCP
> connection to port 21 on another box on my LAN in a TIME_WAIT state.
> It seems that when that connection disappears, the laptop locks up
> hard and I can only recover by powering off and on again. I think the
> problem is related to the rt61pci driver because I haven't been able
> to induce the lock-up when using a wireless card that's supported by
> the ath5k driver. I started bisecting, but a couple of times I arrived
> at points where although the kernel builds OK, I have no network
> connectivity. I guessed at good, but the bisection process finished at
> a change that can't be the culprit (because it's for a different
> architecture).
>
> I attach the best diagnostics I can think of at this point in time
> (but am more than happy to provide any others that are requested). It
> includes the output from dmesg from a boot that locked up and the
> syslog journal from that boot; a description of the wireless card from
> lspci -v and the output from netstat that shows the connection I think
> is involved. As I say, feel free to ask for any other diagnostics that
> will help track the problem down.
>
> I have confirmed that the problem is still present in a kernel built
> after a 'git pull' this morning, although it was somewhere around the
> time that -rc2 was released that I first came across it. I cannot
> induce the problem with 2.6.30.1.

I've updated to 2.6.31-rc3 this morning and done some more testing.
I'm now convinced that the rt61pci driver is somehow involved in
locking up the laptop. With the (Belkin) rt61 card inserted, the
machine will lock up even if I am doing nothing (no web browsing,
email or anything else at all) except running this script in a console
window:

i=0
while true; do
    let i++
    echo -n "$i "
    sleep 1
done

In the tests I have done so far, the counter has never gone beyond 240
before the machine locked. With the (no-name) ath5k card inserted I
can use the laptop for normal web browsing, email, etc with no
problems - the counter in the script above gets to over 2000.
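
A variant of the loop above that also writes each tick to a file may make it easier to read off the last value after a hard lock-up. This is only a sketch under assumptions: the function name count_ticks, its arguments, and the log file path are hypothetical, and sync is called after every write so the last line has a chance of reaching the disk before the freeze.

```shell
# count_ticks LOGFILE [INTERVAL] [LIMIT]
# Hypothetical helper: appends "<count> <HH:MM:SS>" to LOGFILE once per
# INTERVAL seconds (default 1). LIMIT=0 (the default) means run forever.
count_ticks() {
    ct_log=$1
    ct_interval=${2:-1}
    ct_limit=${3:-0}
    ct_i=0
    while [ "$ct_limit" -eq 0 ] || [ "$ct_i" -lt "$ct_limit" ]; do
        ct_i=$((ct_i + 1))
        printf '%s %s\n' "$ct_i" "$(date '+%H:%M:%S')" >> "$ct_log"
        sync    # flush to disk so the last tick survives a lock-up
        sleep "$ct_interval"
    done
}
```

For example, count_ticks "$HOME/freeze-count.log" runs until the machine locks; after the reboot, the last line of the file gives the final count and the time it was written.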

As I said yesterday, I'm happy to provide additional diagnostics,
apply patches, etc.

Thanks

Chris

2009-07-21 11:39:26

by Chris Clayton

Subject: Re: 2.6.31-rc2: Possible regression in rt61pci driver

2009/7/14 Chris Clayton <[email protected]>:
<snip>

> I've updated to 2.6.31-rc3 this morning and done some more testing.
> I'm now convinced that the rt61pci driver is somehow involved in
> locking up the laptop. With the (Belkin) rt61 card inserted, the
> machine will lock up even if I am doing nothing (no web browsing,
> email or anything else at all) except running this script in a console
> window:
>
> i=0
> while true; do
>     let i++
>     echo -n "$i "
>     sleep 1
> done
>
> In the tests I have done so far, the counter has never gone beyond 240
> before the machine locked. With the (no-name) ath5k card inserted I
> can use the laptop for normal web browsing, email, etc with no
> problems - the counter in the script above gets to over 2000.
>

The freeze still happens with 2.6.31-rc3-git5, but I've been doing
some more fact-finding.

Running the script shown above and with the rt61-based card inserted,
I can freeze the laptop even if I am doing nothing else on the laptop.
When the freeze occurs, the laptop is effectively dead: no response to
mouse movement or keyboard input and no response to pings from
another machine on my network. However, if I eject the card, the
laptop comes to life again. The key presses from when the laptop was
frozen appear on screen and pings from another machine are responded
to. The script continues to run and display the counter. I then
reinsert the card and everything appears OK until the laptop freezes
again a minute or two later. During a test run this morning the
machine froze at (from the output of the script) 80, 235, 369, 538 and
672. Each time, ejecting the card brought the machine back to life.

Trying the same test with the ath5k-based card inserted resulted in
the script getting to 2300 without the laptop freezing, at which point
I stopped the script.

I started trying to isolate the change that causes the problem by
reverting changes to the files in drivers/net/wireless/rt2x00. The
change "rt2x00: Remove last usage of beacon_int from
ieee80211_config" reverted cleanly and the kernel built OK, but I
still got the freeze. "rt2x00: Remove usage of
IEEE80211_CONF_CHANGE_BEACON_INTERVAL" also reverted cleanly but the
kernel doesn't build because of dependencies on changes to mac80211.
I'm afraid I am out of my depth now, so I will have to abandon that
line of enquiry.

I hope this new information helps track the problem down. I've
attached the output from dmesg that shows the messages emitted when I
eject the card, plus my config.

Thanks,

Chris


Attachments:
rt61freeze.log (19.20 kB)
.config (49.24 kB)

2009-07-26 19:15:22

by Chris Clayton

Subject: Re: 2.6.31-rc2: Possible regression in rt61pci driver

2009/7/21 Chris Clayton <[email protected]>:
> 2009/7/14 Chris Clayton <[email protected]>:
> <snip>
>
>> I've updated to 2.6.31-rc3 this morning and done some more testing.
>> I'm now convinced that the rt61pci driver is somehow involved in
>> locking up the laptop. With the (Belkin) rt61 card inserted, the
>> machine will lock up even if I am doing nothing (no web browsing,
>> email or anything else at all) except running this script in a console
>> window:
>>
>> i=0
>> while true; do
>>     let i++
>>     echo -n "$i "
>>     sleep 1
>> done
>>
>> In the tests I have done so far, the counter has never gone beyond 240
>> before the machine locked. With the (no-name) ath5k card inserted I
>> can use the laptop for normal web browsing, email, etc with no
>> problems - the counter in the script above gets to over 2000.
>>
>
> The freeze still happens with 2.6.31-rc3-git5, but I've been doing
> some more fact-finding.
>
> Running the script shown above and with the rt61-based card inserted,
> I can freeze the laptop even if I am doing nothing else on the laptop.
> When the freeze occurs, the laptop is effectively dead, no response to
> mouse movement or keyboard input and no response to pings from
> another machine on my network. However, if I eject the card, the
> laptop comes to life again. The key presses from when the laptop was
> frozen appear on screen and pings from another machine are responded
> to. The script continues to run and display the counter. I then
> reinsert the card and everything appears OK until the laptop freezes
> again a minute or two later. During a test run this morning the
> machine froze at (from the output of the script) 80, 235, 369, 538 and
> 672. Each time, ejecting the card brought the machine back to life.
>
> Trying the same test with the ath5k-based card inserted resulted in
> the script getting to 2300 without the laptop freezing, at which point
> I stopped the script.
>

One more data point. I wondered whether the freeze would "time out" if
I just left the laptop frozen, but my testing shows that it probably
does not (or if it does, it takes more than 27 minutes to do so).

I've also tried to bisect again, but, as last time, once I got to the
batch of network-related changes that went into -rc1, I get a series
of kernels that build but either won't boot or have inoperable
wireless networking.

Finally, since I haven't had a single reply to the regression report I
posted almost two weeks ago, I now give up. I'll switch to using the
card supported by the ath5k driver.

<snip>

Chris

2009-07-26 20:10:18

by Pavel Roskin

Subject: Re: 2.6.31-rc2: Possible regression in rt61pci driver

On Sun, 2009-07-26 at 20:15 +0100, Chris Clayton wrote:

> One more data point. I wondered whether the freeze would "time out" if
> I just left the laptop frozen, but my testing shows that it probably
> does not (or if it does, it takes more than 27 minutes to do so).

I suggest that you run it on the text console after "dmesg -n 8", so
that all kernel messages are seen.

I'm using rt61pci with wireless-testing, and I don't see any freezes.

> I've also tried to bisect again, but, as last time, once I got to the
> batch of network-related changes that went into -rc1, I get a series
> of kernels that build but either won't boot or have inoperable
> wireless networking.

You can use "git bisect skip" to skip those revisions.

You can specify the paths in "git bisect start" so that only changes to
the interesting places (like drivers/net/wireless, net/mac80211 and
net/wireless) are considered when calculating the next commit. This
will probably help you avoid the bad place.
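
The two tips above can be sketched as a pair of shell helpers. This is a minimal sketch rather than a tested recipe: the function names are hypothetical, and the refs in the usage note are assumptions based on the kernel versions reported in this thread.

```shell
# Hypothetical helpers; run them from the top of a kernel git tree.

# start_wireless_bisect BAD GOOD
# Start a bisection that only considers commits touching the wireless paths.
start_wireless_bisect() {
    git bisect start "$1" "$2" -- \
        drivers/net/wireless net/mac80211 net/wireless
}

# mark_untestable
# Tell git the current revision cannot be judged (it does not boot, or has
# no working wireless) so it picks a nearby commit instead of forcing a guess.
mark_untestable() {
    git bisect skip
}
```

For example, start_wireless_bisect v2.6.31-rc2 v2.6.30, then build and boot each kernel that git checks out and answer with "git bisect good", "git bisect bad", or mark_untestable.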

Also, please try the current wireless-testing. The problem may be fixed
there.

> Finally, since I haven't had a single reply to the regression report I
> posted almost two weeks ago, I now give up. I'll switch to using the
> card supported by the ath5k driver.

Perhaps there was not enough material for others to comment on, and
nobody was experiencing anything similar.

If I had such a problem, I would try to bisect it, but I cannot reproduce
it.

--
Regards,
Pavel Roskin

2009-07-26 21:33:12

by Chris Clayton

Subject: Re: 2.6.31-rc2: Possible regression in rt61pci driver

Thanks for the reply, Pavel.

2009/7/26 Pavel Roskin <[email protected]>:
> On Sun, 2009-07-26 at 20:15 +0100, Chris Clayton wrote:
>
>> One more data point. I wondered whether the freeze would "time out" if
>> I just left the laptop frozen, but my testing shows that it probably
>> does not (or if it does, it takes more than 27 minutes to do so).
>
> I suggest that you run it on the text console after "dmesg -n 8", so
> that all kernel messages are seen.
>
> I'm using rt61pci with wireless-testing, and I don't see any freezes.

Do you have CONFIG_MAC80211_DEFAULT_PS enabled? I have just built and
installed -rc4 with this option disabled and it has survived almost 20
minutes so far without a freeze. No previous kernel in the 2.6.31
series has survived more than 5 minutes without freezing, so this
looks promising.
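
For anyone comparing setups, a small helper like the one below could report how an option is set in a kernel .config file. It is a sketch only: the function name config_state is hypothetical, and it assumes the usual .config line formats ("OPTION=value" and "# OPTION is not set").

```shell
# config_state CONFIG_FILE OPTION
# Hypothetical helper: prints the value of OPTION (for example "y" or "m"),
# or "not set" if the option is disabled or absent from CONFIG_FILE.
config_state() {
    cfg_line=$(grep -E "^(# )?$2( is not set|=)" "$1" | tail -n 1)
    case $cfg_line in
        "$2="*) printf '%s\n' "${cfg_line#*=}" ;;
        *)      printf 'not set\n' ;;
    esac
}
```

For example, config_state .config CONFIG_MAC80211_DEFAULT_PS run at the top of the build tree shows whether the option was enabled for that build.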
>
>> I've also tried to bisect again, but, as last time, once I got to the
>> batch of network-related changes that went into -rc1, I get a series
>> of kernels that build but either won't boot or have inoperable
>> wireless networking.
>
> You can use "git bisect skip" to skip those revisions.
>
> You can specify the paths in "git bisect start" so that only changes to
> the interesting places (like drivers/net/wireless, net/mac80211 and
> net/wireless) are considered when calculating the next commit. This
> will probably help you avoid the bad place.
>

Thanks for those tips. I'll note them in my "useful stuff I might
forget" notebook and then try to find time over the next few weeks to
get to grips with the power of git.

Chris

