I have modified my code that is using a 9170. I am really concerned about
roaming and so am testing that pretty hard. Yesterday I had a loop that
forced a DISCONNECT followed by a REASSOCIATE every 30 seconds. After
running for between 1:30 and 1:40, it failed: scan results stopped coming
in. When I looked into a log, the very last scan results that I received
had a reduced number of BSSs, down from 10-12 per scan to 4, and the next
scan returned zero. It never recovered. Every scan from then on failed to
return any results and, of course, the re-associate failed. This 'feels'
to me like a memory
leak somewhere, either in the firmware or the driver. I am running the
2.6.31 kernel/driver and the dual file firmware and version 0.6.10 of the
supplicant. At the moment I am running another test where it roams every 60
seconds rather than 30 seconds to see what kind of difference that makes. I
know that my kernel is old, but for now I don't have a choice. Does anyone
have any experience like this or insight into this new problem? This is an
embedded device that doesn't have the memory of a PC. Is there some way that
I could instrument something to check this?
Thank you,
Chuck
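
On the instrumentation question above: one low-cost way to check for a
kernel-side leak is to sample /proc/meminfo while the roaming loop runs
and see whether MemFree keeps falling or Slab keeps growing. What follows
is only a minimal sketch, assuming Python is available on the target. The
function names (parse_meminfo, memory_delta, read_meminfo) are made up
for illustration and do not come from the thread.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "MemFree:   12000 kB"; some names carry a
        # parenthesised suffix, e.g. "Active(anon)".
        m = re.match(r"^(\w+(?:\(\w+\))?):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def memory_delta(before, after, keys=("MemFree", "Slab")):
    """Return after - before (in kB) for each key present in both samples."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}


def read_meminfo(path="/proc/meminfo"):
    """Take one sample from the running kernel."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Taking a baseline with read_meminfo() before the test starts and printing
memory_delta(baseline, read_meminfo()) once per roam cycle gives a trend
to look at. A MemFree figure that wanders but stays level argues against
a leak; one that only ever falls is worth chasing further.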
Well, (as usual) I was wrong. It isn't a memory problem. It seems that after
some indeterminate time, the USB interface locks up. When we try to take it
down (ifconfig wlan0 down) we get a message about outstanding urbs. By
powering down the 9170 we can re-set the device and get it to re-associate
and resume work. So, the problem is a USB problem. The question is if it is
a module problem or a system problem. We are typically seeing this after
50-200 reassociations. If we don't reassociate, it doesn't seem to occur.
Does anyone else have experience or insight into this?
Chuck
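
To pin down how many reassociations it takes before the device stops
returning scan results, the test loop can count BSS entries after every
cycle and stop at the first empty scan. The sketch below is a rough
illustration only: it drives the supplicant through the wpa_cli
subcommands disconnect, reassociate and scan_results, whereas the
original test code talks to the supplicant on its own. The helper names
(count_bss, wpa_cli, stress) and the wlan0 default are assumptions.

```python
import re
import subprocess
import time

# A scan_results entry starts with the BSSID (six colon-separated hex bytes).
BSS_LINE = re.compile(r"^(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\s")


def count_bss(scan_results_text):
    """Count BSS entries in `wpa_cli scan_results` output, skipping headers."""
    return sum(1 for line in scan_results_text.splitlines() if BSS_LINE.match(line))


def wpa_cli(*args, iface="wlan0"):
    """Run one wpa_cli command and return whatever it printed."""
    result = subprocess.run(["wpa_cli", "-i", iface, *args],
                            capture_output=True, text=True)
    return result.stdout


def stress(cycles, period_s=30, run=wpa_cli, sleep=time.sleep):
    """Disconnect/reassociate repeatedly.

    Returns the cycle number at which the scan list came back empty,
    or None if every cycle still saw at least one BSS.
    """
    for n in range(1, cycles + 1):
        run("disconnect")
        run("reassociate")
        sleep(period_s)  # give the supplicant time to scan and associate
        seen = count_bss(run("scan_results"))
        print(f"cycle {n}: {seen} BSSs")
        if seen == 0:
            return n
    return None
```

Calling stress(500) would run up to 500 cycles at the 30-second period
used in the original test. The run and sleep parameters exist only so the
loop can be exercised without hardware.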
On Mon, Sep 27, 2010 at 4:01 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Mon, Sep 27, 2010 at 3:40 PM, Chuck Crisler <[email protected]> wrote:
>> Well, (as usual) I was wrong. It isn't a memory problem. It seems that after
>> some indeterminate time, the USB interface locks up. When we try to take it
>> down (ifconfig wlan0 down) we get a message about outstanding urbs. By
>> powering down the 9170 we can re-set the device and get it to re-associate
>> and resume work. So, the problem is a USB problem. The question is if it is
>> a module problem or a system problem. We are typically seeing this after
>> 50-200 reassociations. If we don't reassociate, it doesn't seem to occur.
>> Does anyone else have experience or insight into this?
>
> Upgrade.
Let me clarify: 2.6.31 is not supported; it's no longer listed on
kernel.org as a supported kernel. You are losing valuable fixes by not
moving away from it. If you don't have a plan to move, you need one. If
you have policies that lock you down to old kernels, try to change them.
Luis
On Mon, 2010-09-27 at 16:01 -0700, Luis R. Rodriguez wrote:
> On Mon, Sep 27, 2010 at 3:40 PM, Chuck Crisler <[email protected]> wrote:
> > Well, (as usual) I was wrong. It isn't a memory problem. It seems that after
> > some indeterminate time, the USB interface locks up. When we try to take it
> > down (ifconfig wlan0 down) we get a message about outstanding urbs. By
> > powering down the 9170 we can re-set the device and get it to re-associate
> > and resume work. So, the problem is a USB problem. The question is if it is
> > a module problem or a system problem. We are typically seeing this after
> > 50-200 reassociations. If we don't reassociate, it doesn't seem to occur.
> > Does anyone else have experience or insight into this?
>
> Upgrade.
Won't help. I've seen that issue as recently as 2.6.35 (I think) with
ar9170, and eventually figured I wouldn't bother and started using
carl9170.
johannes
On Mon, Sep 27, 2010 at 10:16 AM, Chuck Crisler <[email protected]> wrote:
> I have modified my code that is using a 9170. I am really concerned about
> roaming and so am testing that pretty hard. Yesterday I had a loop that
> forced a DISCONNECT followed by a REASSOCIATE every 30 seconds. After
> between 1:30 and 1:40 it failed by no longer receiving scan results. When I
> looked into a log, the very last scan results that I received had a reduced
> number of BSSs, down from 10-12 per scan to 4, then the next scan was zero.
> It never recovered. All scans always failed to return any results from then
> on and, of course, the re-associate failed. This 'feels' to me like a memory
> leak somewhere, either in the firmware or the driver. I am running the
> 2.6.31 kernel/driver and the dual file firmware and version 0.6.10 of the
> supplicant.
Both are ancient. Please try compat-wireless-2.6.36-rc3-1. I will soon
make a new release with some stable fixes applied that are not yet in
Linus' tree, which I think will help a lot with your roaming testing. I
should also note that roaming was not possible until circa 2.6.33, when
Jouni allowed cfg80211 to authenticate to two APs at the same time
and then move over to the new AP to associate. Also, although technically
older userspace should work with newer kernels, I have noted some issues
with some really old supplicants on current kernels. I don't think there
has been enough motivation to track down the exact issues, though, so your
best bet is to just upgrade the supplicant.
> At the moment I am running another test where it roams every 60
> seconds rather than 30 seconds to see what kind of difference that makes. I
> know that my kernel is old, but for now I don't have a choice. Does anyone
> have any experience like this or insight into this new problem? This is an
> embedded device that doesn't have the memory of a PC. Is there some way that
> I could instrument something to check this?
I'm testing roaming by using wpa_cli roam <bss> in an ESS every 5
seconds. To really stress test the hell out of this I force a roam
every second too; it's quite fun. It created a crash, but I think we now
know one of the main issues behind some warnings, and Johannes has been
brainstorming some solution. I don't suspect you'll hit these corner
cases unless you roam every 2 seconds or so. The warnings are related
to the fact that we assume the STA peer channel is the currently
operating one when we TX a frame, and if we have already associated to
another station when moving from 2.4 GHz to 5 GHz we can potentially
be trying to send a frame to a peer with no valid bitrate.
You can use my script to test stuff as well:
http://bombadil.infradead.org/~mcgrof/test-roam
For example, if you already know your ESS, just replace the ESS variable
with the set of BSSes for your ESS. They all must be on the same SSID,
though.
Luis
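
The roam-every-N-seconds approach described above can be approximated
with a short loop that walks round-robin through a known list of BSSIDs
and issues wpa_cli roam for each one. This is a sketch under stated
assumptions, not a copy of the test-roam script, whose contents are not
shown in this thread. The names roam_commands and roam_loop, the wlan0
default, and the check for an "OK" reply are illustrative choices, and
the roam command may not be available in older supplicant versions.

```python
import itertools
import subprocess
import time


def roam_commands(bssids, count, iface="wlan0"):
    """Yield `count` wpa_cli argument lists, cycling through `bssids`."""
    for bssid in itertools.islice(itertools.cycle(bssids), count):
        yield ["wpa_cli", "-i", iface, "roam", bssid]


def _run(cmd):
    """Run one command and return its standard output."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def roam_loop(bssids, count, interval_s=5, iface="wlan0",
              run=_run, sleep=time.sleep):
    """Issue `count` roam requests, one every `interval_s` seconds.

    Returns how many requests did not get an "OK" reply.
    """
    failures = 0
    for cmd in roam_commands(bssids, count, iface):
        if run(cmd).strip() != "OK":
            failures += 1
        sleep(interval_s)
    return failures
```

For example, roam_loop(["aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"], 100)
would alternate between two placeholder BSSIDs every 5 seconds. As noted
above, all of the BSSes must belong to the same SSID.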