2011-05-09 22:52:15

by Rafał Miłecki

Subject: Regression affecting b43 LP-PHY card

Juan owns a Lenovo affected by the well-known LP-PHY DMA errors. His
testing procedure is as follows:
modprobe wl; connect; download something small; rmmod wl;
modprobe b43; download 2GB
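The procedure above could be scripted roughly as follows. This is a minimal sketch, not Juan's actual script: the report does not say which tools he used to connect or download, so `wget`, the placeholder URLs `SMALL_URL`/`LARGE_URL` and the helper names are assumptions, and the connect step is left interactive. Loading `wl` first presumably serves as the workaround for the DMA errors (the regression shows up "even after" this procedure), but that is an inference.

```shell
#!/bin/sh
# Sketch of the b43 reproduction procedure (hypothetical helper, not
# Juan's actual script). Run as root. The URLs are placeholders.
SMALL_URL=${SMALL_URL:-http://example.com/small.bin}
LARGE_URL=${LARGE_URL:-http://example.com/2GB.bin}

# "connect" is left manual: associate with the AP using whatever the
# machine normally uses (NetworkManager, wpa_supplicant, ...).
wait_for_connect() {
    printf 'Connect to the AP, then press Enter: ' >&2
    read -r dummy || true   # EOF on stdin is not treated as an error
}

reproduce() {
    # Phase 1: bring the card up once with Broadcom's wl driver and move
    # a little traffic through it (presumably the DMA-error workaround;
    # this is inferred, not stated in the report).
    modprobe wl
    wait_for_connect
    wget -O /dev/null "$SMALL_URL"
    rmmod wl

    # Phase 2: switch to b43 and start the 2GB transfer; on the affected
    # wireless-testing kernels the machine disconnects during this phase.
    modprobe b43
    wait_for_connect   # assumption: the report lists no connect step here
    wget -O /dev/null "$LARGE_URL"
}
```

Source the file and call `reproduce` as root; the extra connect step after `modprobe b43` is an assumption, since the interface has to associate before the download can start.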

While working on the DMA errors we discovered that wireless-testing
does not work well for him. Even after performing the above procedure,
his machine disconnects quickly and he is not able to reconnect. We
tested 2.6.39-rc6 from a tarball and it worked fine. I'd like to
highlight that we switched between mainline and wireless-testing a few
times, so it is not a random issue.

I suspected this regression could be caused by my recent ssb patches,
so I reverted all of them, but this didn't help.

At this point we decided to bisect. I was a little wary of the latest
merges, so we took the older 2.6.38 as GOOD (we tested it twice) and
the wireless-testing commit just before my ssb changes as BAD. Today
Juan finished bisecting the kernel:
http://pastebin.com/HSKbRzpB

According to his bisection, the first bad commit is
e06383db9ec591696a06654257474b85bac1f8cb [0]:
hrtimers: extend hrtimer base code to handle more then 2 clockids
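A bisection with these endpoints could look roughly like the following sketch. The exact BAD commit is not given in the report, so it is passed as an argument; the function names and the log file name are hypothetical, while `git bisect start/bad/good/log/reset` are standard git subcommands.

```shell
#!/bin/sh
# Sketch of the bisection described above, run from a wireless-testing
# checkout. Function names and the log file name are hypothetical.

# Mark the given wireless-testing commit bad and v2.6.38 good; git then
# checks out a commit roughly halfway between them.
bisect_start() {
    git bisect start
    git bisect bad "$1"
    git bisect good v2.6.38
}

# After building, booting and testing the checked-out commit, record the
# verdict. Repeat until git prints "<sha> is the first bad commit".
bisect_mark() {
    case "$1" in
        good|bad) git bisect "$1" ;;
        *) echo "usage: bisect_mark good|bad" >&2; return 1 ;;
    esac
}

# Save the bisection log (handy for sharing) and return to the
# original HEAD.
bisect_finish() {
    git bisect log > "${1:-bisect.log}"
    git bisect reset
}
```

Since every step needs a rebuild, a reboot and a manual Wi-Fi test, automating this with `git bisect run` is not straightforward, hence the manual `bisect_mark` step.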

Does this make any sense to you? Could this be a timing issue?

It was too late to test this today; we (Juan) will work on it
tomorrow. It's impossible to revert this commit on top of the
wireless-testing HEAD, so my plan is to check out the commit, test it,
revert it, and test again.
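The check-out/test/revert/test plan amounts to roughly this sketch. Reverting the commit directly on top of itself should always apply cleanly, which sidesteps the conflicts that make the revert impossible at the wireless-testing HEAD. The branch name `verify-e06383db` is a hypothetical choice; building, booting and testing between the two steps are left manual.

```shell
#!/bin/sh
# Sketch of the "checkout, test, revert, test" plan. The branch name is a
# hypothetical choice; build, boot and test steps are left manual.
SUSPECT=e06383db9ec591696a06654257474b85bac1f8cb

# Step 1: put the suspect commit on a throwaway branch, then build, boot
# and run the reproduction procedure (a bad commit should disconnect).
verify_checkout() {
    git checkout -b verify-e06383db "$SUSPECT"
}

# Step 2: revert only that commit. Since it is the branch tip, the revert
# applies cleanly. Rebuild, boot and test again; a stable connection
# points at the suspect commit.
verify_revert() {
    git revert --no-edit "$SUSPECT"
}
```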

Has anyone else experienced similar problems with the latest wireless-testing?


[0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e06383db9ec591696a06654257474b85bac1f8cb

--
Rafał


2011-05-10 21:48:54

by Rafał Miłecki

Subject: Re: [hrtimers] Re: Regression affecting b43 LP-PHY card

2011/5/10 Rafał Miłecki <[email protected]>:
> 2011/5/10 John Stultz <[email protected]>:
>> On Tue, 2011-05-10 at 20:57 +0200, Rafał Miłecki wrote:
>>> On 10 May 2011 00:52, Rafał Miłecki <[email protected]> wrote:
>>> > At this point we decided to bisect. I was a little wary of the latest
>>> > merges, so we took the older 2.6.38 as GOOD (we tested it twice) and
>>> > the wireless-testing commit just before my ssb changes as BAD. Today
>>> > Juan finished bisecting the kernel:
>>> > http://pastebin.com/HSKbRzpB
>>> >
>>> > According to his bisection, the first bad commit is
>>> > e06383db9ec591696a06654257474b85bac1f8cb [0]:
>>> > hrtimers: extend hrtimer base code to handle more then 2 clockids
>>> >
>>> > Does this make any sense to you? Could this be a timing issue?
>>> >
>>> > It was too late to test this today; we (Juan) will work on it
>>> > tomorrow. It's impossible to revert this commit on top of the
>>> > wireless-testing HEAD, so my plan is to check out the commit, test it,
>>> > revert it, and test again.
>>> >
>>> > Has anyone else experienced similar problems with the latest wireless-testing?
>>> >
>>> >
>>> > [0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e06383db9ec591696a06654257474b85bac1f8cb
>>>
>>> Today Juan checked out commit e06383db9ec591696a06654257474b85bac1f8cb
>>> and tested it. He was disconnected very soon.
>>>
>>> Then he reverted e06383db9ec591696a06654257474b85bac1f8cb and tested
>>> again. The connection was stable; he downloaded a 2GB file over the
>>> network.
>>>
>>>
>>> John S.: your commit does not touch the Broadcom card directly, but it
>>> seems to affect it somehow. I suspect there may be a timing issue.
>>> Do you have any idea what it could be, or how we can debug this?
>>
>> Sorry for the trouble!
>>
>> My commit exposed a few spots where hrtimers were being initialized
>> before hrtimer_init is called, which caused problems. Thomas provided a
>> solution that makes such behavior still function ok:
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4
>>
>> Let me know if the issue is still reproducible with Linus' latest git
>> tree.
>
> We were using the wireless-testing git tree, commit:
> 1e664a777e5eb4b23e65e76fbeadd2376fe8d8d8
>
> I cannot see ce31332d3c77532d6ea97ddcb475a2b02dd358b4 in git log.
> I'll apply it and test, thanks!

I can confirm that the updated wireless-testing resolves this issue!

Too bad Juan spent 2-3 days bisecting on his Atom... but at least we
arrived to find a fix already waiting. It could have been painful to
find a solution based only on our feedback; I'm aware it was not very
detailed.

Thanks for your help :)

--
Rafał

2011-05-10 18:31:16

by Larry Finger

Subject: Re: Regression affecting b43 LP-PHY card

On 05/09/2011 05:52 PM, Rafał Miłecki wrote:
> Juan owns a Lenovo affected by the well-known LP-PHY DMA errors. His
> testing procedure is as follows:
> modprobe wl; connect; download something small; rmmod wl;
> modprobe b43; download 2GB
>
> While working on the DMA errors we discovered that wireless-testing
> does not work well for him. Even after performing the above procedure,
> his machine disconnects quickly and he is not able to reconnect. We
> tested 2.6.39-rc6 from a tarball and it worked fine. I'd like to
> highlight that we switched between mainline and wireless-testing a few
> times, so it is not a random issue.
>
> I suspected this regression could be caused by my recent ssb patches,
> so I reverted all of them, but this didn't help.
>
> At this point we decided to bisect. I was a little wary of the latest
> merges, so we took the older 2.6.38 as GOOD (we tested it twice) and
> the wireless-testing commit just before my ssb changes as BAD. Today
> Juan finished bisecting the kernel:
> http://pastebin.com/HSKbRzpB
>
> According to his bisection, the first bad commit is
> e06383db9ec591696a06654257474b85bac1f8cb [0]:
> hrtimers: extend hrtimer base code to handle more then 2 clockids
>
> Does this make any sense to you? Could this be a timing issue?
>
> It was too late to test this today; we (Juan) will work on it
> tomorrow. It's impossible to revert this commit on top of the
> wireless-testing HEAD, so my plan is to check out the commit, test it,
> revert it, and test again.
>
> Has anyone else experienced similar problems with the latest wireless-testing?

I did some testing over the weekend using the LP-PHY device in my HP Mini 110
netbook. This one does not have any DMA issues, but b43 generates PHY
transmission errors and dies when I try to copy a file over my LAN. The source
material is contained on an NFS-mounted volume. The machine that exports the
volume is connected by wire to the router/switch. When I get a file from the
Internet, there are no problems. In the latter case, the transfer rate of the
download is up to 1.2 MB/s. I don't know what the peak rate is for the NFS copy
operation.

I was able to test kernels from the wireless-testing tree back to v2.6.36. All
behaved the same, thus my problem is not a regression.

Larry

2011-05-09 22:58:11

by Ben Greear

Subject: Re: Regression affecting b43 LP-PHY card

On 05/09/2011 03:52 PM, Rafał Miłecki wrote:

> Has anyone else experienced similar problems with the latest wireless-testing?

With the ath5k patch I posted, and the ath9k patches that Felix
posted recently in response to my bug reports, I've had good
luck with ath9k and ath5k.

I also pulled in the slub cmpxchg fix that fixes fatal bugs in
kernels compiled for anything earlier than Pentium-II processors.

Hopefully -rc7 will be out shortly (it will contain the slub
fix and, if we're lucky, the ath9k and ath5k fixes too).

I don't test with any other wifi NICs, just ath5k and ath9k.

I've mostly been testing virtual stations; I will crank up some
ath9k APs shortly.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2011-05-10 19:45:37

by Rafał Miłecki

Subject: Re: [hrtimers] Re: Regression affecting b43 LP-PHY card

2011/5/10 John Stultz <[email protected]>:
> On Tue, 2011-05-10 at 20:57 +0200, Rafał Miłecki wrote:
>> On 10 May 2011 00:52, Rafał Miłecki <[email protected]> wrote:
>> > At this point we decided to bisect. I was a little wary of the latest
>> > merges, so we took the older 2.6.38 as GOOD (we tested it twice) and
>> > the wireless-testing commit just before my ssb changes as BAD. Today
>> > Juan finished bisecting the kernel:
>> > http://pastebin.com/HSKbRzpB
>> >
>> > According to his bisection, the first bad commit is
>> > e06383db9ec591696a06654257474b85bac1f8cb [0]:
>> > hrtimers: extend hrtimer base code to handle more then 2 clockids
>> >
>> > Does this make any sense to you? Could this be a timing issue?
>> >
>> > It was too late to test this today; we (Juan) will work on it
>> > tomorrow. It's impossible to revert this commit on top of the
>> > wireless-testing HEAD, so my plan is to check out the commit, test it,
>> > revert it, and test again.
>> >
>> > Has anyone else experienced similar problems with the latest wireless-testing?
>> >
>> >
>> > [0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e06383db9ec591696a06654257474b85bac1f8cb
>>
>> Today Juan checked out commit e06383db9ec591696a06654257474b85bac1f8cb
>> and tested it. He was disconnected very soon.
>>
>> Then he reverted e06383db9ec591696a06654257474b85bac1f8cb and tested
>> again. The connection was stable; he downloaded a 2GB file over the
>> network.
>>
>>
>> John S.: your commit does not touch the Broadcom card directly, but it
>> seems to affect it somehow. I suspect there may be a timing issue.
>> Do you have any idea what it could be, or how we can debug this?
>
> Sorry for the trouble!
>
> My commit exposed a few spots where hrtimers were being initialized
> before hrtimer_init is called, which caused problems. Thomas provided a
> solution that makes such behavior still function ok:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4
>
> Let me know if the issue is still reproducible with Linus' latest git
> tree.

We were using the wireless-testing git tree, commit:
1e664a777e5eb4b23e65e76fbeadd2376fe8d8d8

I cannot see ce31332d3c77532d6ea97ddcb475a2b02dd358b4 in git log.
I'll apply it and test, thanks!
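Checking whether Thomas' fix is already in a tree, and picking it up from Linus' tree if not, could be sketched as follows. This assumes a git recent enough to support `git merge-base --is-ancestor`, and `LINUS` is the 2011-era git:// URL matching the gitweb link John posted, which may have moved since; treat both as assumptions. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: make sure Thomas' fix is in the current branch.
# LINUS is the 2011-era git:// URL matching the gitweb link;
# adjust it if the tree has moved.
FIX=ce31332d3c77532d6ea97ddcb475a2b02dd358b4
LINUS=git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

# Succeeds if FIX is an ancestor of HEAD. Fails (quietly) both when it is
# not and when the object is not present locally at all.
has_fix() {
    git merge-base --is-ancestor "$FIX" HEAD 2>/dev/null
}

# Fetch Linus' master (into FETCH_HEAD) so the commit object exists
# locally, then cherry-pick it onto the current branch.
apply_fix() {
    if has_fix; then
        echo "fix already present"
    else
        git fetch "$LINUS" master && git cherry-pick "$FIX"
    fi
}
```

Pulling an updated wireless-testing, as was done in the end here, has the same effect once the fix has been merged there.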

--
Rafał

2011-05-10 18:57:29

by Rafał Miłecki

Subject: [hrtimers] Re: Regression affecting b43 LP-PHY card

On 10 May 2011 00:52, Rafał Miłecki <[email protected]> wrote:
> Juan owns a Lenovo affected by the well-known LP-PHY DMA errors. His
> testing procedure is as follows:
> modprobe wl; connect; download something small; rmmod wl;
> modprobe b43; download 2GB
>
> While working on the DMA errors we discovered that wireless-testing
> does not work well for him. Even after performing the above procedure,
> his machine disconnects quickly and he is not able to reconnect. We
> tested 2.6.39-rc6 from a tarball and it worked fine. I'd like to
> highlight that we switched between mainline and wireless-testing a few
> times, so it is not a random issue.
>
> I suspected this regression could be caused by my recent ssb patches,
> so I reverted all of them, but this didn't help.
>
> At this point we decided to bisect. I was a little wary of the latest
> merges, so we took the older 2.6.38 as GOOD (we tested it twice) and
> the wireless-testing commit just before my ssb changes as BAD. Today
> Juan finished bisecting the kernel:
> http://pastebin.com/HSKbRzpB
>
> According to his bisection, the first bad commit is
> e06383db9ec591696a06654257474b85bac1f8cb [0]:
> hrtimers: extend hrtimer base code to handle more then 2 clockids
>
> Does this make any sense to you? Could this be a timing issue?
>
> It was too late to test this today; we (Juan) will work on it
> tomorrow. It's impossible to revert this commit on top of the
> wireless-testing HEAD, so my plan is to check out the commit, test it,
> revert it, and test again.
>
> Has anyone else experienced similar problems with the latest wireless-testing?
>
>
> [0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e06383db9ec591696a06654257474b85bac1f8cb

Today Juan checked out commit e06383db9ec591696a06654257474b85bac1f8cb
and tested it. He was disconnected very soon.

Then he reverted e06383db9ec591696a06654257474b85bac1f8cb and tested
again. The connection was stable; he downloaded a 2GB file over the
network.


John S.: your commit does not touch the Broadcom card directly, but it
seems to affect it somehow. I suspect there may be a timing issue.
Do you have any idea what it could be, or how we can debug this?

--
Rafał

2011-05-10 19:07:11

by John Stultz

Subject: Re: [hrtimers] Re: Regression affecting b43 LP-PHY card

On Tue, 2011-05-10 at 20:57 +0200, Rafał Miłecki wrote:
> On 10 May 2011 00:52, Rafał Miłecki <[email protected]> wrote:
> > At this point we decided to bisect. I was a little wary of the latest
> > merges, so we took the older 2.6.38 as GOOD (we tested it twice) and
> > the wireless-testing commit just before my ssb changes as BAD. Today
> > Juan finished bisecting the kernel:
> > http://pastebin.com/HSKbRzpB
> >
> > According to his bisection, the first bad commit is
> > e06383db9ec591696a06654257474b85bac1f8cb [0]:
> > hrtimers: extend hrtimer base code to handle more then 2 clockids
> >
> > Does this make any sense to you? Could this be a timing issue?
> >
> > It was too late to test this today; we (Juan) will work on it
> > tomorrow. It's impossible to revert this commit on top of the
> > wireless-testing HEAD, so my plan is to check out the commit, test it,
> > revert it, and test again.
> >
> > Has anyone else experienced similar problems with the latest wireless-testing?
> >
> >
> > [0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e06383db9ec591696a06654257474b85bac1f8cb
>
> Today Juan checked out commit e06383db9ec591696a06654257474b85bac1f8cb
> and tested it. He was disconnected very soon.
>
> Then he reverted e06383db9ec591696a06654257474b85bac1f8cb and tested
> again. The connection was stable; he downloaded a 2GB file over the
> network.
>
>
> John S.: your commit does not touch the Broadcom card directly, but it
> seems to affect it somehow. I suspect there may be a timing issue.
> Do you have any idea what it could be, or how we can debug this?

Sorry for the trouble!

My commit exposed a few spots where hrtimers were being initialized
before hrtimer_init is called, which caused problems. Thomas provided a
solution that makes such behavior still function ok:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4

Let me know if the issue is still reproducible with Linus' latest git
tree.

thanks
-john