2011-11-29 00:58:56

by Philipp Dreimann

Subject: rtlwifi, rtl8192se bug soft-lockup

Hello!

Since kernel v3.1, my system has been suffering from lock-ups because of
the rtl8192se driver. v3.0 is still fine.


[ 704.057088] Pid: 2112, comm: kworker/0:3 Not tainted
3.1.0-1-686-pae #1 ASUSTeK Computer INC. 1201T/1201T
[ 704.057120] EIP: 0060:[<c105cf7c>] EFLAGS: 00000297 CPU: 0
[ 704.057140] EIP is at do_raw_spin_lock+0x10/0x15
[ 704.057152] EAX: f4bbd188 EBX: f4bbd160 ECX: f4bbc4a8 EDX: 00009998
[ 704.057164] ESI: f4bbc320 EDI: 00000000 EBP: 00000100 ESP: f580dfc0
[ 704.057175] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 704.057190] Process kworker/0:3 (pid: 2112, ti=f580c000
task=f4b69660 task.ti=f49d8000)
[ 704.057200] Stack:
[ 704.057210] f84c5608 f4bbd308 f4bbd30c c103bac7 00000006 c13caa18
c103c05d c103c0f1
[ 704.057265] 00000001 0000000a 00000000 f49d9ebc f49d8000 c103c05d
00000046 c100ccb4
[ 704.057322] Call Trace:
[ 704.057352] [<f84c5608>] ? rtl_lps_leave+0xf/0xc4 [rtlwifi]
[ 704.057369] [<c103bac7>] ? tasklet_action+0x62/0xa5
[ 704.057383] [<c103c05d>] ? local_bh_enable+0x2/0x2
[ 704.057397] [<c103c0f1>] ? __do_softirq+0x94/0x12f
[ 704.057411] [<c103c05d>] ? local_bh_enable+0x2/0x2
[ 704.057420] <IRQ>
[ 704.057438] [<c103c2e2>] ? irq_exit+0x32/0x80
[ 704.057454] [<c100ca6e>] ? do_IRQ+0x65/0x76
[ 704.057468] [<c12b2a30>] ? common_interrupt+0x30/0x38
[ 704.057487] [<c1156283>] ? delay_tsc+0x1d/0x54
[ 704.057501] [<c115623b>] ? __delay+0x6/0x7
[ 704.057520] [<f86a4a15>] ? rtl92s_phy_set_rf_power_state+0x458/0x531 [rtl8192se]
[ 704.057543] [<f84c4fd1>] ? rtl_ps_set_rf_state+0xbd/0xc2 [rtlwifi]
[ 704.057566] [<f84c5973>] ? rtl_swlps_rf_sleep+0x6f/0x154 [rtlwifi]
[ 704.057587] [<f84c5a7b>] ? rtl_swlps_wq_callback+0x23/0x78 [rtlwifi]
[ 704.057603] [<c1049633>] ? process_one_work+0x112/0x1fa
[ 704.057624] [<f84c5a58>] ? rtl_swlps_rf_sleep+0x154/0x154 [rtlwifi]
[ 704.057638] [<c104a33e>] ? worker_thread+0xa9/0x122
[ 704.057653] [<c104a295>] ? manage_workers.isra.23+0x13d/0x13d
[ 704.057668] [<c104ca80>] ? kthread+0x63/0x68
[ 704.057683] [<c104ca1d>] ? kthread_worker_fn+0x101/0x101
[ 704.057696] [<c12b2a3e>] ? kernel_thread_helper+0x6/0x10
[ 704.057706] Code: c3 3e ff 08 79 05 e8 0c 9d 0f 00 c3 3e 81 28 00
00 10 00 74 05 e8 e1 9c 0f 00 c3 ba 00 01 00 00 3e 66 0f c1 10 38 f2
74 06 f3 90 <8a> 10 eb f6 c3 89 c2 0f b7 02 38 e0 8d 88 00 01 00 00 75
05 3e

One of the reasons for this lockup is most likely the change from
spin_lock_irq* to spin_lock, see
67fc6052a49b781efbcfc138f3b68fe79ddd0c2f and earlier.
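
The suspected failure mode can be sketched as a small user-space model. This is only an illustration of the general hazard that the trace suggests (a tasklet spinning on a lock that the interrupted task on the same CPU already holds); it is not driver code, and it makes no claim about what the referenced commit actually changes. All names (`model_cpu`, `model_lock_plain`, `model_lock_irqsave`, and so on) are hypothetical, and "spinning forever" is represented by a flag instead of a real busy loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model; none of these names exist in the kernel. */
struct model_cpu {
    bool lock_held;     /* stands in for the driver's power-save lock */
    bool irqs_enabled;  /* local interrupt state of this CPU */
    bool irq_pending;   /* an interrupt arrived while IRQs were masked */
    bool locked_up;     /* tasklet would spin on a lock its own CPU holds */
    int  tasklet_runs;  /* how often the tasklet body completed */
};

/* Tasklet body: needs the same lock as the process-context path. */
static void model_tasklet(struct model_cpu *cpu)
{
    if (cpu->lock_held) {
        /* The holder is the interrupted task on this CPU. It cannot run
         * again until we return, so a real spin would never end. */
        cpu->locked_up = true;
        return;
    }
    cpu->tasklet_runs++;
}

/* Deliver an interrupt: runs at once if IRQs are on, else is deferred. */
static void model_raise_irq(struct model_cpu *cpu)
{
    if (cpu->irqs_enabled)
        model_tasklet(cpu);
    else
        cpu->irq_pending = true;
}

/* Analogue of spin_lock(): takes the lock, leaves IRQs enabled. */
static void model_lock_plain(struct model_cpu *cpu)
{
    cpu->lock_held = true;
}

/* Analogue of spin_lock_irqsave(): masks IRQs, then takes the lock. */
static void model_lock_irqsave(struct model_cpu *cpu)
{
    cpu->irqs_enabled = false;
    cpu->lock_held = true;
}

/* Release the lock, unmask IRQs and replay anything that was deferred. */
static void model_unlock(struct model_cpu *cpu)
{
    assert(cpu->lock_held);  /* unlocking an unheld lock is a model bug */
    cpu->lock_held = false;
    cpu->irqs_enabled = true;
    if (cpu->irq_pending) {
        cpu->irq_pending = false;
        model_tasklet(cpu);
    }
}

/* Run "lock; interrupt arrives during the delay loop; unlock". */
static bool model_scenario_locks_up(bool irq_safe)
{
    struct model_cpu cpu = { .irqs_enabled = true };

    if (irq_safe)
        model_lock_irqsave(&cpu);
    else
        model_lock_plain(&cpu);
    model_raise_irq(&cpu);   /* arrives while the lock is held */
    model_unlock(&cpu);
    return cpu.locked_up;
}
```

Calling `model_scenario_locks_up(false)` models the plain-lock case, where the tasklet runs immediately and finds the lock taken; `model_scenario_locks_up(true)` models the IRQ-safe case, where the interrupt is deferred until the lock has been released.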

I will try the recently proposed patch for another problem from
Stanislaw, [PATCH v2] rtlwifi: fix lps_lock deadlock, to check if it
resolves my issue as well. But I think that Mike might be affected by
the change again...

BR,
Philipp


2011-11-29 02:16:45

by Larry Finger

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On 11/28/2011 06:58 PM, Philipp Dreimann wrote:
> Hello!
>
> Since kernel v3.1, my system has been suffering from lock-ups because of
> the rtl8192se driver. v3.0 is still fine.
>
>
> [ 704.057088] Pid: 2112, comm: kworker/0:3 Not tainted
> 3.1.0-1-686-pae #1 ASUSTeK Computer INC. 1201T/1201T
> [ 704.057120] EIP: 0060:[<c105cf7c>] EFLAGS: 00000297 CPU: 0
> [ 704.057140] EIP is at do_raw_spin_lock+0x10/0x15
> [ 704.057152] EAX: f4bbd188 EBX: f4bbd160 ECX: f4bbc4a8 EDX: 00009998
> [ 704.057164] ESI: f4bbc320 EDI: 00000000 EBP: 00000100 ESP: f580dfc0
> [ 704.057175] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [ 704.057190] Process kworker/0:3 (pid: 2112, ti=f580c000
> task=f4b69660 task.ti=f49d8000)
> [ 704.057200] Stack:
> [ 704.057210] f84c5608 f4bbd308 f4bbd30c c103bac7 00000006 c13caa18
> c103c05d c103c0f1
> [ 704.057265] 00000001 0000000a 00000000 f49d9ebc f49d8000 c103c05d
> 00000046 c100ccb4
> [ 704.057322] Call Trace:
> [ 704.057352] [<f84c5608>] ? rtl_lps_leave+0xf/0xc4 [rtlwifi]
> [ 704.057369] [<c103bac7>] ? tasklet_action+0x62/0xa5
> [ 704.057383] [<c103c05d>] ? local_bh_enable+0x2/0x2
> [ 704.057397] [<c103c0f1>] ? __do_softirq+0x94/0x12f
> [ 704.057411] [<c103c05d>] ? local_bh_enable+0x2/0x2
> [ 704.057420] <IRQ>
> [ 704.057438] [<c103c2e2>] ? irq_exit+0x32/0x80
> [ 704.057454] [<c100ca6e>] ? do_IRQ+0x65/0x76
> [ 704.057468] [<c12b2a30>] ? common_interrupt+0x30/0x38
> [ 704.057487] [<c1156283>] ? delay_tsc+0x1d/0x54
> [ 704.057501] [<c115623b>] ? __delay+0x6/0x7
> [ 704.057520] [<f86a4a15>] ? rtl92s_phy_set_rf_power_state+0x458/0x531 [rtl8192se]
> [ 704.057543] [<f84c4fd1>] ? rtl_ps_set_rf_state+0xbd/0xc2 [rtlwifi]
> [ 704.057566] [<f84c5973>] ? rtl_swlps_rf_sleep+0x6f/0x154 [rtlwifi]
> [ 704.057587] [<f84c5a7b>] ? rtl_swlps_wq_callback+0x23/0x78 [rtlwifi]
> [ 704.057603] [<c1049633>] ? process_one_work+0x112/0x1fa
> [ 704.057624] [<f84c5a58>] ? rtl_swlps_rf_sleep+0x154/0x154 [rtlwifi]
> [ 704.057638] [<c104a33e>] ? worker_thread+0xa9/0x122
> [ 704.057653] [<c104a295>] ? manage_workers.isra.23+0x13d/0x13d
> [ 704.057668] [<c104ca80>] ? kthread+0x63/0x68
> [ 704.057683] [<c104ca1d>] ? kthread_worker_fn+0x101/0x101
> [ 704.057696] [<c12b2a3e>] ? kernel_thread_helper+0x6/0x10
> [ 704.057706] Code: c3 3e ff 08 79 05 e8 0c 9d 0f 00 c3 3e 81 28 00
> 00 10 00 74 05 e8 e1 9c 0f 00 c3 ba 00 01 00 00 3e 66 0f c1 10 38 f2
> 74 06 f3 90 <8a> 10 eb f6 c3 89 c2 0f b7 02 38 e0 8d 88 00 01 00 00 75
> 05 3e
>
> One of the reasons for this lockup is most likely the change from
> spin_lock_irq* to spin_lock, see
> 67fc6052a49b781efbcfc138f3b68fe79ddd0c2f and earlier.
>
> I will try the recently proposed patch for another problem from
> Stanislaw, [PATCH v2] rtlwifi: fix lps_lock deadlock, to check if it
> resolves my issue as well. But I think that Mike might be affected by
> the change again...

From a quick look, Stanislaw's patch should fix your system. If not, then
please consider pulling a git tree and checking out commit 34ddb20, which is the
one before 67fc6052.

Larry

2011-12-07 21:09:44

by Larry Finger

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On 12/07/2011 02:47 PM, Philipp Dreimann wrote:
> On 7 December 2011 15:23, Larry Finger wrote:
>> On 12/07/2011 07:59 AM, Philipp Dreimann wrote:
>>>
>>> On 29 November 2011 00:16, Larry Finger wrote:
>>>>
>>>> On 11/28/2011 06:58 PM, Philipp Dreimann wrote:
>>>> From a quick look, Stanislaw's patch should fix your system. If not,
>>>> then please consider pulling a git tree and checking out commit
>>>> 34ddb20, which is the one before 67fc6052.
>>>
>>>
>>> It fixed the issue *but* I am currently back to kernel v3.0.3, as it
>>> is the most stable for me. I am not sure whether new issues were
>>> introduced by using a v3.2-rc or if there is more wrong in the
>>> rtl8192se driver itself. I had random sound and standby issues at
>>> which I will have a look some other day.
>>
>>
>> The bug that affected 3.2-rcX and was fixed by Stanislaw's patch was not
>> introduced until 3.1. A patch to fix it there was just queued by GregKH.
>
> I had Stanislaw's patch included.
>
>>> Another idea about the problem:
>>> I omitted for some reason the following line in the first email about
>>> the problem:
>>> [ 732.056049] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:2112]
>>
>>
>> That was a serious omission.
>
> Yes.
>
>>> While looking at the Call Trace and the code I have no idea why
>>> rtl92s_phy_set_rf_power_state needs that much time for the ERFSLEEP
>>> operation. I suspected an issue in the loop but did not find it so
>>> far.
>>
>>
>> With a modern CPU, no loop can take 22s unless it involves a spin lock that
>> never is released.
>
> Yes, and it should not, as the loop has the lock!
>
> Putting things together:
>
> - Stanislaw's patch prevents the issue from occurring by using the
> IRQ-safe spin lock.
> This is kind of a revert of
> 312d5479dcfaca2b8aa451201b5388fdb8c8684a (I did not check
> everything!).
>
> - The loop-issue is still around but won't be noticed unless the
> delayed execution of rtl_lps_leave() has side-effects..
>
> - As it took up to an hour to hit the issue, I suspect that there is
> something else going wrong which interferes with the loop...

I ran for almost 36 hours and never hit the issue. I have no idea what the
difference is between our two systems. In fact, I have never hit the "stalled"
CPU issue, even with the bug that Stanislaw fixed. You will need to do the testing.

Larry

2011-12-08 17:26:31

by Larry Finger

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On 12/08/2011 03:52 AM, Stanislaw Gruszka wrote:
> On Wed, Dec 07, 2011 at 06:47:58PM -0200, Philipp Dreimann wrote:
>> No, this was not posted so far. I will try to debug the loop issue
>> soonish. The outlined idea above only prevents the issue without
>> knowing what is happening.
>
> I looked at it a bit more and realized that we can replace the spinlock
> with a mutex. This should fix the remaining problems here, and hopefully
> does not introduce any others. Could you test the two attached patches,
> and if they do not crash immediately :-) retest with
> CONFIG_LOCKDEP?

After about 1 hour of testing, I see no problems and no lockdep warnings.

BTW, patch #2 did not apply to wireless-testing as your previous patch changing
the locking in ps.c has already been applied. Fortunately, it was not much of a
problem to get it applied using wiggle.

Larry


2011-12-07 17:23:53

by Larry Finger

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On 12/07/2011 07:59 AM, Philipp Dreimann wrote:
> On 29 November 2011 00:16, Larry Finger wrote:
>> On 11/28/2011 06:58 PM, Philipp Dreimann wrote:
>> From a quick look, Stanislaw's patch should fix your system. If not, then
>> please consider pulling a git tree and checking out commit 34ddb20, which is
>> the one before 67fc6052.
>
> It fixed the issue *but* I am currently back to kernel v3.0.3, as it
> is the most stable for me. I am not sure whether new issues were
> introduced by using a v3.2-rc or if there is more wrong in the
> rtl8192se driver itself. I had random sound and standby issues at
> which I will have a look some other day.

The bug that affected 3.2-rcX and was fixed by Stanislaw's patch was not
introduced until 3.1. A patch to fix it there was just queued by GregKH.

> Another idea about the problem:
> I omitted for some reason the following line in the first email about
> the problem:
> [ 732.056049] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:2112]

That was a serious omission.

> While looking at the Call Trace and the code I have no idea why
> rtl92s_phy_set_rf_power_state needs that much time for the ERFSLEEP
> operation. I suspected an issue in the loop but did not find it so
> far.

With a modern CPU, no loop can take 22s unless it involves a spin lock that
never is released.

> Another solution which I tested was the following:
> 0. rtl_lps_leave() tells rtl92s_phy_set_rf_power_state(), while it is
> in the ERFSLEEP case loop, that it needs the lock.
> 1. rtl92s_phy_set_rf_power_state() notices this and returns false
> (leaving the loop and the function as if the action had failed), and
> the lock is released.
>
> This seemed to work fine as well. - But I am not sure what this might
> break for others...

Is this the same patch that you posted on the linux-wireless ML? Although I have
not heard back from Chaoming, I formatted that patch correctly and submitted it
to John Linville yesterday with the notation that it should be applied to the
stable kernels.

Larry

2011-12-07 20:48:02

by Philipp Dreimann

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On 7 December 2011 15:23, Larry Finger wrote:
> On 12/07/2011 07:59 AM, Philipp Dreimann wrote:
>>
>> On 29 November 2011 00:16, Larry Finger wrote:
>>>
>>> On 11/28/2011 06:58 PM, Philipp Dreimann wrote:
>>> From a quick look, Stanislaw's patch should fix your system. If not,
>>> then please consider pulling a git tree and checking out commit
>>> 34ddb20, which is the one before 67fc6052.
>>
>>
>> It fixed the issue *but* I am currently back to kernel v3.0.3, as it
>> is the most stable for me. I am not sure whether new issues were
>> introduced by using a v3.2-rc or if there is more wrong in the
>> rtl8192se driver itself. I had random sound and standby issues at
>> which I will have a look some other day.
>
>
> The bug that affected 3.2-rcX and was fixed by Stanislaw's patch was not
> introduced until 3.1. A patch to fix it there was just queued by GregKH.

I had Stanislaw's patch included.

>> Another idea about the problem:
>> I omitted for some reason the following line in the first email about
>> the problem:
>> [ 732.056049] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:2112]
>
>
> That was a serious omission.

Yes.

>> While looking at the Call Trace and the code I have no idea why
>> rtl92s_phy_set_rf_power_state needs that much time for the ERFSLEEP
>> operation. I suspected an issue in the loop but did not find it so
>> far.
>
>
> With a modern CPU, no loop can take 22s unless it involves a spin lock that
> never is released.

Yes, and it should not, as the loop has the lock!

Putting things together:

- Stanislaw's patch prevents the issue from occurring by using the
IRQ-safe spin lock.
This is kind of a revert of
312d5479dcfaca2b8aa451201b5388fdb8c8684a (I did not check
everything!).

- The loop-issue is still around but won't be noticed unless the
delayed execution of rtl_lps_leave() has side-effects..

- As it took up to an hour to hit the issue, I suspect that there is
something else going wrong which interferes with the loop...

>> Another solution which I tested was the following:
>> 0. rtl_lps_leave() tells rtl92s_phy_set_rf_power_state(), while it is
>> in the ERFSLEEP case loop, that it needs the lock.
>> 1. rtl92s_phy_set_rf_power_state() notices this and returns false
>> (leaving the loop and the function as if the action had failed), and
>> the lock is released.
>>
>> This seemed to work fine as well. - But I am not sure what this might
>> break for others...
>
> Is this the same patch that you posted on the linux-wireless ML?

No, this was not posted so far. I will try to debug the loop issue
soonish. The outlined idea above only prevents the issue without
knowing what is happening.

> Although I
> have not heard back from Chaoming, I formatted that patch correctly and
> submitted it to John Linville yesterday with the notation that it should be
> applied to the stable kernels.

Thanks. The comment in v2 is fine now.

2011-12-07 13:59:52

by Philipp Dreimann

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On 29 November 2011 00:16, Larry Finger wrote:
> On 11/28/2011 06:58 PM, Philipp Dreimann wrote:
> From a quick look, Stanislaw's patch should fix your system. If not, then
> please consider pulling a git tree and checking out commit 34ddb20, which is
> the one before 67fc6052.

It fixed the issue *but* I am currently back to kernel v3.0.3, as it
is the most stable for me. I am not sure whether new issues were
introduced by using a v3.2-rc or if there is more wrong in the
rtl8192se driver itself. I had random sound and standby issues at
which I will have a look some other day.

Another idea about the problem:
I omitted for some reason the following line in the first email about
the problem:
[ 732.056049] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:2112]

While looking at the Call Trace and the code I have no idea why
rtl92s_phy_set_rf_power_state needs that much time for the ERFSLEEP
operation. I suspected an issue in the loop but did not find it so
far.

Another solution which I tested was the following:
0. rtl_lps_leave() tells rtl92s_phy_set_rf_power_state(), while it is
in the ERFSLEEP case loop, that it needs the lock.
1. rtl92s_phy_set_rf_power_state() notices this and returns false
(leaving the loop and the function as if the action had failed), and
the lock is released.

This seemed to work fine as well. - But I am not sure what this might
break for others...
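
The two steps above can be sketched in user space roughly as follows. This is a hedged illustration of the idea only, not the patch that was tested: the struct, the function names and the poll counters are all hypothetical, and the hardware is reduced to "becomes idle after N polls".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state; the names are illustrative, not the driver's. */
struct ps_model {
    atomic_bool lock_wanted; /* set by the path that needs the lock */
    int polls_done;          /* loop iterations the sleep path completed */
};

static void model_init(struct ps_model *ps)
{
    atomic_init(&ps->lock_wanted, false);
    ps->polls_done = 0;
}

/* Step 0: the "leave power save" path announces that it needs the lock. */
static void model_request_lock(struct ps_model *ps)
{
    atomic_store(&ps->lock_wanted, true);
}

/*
 * Step 1: the sleep path polls the modelled hardware up to max_polls
 * times while holding the lock, but gives up early if the other path
 * has asked for the lock.
 *
 * hw_ready_after: poll index at which the hardware reports idle.
 * interrupt_at:   poll index at which the other path asks for the lock
 *                 (negative means it never does).
 *
 * Returns true if the state change completed, false if it bailed out
 * or ran out of polls.
 */
static bool model_rf_sleep(struct ps_model *ps, int max_polls,
                           int hw_ready_after, int interrupt_at)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if (i == interrupt_at)
            model_request_lock(ps);   /* simulated tasklet arrival */
        if (atomic_load(&ps->lock_wanted))
            return false;             /* behave as if the action failed */
        ps->polls_done++;
        if (i >= hw_ready_after)
            return true;              /* hardware went idle */
    }
    return false;                     /* ran out of polls */
}
```

In this model the caller would release the lock after `model_rf_sleep()` returns false, which lets the waiting path proceed. What the sketch cannot show is the open question raised above, namely whether reporting a failed RF state change has side effects elsewhere in the driver.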

2011-12-08 09:52:25

by Stanislaw Gruszka

Subject: Re: rtlwifi, rtl8192se bug soft-lockup

On Wed, Dec 07, 2011 at 06:47:58PM -0200, Philipp Dreimann wrote:
> No, this was not posted so far. I will try to debug the loop issue
> soonish. The outlined idea above only prevents the issue without
> knowing what is happening.

I looked at it a bit more and realized that we can replace the spinlock
with a mutex. This should fix the remaining problems here, and hopefully
does not introduce any others. Could you test the two attached patches,
and if they do not crash immediately :-) retest with
CONFIG_LOCKDEP?
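
For readers without the attachments, the general pattern can be sketched in user space as below. This is not the content of the attached patches, which are not reproduced in this thread; it only illustrates the usual reasoning that a sleeping lock such as a mutex may not be taken from interrupt or tasklet context, so that context queues work and a worker in process context takes the mutex. Every name in the sketch (`model_ctx`, `model_irq_requests_lps_leave`, `model_run_worker`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_WORK 8

struct model_ctx;
typedef void (*model_work_fn)(struct model_ctx *);

/* Hypothetical model; names do not correspond to kernel or driver symbols. */
struct model_ctx {
    bool in_atomic;        /* true while modelling interrupt context */
    bool mutex_held;       /* stands in for a single power-save mutex */
    bool slept_in_atomic;  /* set if the mutex was taken in atomic context */
    int  lps_leave_done;   /* how often the deferred work actually ran */
    model_work_fn queue[MODEL_MAX_WORK]; /* pending deferred work */
    size_t queued;
};

static void model_mutex_lock(struct model_ctx *c)
{
    if (c->in_atomic)
        c->slept_in_atomic = true;  /* would be a bug in real code */
    assert(!c->mutex_held);         /* single-threaded model: no contention */
    c->mutex_held = true;
}

static void model_mutex_unlock(struct model_ctx *c)
{
    assert(c->mutex_held);
    c->mutex_held = false;
}

/* The deferred work: runs in process context, so it may sleep. */
static void model_lps_leave_work(struct model_ctx *c)
{
    model_mutex_lock(c);
    c->lps_leave_done++;
    model_mutex_unlock(c);
}

/* Interrupt side: only queues work and never touches the mutex. */
static bool model_irq_requests_lps_leave(struct model_ctx *c)
{
    bool queued = false;

    c->in_atomic = true;
    if (c->queued < MODEL_MAX_WORK) {
        c->queue[c->queued++] = model_lps_leave_work;
        queued = true;
    }
    c->in_atomic = false;
    return queued;
}

/* Worker side: process context, drains the queue in order. */
static void model_run_worker(struct model_ctx *c)
{
    for (size_t i = 0; i < c->queued; i++)
        c->queue[i](c);
    c->queued = 0;
}
```

The point of the split is that the interrupt side returns immediately instead of spinning, and the lock is only ever taken where sleeping is allowed.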

Thanks
Stanislaw


Attachments:
0001-rtlwifi-use-work-for-lps.patch (3.55 kB)
0002-rtlwifi-merge-ips-lps-spinlocks-into-one-mutex.patch (4.11 kB)