2019-05-31 21:57:57

by Soeren Moch

[permalink] [raw]
Subject: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"

This reverts commit ed194d1367698a0872a2b75bbe06b3932ce9df3a.

In contrast to the original patch description, apparently not all handlers
were audited properly. E.g. my RT5370 based USB WIFI adapter (driver in
drivers/net/wireless/ralink/rt2x00) hangs after a while under heavy load.
This revert fixes this.

Also revert the follow-up patch d6142b91e9cc249b3aa22c90fade67e2e2d52cdb
("usb: core: remove flags variable in __usb_hcd_giveback_urb()"), since now
we need the flags variable again.

Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected] # 4.20+
Signed-off-by: Soeren Moch <[email protected]>
---
drivers/usb/core/hcd.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 94d22551fc1b..08d25fcf8b8e 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1739,6 +1739,7 @@ static void __usb_hcd_giveback_urb(struct urb *urb)
struct usb_hcd *hcd = bus_to_hcd(urb->dev->bus);
struct usb_anchor *anchor = urb->anchor;
int status = urb->unlinked;
+ unsigned long flags;

urb->hcpriv = NULL;
if (unlikely((urb->transfer_flags & URB_SHORT_NOT_OK) &&
@@ -1755,7 +1756,20 @@ static void __usb_hcd_giveback_urb(struct urb *urb)

/* pass ownership to the completion handler */
urb->status = status;
+
+ /*
+ * We disable local IRQs here avoid possible deadlock because
+ * drivers may call spin_lock() to hold lock which might be
+ * acquired in one hard interrupt handler.
+ *
+ * The local_irq_save()/local_irq_restore() around complete()
+ * will be removed if current USB drivers have been cleaned up
+ * and no one may trigger the above deadlock situation when
+ * running complete() in tasklet.
+ */
+ local_irq_save(flags);
urb->complete(urb);
+ local_irq_restore(flags);

usb_anchor_resume_wakeups(anchor);
atomic_dec(&urb->use_count);
--
2.17.1


2019-05-31 22:07:18

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"

On Fri, May 31, 2019 at 11:53:40PM +0200, Soeren Moch wrote:
> This reverts commit ed194d1367698a0872a2b75bbe06b3932ce9df3a.
>
> In contrast to the original patch description, apparently not all handlers
> were audited properly. E.g. my RT5370 based USB WIFI adapter (driver in
> drivers/net/wireless/ralink/rt2x00) hangs after a while under heavy load.
> This revert fixes this.

Why not just fix that driver? Wouldn't that be easier?

thanks,

greg k-h

2019-05-31 23:05:54

by Soeren Moch

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"



On 01.06.19 00:05, Greg Kroah-Hartman wrote:
> On Fri, May 31, 2019 at 11:53:40PM +0200, Soeren Moch wrote:
>> This reverts commit ed194d1367698a0872a2b75bbe06b3932ce9df3a.
>>
>> In contrast to the original patch description, apparently not all handlers
>> were audited properly. E.g. my RT5370 based USB WIFI adapter (driver in
>> drivers/net/wireless/ralink/rt2x00) hangs after a while under heavy load.
>> This revert fixes this.
> Why not just fix that driver? Wouldn't that be easier?
>
I suspect there are more drivers to fix. I only tested WIFI sticks so
far, RTL8188 drivers also seem to suffer from this. I'm not sure how to
fix all this properly, maybe Sebastian as original patch author can help
here.
This patch is mostly for -stable, to get an acceptable solution quickly.
It was really annoying to get such unstable WIFI connection over the
last three kernel releases to my development board.  Since my internet
service provider forcefully updated my router box 3 weeks ago, I
unfortunately see the same symptoms on my primary internet access.
That's even worse, I need to reset this router box every few days. I'm
not sure, however, that this is caused by the same problem, but it feels
like this.
So can we please fix this regression quickly and workout a proper fix
later? In the original patch there is no reason given, why this patch is
necessary. With this revert I at least see a stable connection.

Thanks,
Soeren

Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"

On 2019-06-01 01:02:37 [+0200], Soeren Moch wrote:
> > Why not just fix that driver? Wouldn't that be easier?
> >
> I suspect there are more drivers to fix. I only tested WIFI sticks so
> far, RTL8188 drivers also seem to suffer from this. I'm not sure how to
> fix all this properly, maybe Sebastian as original patch author can help
> here.

Suspecting isn't helping here.

> This patch is mostly for -stable, to get an acceptable solution quickly.
> It was really annoying to get such unstable WIFI connection over the
> last three kernel releases to my development board.  Since my internet
> service provider forcefully updated my router box 3 weeks ago, I
> unfortunately see the same symptoms on my primary internet access.
> That's even worse, I need to reset this router box every few days. I'm
> not sure, however, that this is caused by the same problem, but it feels
> like this.
> So can we please fix this regression quickly and workout a proper fix
> later? In the original patch there is no reason given, why this patch is
> necessary. With this revert I at least see a stable connection.

I will look into this. This patch got in in v4.20-rc1 and the final
kernel was released by the end of 2018. This is the first report I am
aware of over half year later…

> Thanks,
> Soeren

Sebastian

Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"

On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
> I will look into this.

nothing obvious. If there is really blocken lock, could you please
enable lockdep
|CONFIG_LOCK_DEBUGGING_SUPPORT=y
|CONFIG_PROVE_LOCKING=y
|# CONFIG_LOCK_STAT is not set
|CONFIG_DEBUG_RT_MUTEXES=y
|CONFIG_DEBUG_SPINLOCK=y
|CONFIG_DEBUG_MUTEXES=y
|CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
|CONFIG_DEBUG_RWSEMS=y
|CONFIG_DEBUG_LOCK_ALLOC=y
|CONFIG_LOCKDEP=y
|# CONFIG_DEBUG_LOCKDEP is not set
|CONFIG_DEBUG_ATOMIC_SLEEP=y

and send me the splat that lockdep will report?

> > Thanks,
> > Soeren

Sebastian

2019-06-03 12:48:51

by Soeren Moch

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"



On 01.06.19 12:50, Sebastian Andrzej Siewior wrote:
> On 2019-06-01 01:02:37 [+0200], Soeren Moch wrote:
>>> Why not just fix that driver? Wouldn't that be easier?
>>>
>> I suspect there are more drivers to fix. I only tested WIFI sticks so
>> far, RTL8188 drivers also seem to suffer from this. I'm not sure how to
>> fix all this properly, maybe Sebastian as original patch author can help
>> here.
> Suspecting isn't helping here.
Yes, you are right.
When I encountered this problem half a year ago, I tried some other type
of wifi stick
and immediately ran into trouble with this, too.

Now I did a short test with a RTL8188CUS stick, I could not reproduce
this bug with
this stick so far.

>
>> This patch is mostly for -stable, to get an acceptable solution quickly.
>> It was really annoying to get such unstable WIFI connection over the
>> last three kernel releases to my development board.  Since my internet
>> service provider forcefully updated my router box 3 weeks ago, I
>> unfortunately see the same symptoms on my primary internet access.
>> That's even worse, I need to reset this router box every few days. I'm
>> not sure, however, that this is caused by the same problem, but it feels
>> like this.
>> So can we please fix this regression quickly and workout a proper fix
>> later? In the original patch there is no reason given, why this patch is
>> necessary. With this revert I at least see a stable connection.
> I will look into this. This patch got in in v4.20-rc1 and the final
> kernel was released by the end of 2018. This is the first report I am
> aware of over half year later…
It is not easy to reproduce this bug reliably within a reasonable period
of time. It
took days to bisect this. And usb core code was for sure not my first
candidate
to look at.

Reverting this patch solves the problem, but of course disabling interrupts
also can hide a bug elsewhere.

Soeren

2019-06-03 12:52:29

by Soeren Moch

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"



On 01.06.19 13:02, Sebastian Andrzej Siewior wrote:
> On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
>> I will look into this.
> nothing obvious. If there is really blocken lock, could you please
> enable lockdep
> |CONFIG_LOCK_DEBUGGING_SUPPORT=y
> |CONFIG_PROVE_LOCKING=y
> |# CONFIG_LOCK_STAT is not set
> |CONFIG_DEBUG_RT_MUTEXES=y
> |CONFIG_DEBUG_SPINLOCK=y
> |CONFIG_DEBUG_MUTEXES=y
> |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> |CONFIG_DEBUG_RWSEMS=y
> |CONFIG_DEBUG_LOCK_ALLOC=y
> |CONFIG_LOCKDEP=y
> |# CONFIG_DEBUG_LOCKDEP is not set
> |CONFIG_DEBUG_ATOMIC_SLEEP=y
>
> and send me the splat that lockdep will report?
>
I will do so. I cannot promise, however, that this will happen within
the next few days.

Thanks for your suggestions,
Soeren

2019-06-12 14:46:26

by Soeren Moch

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"



On 01.06.19 13:02, Sebastian Andrzej Siewior wrote:
> On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
>> I will look into this.
>
> nothing obvious. If there is really blocken lock, could you please
> enable lockdep
> |CONFIG_LOCK_DEBUGGING_SUPPORT=y
> |CONFIG_PROVE_LOCKING=y
> |# CONFIG_LOCK_STAT is not set
> |CONFIG_DEBUG_RT_MUTEXES=y
> |CONFIG_DEBUG_SPINLOCK=y
> |CONFIG_DEBUG_MUTEXES=y
> |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> |CONFIG_DEBUG_RWSEMS=y
> |CONFIG_DEBUG_LOCK_ALLOC=y
> |CONFIG_LOCKDEP=y
> |# CONFIG_DEBUG_LOCKDEP is not set
> |CONFIG_DEBUG_ATOMIC_SLEEP=y
>
> and send me the splat that lockdep will report?
>

Nothing interesting:

[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 5.1.0 (root@matrix) (gcc version 7.4.0
(Debian 7.4.0-6)) #6 SMP PREEMPT Wed Jun 12 11:28:41 CEST 2019
[ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7),
cr=10c5387d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing
instruction cache
[ 0.000000] OF: fdt: Machine model: TBS2910 Matrix ARM mini PC
...
[ 0.000000] rcu: Preemptible hierarchical RCU implementation.
[ 0.000000] rcu: RCU lockdep checking is enabled.
...
[ 0.003546] Lock dependency validator: Copyright (c) 2006 Red Hat,
Inc., Ingo Molnar
[ 0.003657] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 0.003713] ... MAX_LOCK_DEPTH: 48
[ 0.003767] ... MAX_LOCKDEP_KEYS: 8191
[ 0.003821] ... CLASSHASH_SIZE: 4096
[ 0.003876] ... MAX_LOCKDEP_ENTRIES: 32768
[ 0.003931] ... MAX_LOCKDEP_CHAINS: 65536
[ 0.003986] ... CHAINHASH_SIZE: 32768
[ 0.004042] memory used by lock dependency info: 5243 kB

Nothing else.

When stopping hostapd after it hangs:
[ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue
14 failed to flush

Soeren

2019-06-12 18:03:27

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"

On Wed, 12 Jun 2019, Soeren Moch wrote:

> On 01.06.19 13:02, Sebastian Andrzej Siewior wrote:
> > On 2019-06-01 12:50:08 [+0200], To Soeren Moch wrote:
> >> I will look into this.
> >
> > nothing obvious. If there is really blocken lock, could you please
> > enable lockdep
> > |CONFIG_LOCK_DEBUGGING_SUPPORT=y
> > |CONFIG_PROVE_LOCKING=y
> > |# CONFIG_LOCK_STAT is not set
> > |CONFIG_DEBUG_RT_MUTEXES=y
> > |CONFIG_DEBUG_SPINLOCK=y
> > |CONFIG_DEBUG_MUTEXES=y
> > |CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> > |CONFIG_DEBUG_RWSEMS=y
> > |CONFIG_DEBUG_LOCK_ALLOC=y
> > |CONFIG_LOCKDEP=y
> > |# CONFIG_DEBUG_LOCKDEP is not set
> > |CONFIG_DEBUG_ATOMIC_SLEEP=y
> >
> > and send me the splat that lockdep will report?
> >
>
> Nothing interesting:
>
> [ 0.000000] Booting Linux on physical CPU 0x0
> [ 0.000000] Linux version 5.1.0 (root@matrix) (gcc version 7.4.0
> (Debian 7.4.0-6)) #6 SMP PREEMPT Wed Jun 12 11:28:41 CEST 2019
> [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7),
> cr=10c5387d
> [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing
> instruction cache
> [ 0.000000] OF: fdt: Machine model: TBS2910 Matrix ARM mini PC
> ...
> [ 0.000000] rcu: Preemptible hierarchical RCU implementation.
> [ 0.000000] rcu: RCU lockdep checking is enabled.
> ...
> [ 0.003546] Lock dependency validator: Copyright (c) 2006 Red Hat,
> Inc., Ingo Molnar
> [ 0.003657] ... MAX_LOCKDEP_SUBCLASSES: 8
> [ 0.003713] ... MAX_LOCK_DEPTH: 48
> [ 0.003767] ... MAX_LOCKDEP_KEYS: 8191
> [ 0.003821] ... CLASSHASH_SIZE: 4096
> [ 0.003876] ... MAX_LOCKDEP_ENTRIES: 32768
> [ 0.003931] ... MAX_LOCKDEP_CHAINS: 65536
> [ 0.003986] ... CHAINHASH_SIZE: 32768
> [ 0.004042] memory used by lock dependency info: 5243 kB
>
> Nothing else.
>
> When stopping hostapd after it hangs:
> [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue
> 14 failed to flush

Instead of reverting the original commit, can you prevent the problem
by adding local_irq_save() and local_irq_restore() to the URB
completion routines in that wireless driver?

Probably people who aren't already pretty familiar with the driver code
won't easily be able to locate the race. Still, a little overkill may
be an acceptable solution.

Alan Stern

Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"

On 2019-06-12 10:38:11 [-0400], Alan Stern wrote:
> >
> > When stopping hostapd after it hangs:
> > [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue
> > 14 failed to flush
>
> Instead of reverting the original commit, can you prevent the problem
> by adding local_irq_save() and local_irq_restore() to the URB
> completion routines in that wireless driver?
>
> Probably people who aren't already pretty familiar with the driver code
> won't easily be able to locate the race. Still, a little overkill may
> be an acceptable solution.

Not from RT point of view because you do
local_irq_save() -> spin_lock()

but it might be worth checking if the local_irq_save() within that
wireless driver avoids the race or not.

> Alan Stern

Sebastian

2019-06-17 10:14:54

by Soeren Moch

[permalink] [raw]
Subject: Re: [PATCH] Revert "usb: core: remove local_irq_save() around ->complete() handler"



On 12.06.19 16:47, Sebastian Andrzej Siewior wrote:
> On 2019-06-12 10:38:11 [-0400], Alan Stern wrote:
>>> When stopping hostapd after it hangs:
>>> [ 903.504475] ieee80211 phy0: rt2x00queue_flush_queue: Warning - Queue
>>> 14 failed to flush
>> Instead of reverting the original commit, can you prevent the problem
>> by adding local_irq_save() and local_irq_restore() to the URB
>> completion routines in that wireless driver?
>>
>> Probably people who aren't already pretty familiar with the driver code
>> won't easily be able to locate the race. Still, a little overkill may
>> be an acceptable solution.
> Not from RT point of view because you do
> local_irq_save() -> spin_lock()
>
> but it might be worth checking if the local_irq_save() within that
> wireless driver avoids the race or not.
I just sent a patch for the underlying problem in the rt2x00 driver, see
[1].
So we can drop this patch here and keep RT folks happy.
I really hope that no other usb drivers are affected by similar problems.

Zyxel support just sent me (some?) source code for the firmware of my
internet router (my real problem I mentioned before). Need to look into
this...

Thanks for your help,
Soeren


[1] https://lkml.org/lkml/2019/6/17/238
>> Alan Stern
> Sebastian