2019-02-27 22:20:09

by Daniel Kurtz

[permalink] [raw]
Subject: [PATCH] xhci: use iopoll for xhci_handshake

In cases such as xhci_abort_cmd_ring(), xhci_handshake() is called with
a spin lock held (and local interrupts disabled) with a huge 5 second
timeout. This can translates to 5 million calls to udelay(1). By its
very nature, udelay() is not meant to be precise, it only guarantees to
delay a minimum of 1 microsecond. Therefore the actual delay of
xhci_handshake() can be significantly longer. If the average udelay(1)
is greater than 2.2 us, the total time in xhci_handshake() - with
interrupts disabled can be > 11 seconds triggering the kernel's soft lockup
detector.

To avoid this, let's replace the open coded io polling loop with one from
iopoll.h that uses a loop timed with the more presumably reliable ktime
infrastructure.

Signed-off-by: Daniel Kurtz <[email protected]>
---
drivers/usb/host/xhci.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 005e65922608e..fde5ff84ba69b 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -16,6 +16,7 @@
#include <linux/slab.h>
#include <linux/dmi.h>
#include <linux/dma-mapping.h>
+#include <linux/iopoll.h>

#include "xhci.h"
#include "xhci-trace.h"
@@ -69,18 +70,18 @@ static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, int usec)
{
u32 result;
+ int ret;

- do {
- result = readl(ptr);
- if (result == ~(u32)0) /* card removed */
- return -ENODEV;
- result &= mask;
- if (result == done)
- return 0;
- udelay(1);
- usec--;
- } while (usec > 0);
- return -ETIMEDOUT;
+ ret = readl_poll_timeout_atomic(ptr, result,
+ ((result & mask) == done || (result == ~(u32)0)),
+ 1, usec);
+ if (ret)
+ return ret;
+
+ if (result == ~(u32)0) /* card removed */
+ return -ENODEV;
+
+ return 0;
}

/*
--
2.21.0.rc2.261.ga7da99ff1b-goog



2019-02-28 08:00:35

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] xhci: use iopoll for xhci_handshake

On Wed, Feb 27, 2019 at 03:19:17PM -0700, Daniel Kurtz wrote:
> In cases such as xhci_abort_cmd_ring(), xhci_handshake() is called with
> a spin lock held (and local interrupts disabled) with a huge 5 second
> timeout. This can translates to 5 million calls to udelay(1). By its
> very nature, udelay() is not meant to be precise, it only guarantees to
> delay a minimum of 1 microsecond. Therefore the actual delay of
> xhci_handshake() can be significantly longer. If the average udelay(1)
> is greater than 2.2 us, the total time in xhci_handshake() - with
> interrupts disabled can be > 11 seconds triggering the kernel's soft lockup
> detector.
>
> To avoid this, let's replace the open coded io polling loop with one from
> iopoll.h that uses a loop timed with the more presumably reliable ktime
> infrastructure.
>
> Signed-off-by: Daniel Kurtz <[email protected]>

Looks sane to me, nice fixup.

Reviewed-by: Greg Kroah-Hartman <[email protected]>

Is this causing problems on older kernels/devices today such that we
should backport this?

thanks,

greg k-h

2019-02-28 08:55:44

by Mathias Nyman

[permalink] [raw]
Subject: Re: [PATCH] xhci: use iopoll for xhci_handshake

On 28.2.2019 9.09, Greg Kroah-Hartman wrote:
> On Wed, Feb 27, 2019 at 03:19:17PM -0700, Daniel Kurtz wrote:
>> In cases such as xhci_abort_cmd_ring(), xhci_handshake() is called with
>> a spin lock held (and local interrupts disabled) with a huge 5 second
>> timeout. This can translates to 5 million calls to udelay(1). By its
>> very nature, udelay() is not meant to be precise, it only guarantees to
>> delay a minimum of 1 microsecond. Therefore the actual delay of
>> xhci_handshake() can be significantly longer. If the average udelay(1)
>> is greater than 2.2 us, the total time in xhci_handshake() - with
>> interrupts disabled can be > 11 seconds triggering the kernel's soft lockup
>> detector.
>>
>> To avoid this, let's replace the open coded io polling loop with one from
>> iopoll.h that uses a loop timed with the more presumably reliable ktime
>> infrastructure.
>>
>> Signed-off-by: Daniel Kurtz <[email protected]>
>
> Looks sane to me, nice fixup.
>
> Reviewed-by: Greg Kroah-Hartman <[email protected]>
>
> Is this causing problems on older kernels/devices today such that we
> should backport this?
>

A very similar patch was submitted some weeks ago by Andrey Smirnov.

https://lore.kernel.org/lkml/[email protected]/

His commit message only mentions that readl_poll_timeout_atomic() does the same job,
not about any issues with the loop, so I was going to send it forward to usb-next
after 5.1-rc (to 5.2).

-Mathias

2019-02-28 18:25:46

by Daniel Kurtz

[permalink] [raw]
Subject: Re: [PATCH] xhci: use iopoll for xhci_handshake

On Thu, Feb 28, 2019 at 12:09 AM Greg Kroah-Hartman
<[email protected]> wrote:
>
> On Wed, Feb 27, 2019 at 03:19:17PM -0700, Daniel Kurtz wrote:
> > In cases such as xhci_abort_cmd_ring(), xhci_handshake() is called with
> > a spin lock held (and local interrupts disabled) with a huge 5 second
> > timeout. This can translates to 5 million calls to udelay(1). By its
> > very nature, udelay() is not meant to be precise, it only guarantees to
> > delay a minimum of 1 microsecond. Therefore the actual delay of
> > xhci_handshake() can be significantly longer. If the average udelay(1)
> > is greater than 2.2 us, the total time in xhci_handshake() - with
> > interrupts disabled can be > 11 seconds triggering the kernel's soft lockup
> > detector.
> >
> > To avoid this, let's replace the open coded io polling loop with one from
> > iopoll.h that uses a loop timed with the more presumably reliable ktime
> > infrastructure.
> >
> > Signed-off-by: Daniel Kurtz <[email protected]>
>
> Looks sane to me, nice fixup.
>
> Reviewed-by: Greg Kroah-Hartman <[email protected]>
>
> Is this causing problems on older kernels/devices today such that we
> should backport this?

We detected that xhci_handshake timing out can lead to softlockup
while debugging a USB issue on a new product. The xhci_handshake
timeout itself is a symptom of another underlying problem causing some
commands to be aborted. I don't know if any such underlying problems
exist on other older devices, but the potential is there so a backport
is reasonable. Although, it may just shift the symptom of an
underlying problem from a softlockup/oops to some other symptom, like
USB just being dead.

-Dan

>
> thanks,
>
> greg k-h