LinuxLists.cc - Re: [PATCH v2] xhci: reduce xhci_handshake timeout in xhci

2022-02-14 14:04:23

Subject: Re: [PATCH v2] xhci: reduce xhci_handshake timeout in xhci_reset

Hi Mathias,

On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> > From: Daehwan Jung <[email protected]>
> >
> > xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> > controller reset and controller ready operations can be fatal to the
> > system when controller is timed out. Reduce the timeout to 1 second
> > and print a error message when the time out happens.
> >
> > Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
>
>
> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> intentionally increased the timeout to 10 seconds as that host might take 9
> seconds to complete reset. This was done almost 10 years ago so I don't know
> if it really is an issue anymore.
>
> Anyways, your patch might break Renesas 72021 instead of fixing it.

Unfortunately, yes :-( . We have this reduced timeout patch in our previous
commercialized products so thought this would be a good time to fix this
once for all. Since this patch has been 10 years long, not sure if any other
controllers also need 10 sec timeout. It would probably better

>
> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
>
> Lets see if there is another solution for your case.
>
> - Does a "guard interval" after writing the reset help?
> For example Intel xHCI needs 1ms before touching xHC after writing the reset bit

I will ask this question to our hardware team. Setting that one quirk from
DWC3 host might require other changes like this [1].
>
> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit?

The RESET bit never gets cleared from USBCMD register.

>
> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
> and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
> Have you identified which one is the problematic case?

The crash reports I have seen are pointing to

usb_remove_hcd()->xhci_stop()->xhci_reset()
>
> I think we halt the host in the above case first, meaning there should be no
> xHC interrupts when xhci_reset() is called. So if we could guarantee xhci interrupt
> isn't handled on this cpu, maybe we could somehow enable local interrupt after
> halting the host?
>
> haven't really thought this true yet, but something like this could e investigated:
>
> spin_lock_irqsave()
> xhci_halt()
> < enable interrupts, magically turn spin_lock_irqsave() to just keeping spin lock>
> xhci_reset()
> spin_unlock()

This is a very good suggestion. However, disabling preemption for 10 seconds
is also bad even on non-RT kernels like mobiles are using. Most of the SoCs
will have a watchdog which makes sure that all CPUs are schedulable and flag
this condition. The most important thread in the system could have just woken
on this CPU and it can't run until we drop the spin lock.

[1] https://lore.kernel.org/linux-usb/[email protected]/

Thanks,
Pavan

2022-02-15 12:40:29

by Daehwan Jung

[permalink] [raw]

Subject: Re: [PATCH v2] xhci: reduce xhci_handshake timeout in xhci_reset

Attachments:

(No filename) (3.67 kB)

2022-02-15 14:49:25

by Mathias Nyman

[permalink] [raw]

Subject: Re: [PATCH v2] xhci: reduce xhci_handshake timeout in xhci_reset

On 14.2.2022 15.53, Pavan Kondeti wrote:
> Hi Mathias,
>
> On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
>> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
>>> From: Daehwan Jung <[email protected]>
>>>
>>> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
>>> controller reset and controller ready operations can be fatal to the
>>> system when controller is timed out. Reduce the timeout to 1 second
>>> and print a error message when the time out happens.
>>>
>>> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
>>
>>
>> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
>> intentionally increased the timeout to 10 seconds as that host might take 9
>> seconds to complete reset. This was done almost 10 years ago so I don't know
>> if it really is an issue anymore.
>>
>> Anyways, your patch might break Renesas 72021 instead of fixing it.
>
> Unfortunately, yes :-( . We have this reduced timeout patch in our previous
> commercialized products so thought this would be a good time to fix this
> once for all. Since this patch has been 10 years long, not sure if any other
> controllers also need 10 sec timeout. It would probably better
>
>>
>> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
>>
>> Lets see if there is another solution for your case.
>>
>> - Does a "guard interval" after writing the reset help?
>> For example Intel xHCI needs 1ms before touching xHC after writing the reset bit
>
> I will ask this question to our hardware team. Setting that one quirk from
> DWC3 host might require other changes like this [1].
>>
>> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit?
>
> The RESET bit never gets cleared from USBCMD register.
>
>>
>> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
>> and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
>> Have you identified which one is the problematic case?
>
> The crash reports I have seen are pointing to
>
> usb_remove_hcd()->xhci_stop()->xhci_reset()

Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
disabled and spinlock held. In both these cases we're not that interested in the
outcome of xhci_reset().

But during probe we call xhci_reset() with interrupts enabled without spinlock,
and here we really care about it succeeding.
I'm also guessing reset could take a longer time during probe due to possible recent
BIOS handover, or firmware loading etc.

So how about passing a timeout value to xhci_reset()?
Give it 10 seconds during probe, and 250ms in the other cases.

Then dig into the reason why xhci_stop(), xhci_resume() and xhci_shutdown() call
xhci_reset() witch spin_lock_irq() held.

-Mathias

2022-02-16 07:05:28

by Pavan Kondeti

[permalink] [raw]

Subject: Re: [PATCH v2] xhci: reduce xhci_handshake timeout in xhci_reset

Hi Mathias,

On Tue, Feb 15, 2022 at 12:16:12PM +0200, Mathias Nyman wrote:
> On 14.2.2022 15.53, Pavan Kondeti wrote:
> > Hi Mathias,
> >
> > On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
> >> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> >>> From: Daehwan Jung <[email protected]>
> >>>
> >>> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> >>> controller reset and controller ready operations can be fatal to the
> >>> system when controller is timed out. Reduce the timeout to 1 second
> >>> and print a error message when the time out happens.
> >>>
> >>> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> >>
> >>
> >> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> >> intentionally increased the timeout to 10 seconds as that host might take 9
> >> seconds to complete reset. This was done almost 10 years ago so I don't know
> >> if it really is an issue anymore.
> >>
> >> Anyways, your patch might break Renesas 72021 instead of fixing it.
> >
> > Unfortunately, yes :-( . We have this reduced timeout patch in our previous
> > commercialized products so thought this would be a good time to fix this
> > once for all. Since this patch has been 10 years long, not sure if any other
> > controllers also need 10 sec timeout. It would probably better
> >
> >>
> >> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
> >>
> >> Lets see if there is another solution for your case.
> >>
> >> - Does a "guard interval" after writing the reset help?
> >> For example Intel xHCI needs 1ms before touching xHC after writing the reset bit
> >
> > I will ask this question to our hardware team. Setting that one quirk from
> > DWC3 host might require other changes like this [1].
> >>
> >> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit?
> >
> > The RESET bit never gets cleared from USBCMD register.
> >
> >>
> >> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
> >> and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
> >> Have you identified which one is the problematic case?
> >
> > The crash reports I have seen are pointing to
> >
> > usb_remove_hcd()->xhci_stop()->xhci_reset()
>
> Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
> disabled and spinlock held. In both these cases we're not that interested in the
> outcome of xhci_reset().
>
> But during probe we call xhci_reset() with interrupts enabled without spinlock,
> and here we really care about it succeeding.
> I'm also guessing reset could take a longer time during probe due to possible recent
> BIOS handover, or firmware loading etc.
>
> So how about passing a timeout value to xhci_reset()?
> Give it 10 seconds during probe, and 250ms in the other cases.
>

Thanks for this suggestion.

This sounds better compared to the quirks approach. xhci_resume() also seems
to be calling xhci_reset() in the hibernation path, I believe we should treat
this like probe()/startup case and give larger timeout.

Thanks,
Pavan