Date: Wed, 27 Apr 2016 14:13:56 -0500
From: Bin Liu <b-liu@ti.com>
To: Maxim Uvarov <muvarov@gmail.com>
CC: <linux-kernel@vger.kernel.org>, <linux-usb@vger.kernel.org>,
        Greg KH <gregkh@linuxfoundation.org>,
        <sergei.shtylyov@cogentembedded.com>
Subject: Re: [PATCHv2] musb_host: fix lockup on rxcsr_h_error
Message-ID: <20160427191356.GA16821@uda0271908>
Mail-Followup-To: Bin Liu <b-liu@ti.com>, Maxim Uvarov <muvarov@gmail.com>,
	linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org,
	Greg KH <gregkh@linuxfoundation.org>,
	sergei.shtylyov@cogentembedded.com
References: <1461739918-9583-1-git-send-email-muvarov@gmail.com>
 <20160427154620.GB5604@uda0271908>
 <CAJGZr0JNgYccDdsPLPWGyY3ASnN4PBxfJuhxeEEWk87T0qh2fg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <CAJGZr0JNgYccDdsPLPWGyY3ASnN4PBxfJuhxeEEWk87T0qh2fg@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4984
Lines: 140

Hi,

On Wed, Apr 27, 2016 at 09:26:10PM +0300, Maxim Uvarov wrote:
> 2016-04-27 18:46 GMT+03:00 Bin Liu <b-liu@ti.com>:
> > Hi,
> >
> > On Wed, Apr 27, 2016 at 09:51:58AM +0300, Max Uvarov wrote:
> >> Fix soft lockup when resetting remote device attached
> >> to usb host. Configuration:
> >> pppd -> musb hub -> usb-serial -> gsm modem
> >
> > I have heard a few reports similar to this symptom, but never been able
> > to reproduce it on my side.
> >
> 
> Ok, I can reproduce it almost very easy.
> 
> >> When gsm modem resets, musb rolls in incoming rx interrupts
> >> which does not give any time to other application as result
> >> it totally lock ups. Solution is to keep original logic for RXCSR_H_ERROR
> >
> > Have you looked where exact place in the interrupt routine the execution
> > has stuck in?
> >
> 
> It does not stuck. It goes to that line which print proto error over
> and over again and
> nothing stops that. After some time kernel reports lockup. But
> actually it's not stuck,
> all cpu time was eaten by executing that handlers.
> 
> 
> >> and merge RXCSR_DATAERROR and RXCSR_H_ERROR branches to call same code
> >> for setting rx stall with MUSB_RXCSR_H_WZC_BITS.
> >
> > MUSB_RXCSR_H_WZC_BITS itself does not set rx stall, it just ensures
> > MUSB_RXCSR_H_RXSTALL not to be cleared. Please check its comment in
> > musb_regs.h.
> >
> >>
> >> Signed-off-by: Max Uvarov <muvarov@gmail.com>
> >> ---
> >>  v2: use bitwise or for error flags before logical and. (Sergei Shtylyov).
> >>
> >>  drivers/usb/musb/musb_host.c | 12 +++++-------
> >>  1 file changed, 5 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/usb/musb/musb_host.c b/drivers/usb/musb/musb_host.c
> >> index c3d5fc9..2d9aa78 100644
> >> --- a/drivers/usb/musb/musb_host.c
> >> +++ b/drivers/usb/musb/musb_host.c
> >> @@ -1592,14 +1592,12 @@ void musb_host_rx(struct musb *musb, u8 epnum)
> >
> > What kernel do you use? This line # is away off from upstream kernel.
> >
> 
> I did this patch for 4.1 but 4.6 has the same problem and patch
> cleanly applies to the latest torvalds/linux.git v4.6-rc5. This
> interrupt handler has the same code.  And looks like on 3.14

Yeah, this code hasn't been chaned for year. But in general, it is
prepfered to create patches on latest kernel to avoid other headache.

> everything worked. I don't have a time to diff 2 versions. Might be
> regression.
> 
> 
> >>
> >>               /* stall; record URB status */
> >>               status = -EPIPE;
> >> +     } else if (rx_csr & (MUSB_RXCSR_DATAERROR | MUSB_RXCSR_H_ERROR)) {
> >>
> >> -     } else if (rx_csr & MUSB_RXCSR_H_ERROR) {
> >> -             dev_dbg(musb->controller, "end %d RX proto error\n", epnum);
> >> -
> >> -             status = -EPROTO;
> >> -             musb_writeb(epio, MUSB_RXINTERVAL, 0);
> >> -
> >> -     } else if (rx_csr & MUSB_RXCSR_DATAERROR) {
> >> +             if (rx_csr & MUSB_RXCSR_H_ERROR) {
> >> +                     status = -EPROTO;
> >> +                     musb_writeb(epio, MUSB_RXINTERVAL, 0);
> >> +             }
> >
> > Please help me to understand how this change fixes the issue. I see the
> > most effect of the change here is directly 'goto finish' so that 'done'
> > flag is not set, then musb_advance_schedule() is not called. Is this the
> > case or I missed other important pieces?
> >
> 
> Right that is the goal. On this rxcsr_h_error kernel reschedules
> current interrupt.  And that continues forever. For example adding

The MUSB Programming Guide says CPU should clear this MUSB_RXCSR_H_ERROR
bit, but the current driver doesn't. I am wondering if this causes the
controller keeps generating the same interrupt. Can you please try the
following change instead to see if the lockup goes away?

@@ -1870,6 +1870,9 @@ void musb_host_rx(struct musb *musb, u8 epnum)
                status = -EPROTO;
                musb_writeb(epio, MUSB_RXINTERVAL, 0);
 
+               rx_csr &= ~MUSB_RXCSR_H_ERROR;
+               musb_writew(epio, MUSB_RXCSR, rx_csr);
+
        } else if (rx_csr & MUSB_RXCSR_DATAERROR) {
 
                if (USB_ENDPOINT_XFER_ISOC != qh->type) {

Regards,
-Bin.

> msleep() can give some time for other processes. I'm not an expert in
> this chip but I think that right solution in that case is not try to
> reschedule and quick and allow hub to make reset and once again init
> all devices (in my case ppp/pppd also shutdowns and then I bring
> everything up with script.). The same behavior with dma and pio mode.
> 
> Regards,
> Max.
> 
> > Thanks,
> > -Bin.
> >
> >>
> >>               if (USB_ENDPOINT_XFER_ISOC != qh->type) {
> >>                       dev_dbg(musb->controller, "RX end %d NAK timeout\n", epnum);
> >> --
> >> 1.9.1
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Best regards,
> Maxim Uvarov