From: Rajesh Bhagat <rajesh.bhagat@nxp.com>
To: Mathias Nyman <mathias.nyman@linux.intel.com>
CC: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
        "linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Sriram Dash <sriram.dash@nxp.com>
Subject: RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
 commmand timeout
Thread-Topic: [PATCH] usb: xhci: Fix incomplete PM resume operation due to
 XHCI commmand timeout
Thread-Index: AQHRgOP0koj0An7NrUiFwLeaADZ9zZ9fDpyAgAQ7AdCAAFktgIABTHuggAB1NoCAAQay8IAAsdaAgAdP/gCAAk/0gIACPkawgAC4cQCAANwGQA==
Date: Fri, 1 Apr 2016 03:55:11 +0000
Message-ID: <HE1PR0401MB2028787DE01C0C8E79520414E39A0@HE1PR0401MB2028.eurprd04.prod.outlook.com>
References: <1458284463-12743-1-git-send-email-rajesh.bhagat@nxp.com>
 <56EBE485.1060301@linux.intel.com>
 <HE1PR0401MB202857F4B183B84FCEDFBDA6E38F0@HE1PR0401MB2028.eurprd04.prod.outlook.com>
 <56EFBBD1.6080600@intel.com>
 <HE1PR0401MB20282AD00F64A8AC716C2E24E3800@HE1PR0401MB2028.eurprd04.prod.outlook.com>
 <56F1350B.7000006@linux.intel.com>
 <HE1PR0401MB20280AAAEC685FF3FC1DBF0EE3810@HE1PR0401MB2028.eurprd04.prod.outlook.com>
 <56F2A697.1000701@linux.intel.com>
 <HE1PR0401MB2028391A02548481538D4E38E3860@HE1PR0401MB2028.eurprd04.prod.outlook.com>
 <56FAB97F.4030206@linux.intel.com>
 <HE1PR0401MB2028D000B15CE0B9E998294CE3990@HE1PR0401MB2028.eurprd04.prod.outlook.com>
 <56FD35F4.3020005@linux.intel.com>
In-Reply-To: <56FD35F4.3020005@linux.intel.com>
Accept-Language: en-US
Content-Language: en-US
spamdiagnosticoutput: 1:23
spamdiagnosticmetadata: NSPM
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
X-MS-Exchange-CrossTenant-originalarrivaltime: 01 Apr 2016 03:55:11.0219
 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: 686ea1d3-bc2b-4c6f-a92c-d99c5c301635
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB3PR04MB0555
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 7832
Lines: 184


> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Thursday, March 31, 2016 8:07 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 31.03.2016 06:51, Rajesh Bhagat wrote:
> >
> >
> >
> > Hello Mathias,
> >
> > Thanks a lot for your support.
> >
> > Attached patch is working and allows the completion handler to be
> > called. And resume operation completes and shell comes back (but with lot of
> errors).
> >
> > Now, some other points which needs our attention here are:
> > 1. usb core hub code is not checking the error status of hcd->driver-
> >reset_device(xhci_discover_or_reset_device)
> >      and continues considering reset_device was successful (and causes other issues).
> Hence, a check is needed on return
> >      value of reset_device and proper action is required on it.
> > 2. Even if above return value is checked and reset_device status is used. USB core
> hub retries for N times which is not
> >      required in this case as adding to the resume operation time. But if in this case
> we return -ENOTCONN instead of -EINVAL
> >      from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns
> early and doesn't retry.
> >
> > Proposed Solution for above with your suggested patches is as below:
> >
> > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c index
> > 7cad1fa..efeba31 100644
> > --- a/drivers/usb/core/hub.c
> > +++ b/drivers/usb/core/hub.c
> > @@ -2807,13 +2807,17 @@ done:
> >                          struct usb_hcd *hcd = bus_to_hcd(udev->bus);
> >
> >                          update_devnum(udev, 0);
> > -                       /* The xHC may think the device is already reset,
> > -                        * so ignore the status.
> > +                       /*
> > +                        * check the status of resetting the device and update
> > +                        * the udev status.
> >                           */
> >                          if (hcd->driver->reset_device)
> > -                               hcd->driver->reset_device(hcd, udev);
> > +                               status =
> > + hcd->driver->reset_device(hcd, udev);
> >
> > -                       usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +                       if (status == 0)
> > +                               usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +                       else
> > +                               usb_set_device_state(udev,
> > + USB_STATE_NOTATTACHED);
> >                  }
> >          } else {
> >                  if (udev)
> > diff --git a/drivers/usb/host/xhci-ring.c
> > b/drivers/usb/host/xhci-ring.c index b3a0a22..9427175f 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> >                  return -ESHUTDOWN;
> >          }
> >
> > +       /* writing the CMD_RING_ABORT bit should create a command completion
> > +        * event, add a command completion timeout for it as well
> > +        */
> > +       mod_timer(&xhci->cmd_timer, jiffies +
> > + XHCI_CMD_DEFAULT_TIMEOUT);
> >          return 0;
> >   }
> >
> > @@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
> >          int ret;
> >          unsigned long flags;
> >          u64 hw_ring_state;
> > +       u32 old_status;
> >          struct xhci_command *cur_cmd = NULL;
> >          xhci = (struct xhci_hcd *) data;
> >
> > @@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
> >          spin_lock_irqsave(&xhci->lock, flags);
> >          if (xhci->current_cmd) {
> >                  cur_cmd = xhci->current_cmd;
> > +               old_status = cur_cmd->status;
> >                  cur_cmd->status = COMP_CMD_ABORT;
> >          }
> >
> > @@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long
> data)
> >          }
> >          /* command timeout on stopped ring, ring can't be aborted */
> >          xhci_dbg(xhci, "Command timeout on stopped ring\n");
> > -       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> > +
> > +       /* is this the second timeout for the same command? ring wont restart */
> > +       if (old_status == COMP_CMD_ABORT) {
> > +               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
> > +               xhci_cleanup_command_queue(xhci);
> > +       /* everything else here to halt, cleanup, free etc kill xhc */
> > +       } else
> > +               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> > +
> >          spin_unlock_irqrestore(&xhci->lock, flags);
> >          return;
> >   }
> > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index
> > c502c22..bd18966 100644
> > --- a/drivers/usb/host/xhci.c
> > +++ b/drivers/usb/host/xhci.c
> > @@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd,
> struct usb_device *udev)
> >                  if (ret == 1)
> >                          return 0;
> >                  else
> > -                       return -EINVAL;
> > +                       return -ENOTCONN;     /* Don't retry */
> >          }
> >
> >          if (virt_dev->tt_info)
> >
> > Sample output after above patch (timer is set as wakeup source):
> >
> > root@phoenix:~# echo mem > /sys/power/state
> > PM: Syncing filesystems ... done.
> > Freezing user space processes ... (elapsed 0.001 seconds) done.
> > Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > PM: suspend of devices complete after 155.658 msecs
> > PM: late suspend of devices complete after 1.594 msecs
> > PM: noirq suspend of devices complete after 1.496 msecs
> > PM: noirq resume of devices complete after 1.290 msecs
> > PM: early resume of devices complete after 1.250 msecs usb usb1: root
> > hub lost power or was reset usb usb2: root hub lost power or was reset
> > xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> > xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> > xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> > xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> > xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> > xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> > PM: resume of devices complete after 25436.366 msecs     <= resume time is
> increased a lot even after using -ENOTCONN
> > Restarting tasks ...
> > usb 1-1: USB disconnect, device number 2 usb 1-1.2: USB disconnect,
> > device number 3 usb 2-1: USB disconnect, device number 2 done.
> > root@phoenix:~#
> >
> > Please share your opinion on other changes for patch submission as well as resume
> time.
> >

> 
> I think more effort should be put into investigating why this happens in the first place.
> What is the root cause? why doesn't xhci start properly after resume in this case.
> 
Hello Mathias, 

I understand your point and would surely debug the root cause of the issue. But implementing the 
fallback mechanism in SW, if HW does not respond well seems to a better solution to me. 

Just to add, code prior to common implementation of xhci_handle_command_timeout was handling the above 
case and was __not__ dependent on HW. 

Please comment on the possibility of above fallback mechanism in XHCI SW and any side effects that you 
can foresee. 

> Optimizing resume time and error paths should be secondary here, resuming faster
> isn't really a matter when xhci is completely stuck.
> 

I agree on above point. 

- Rajesh Bhagat 

> -Mathias