2014-01-08 01:39:32

by David Cohen

[permalink] [raw]
Subject: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

From: jianqian <[email protected]>

There is a possible kernel panic faced on xhci_suspend().
Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
hub finishes initialization, it will trigger runtime_suspend and then
it will trigger xhci runtime suspend. But at that time, if
xhci->shared_hcd is still doing initialization, it is possible to face
null pointer kernel panic in xhci_suspend() function.

This patch checks if xhci->shared_hcd is null to avoid panic.

Signed-off-by: jianqian <[email protected]>
Signed-off-by: David Cohen <[email protected]>
---

This is the kernel panic. The bug was discovered on current LTS kernel 3.10, as
showed on logs. But the problem does not seem to be fixed so far.
Maybe we should consider apply it on kernel >= 3.10?

[ 5.037841] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[ 5.037854] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 5.037864] usb usb1: Product: xHCI Host Controller
[ 5.037875] usb usb1: Manufacturer: Linux 3.10.16-261428-g07aa93a xhci_hcd
[ 5.037885] usb usb1: SerialNumber: 0000:00:14.0
[ 5.038358] xHCI xhci_add_endpoint called for root hub
[ 5.038365] xHCI xhci_check_bandwidth called for root hub
[ 5.038553] hub 1-0:1.0: USB hub found
[ 5.038587] hub 1-0:1.0: 6 ports detected
[ 5.189460] BUG: unable to handle kernel NULL pointer dereference at 00000198
[ 5.189485] IP: [<c17dd9d2>] xhci_suspend+0x22/0x2e0
[ 5.189499] *pdpt = 0000000000000000 *pde = f000f3b5f000f3b0-
[ 5.189514] Oops: 0000 [#1] PREEMPT SMP-
[ 5.189524] Modules linked in:
[ 5.189539] CPU: 3 PID: 67 Comm: kworker/3:1 Not tainted 3.10.16-261428-g07aa93a #39
[ 5.189558] Workqueue: pm pm_runtime_work
[ 5.189565] task: f2be5fa0 ti: f26d0000 task.ti: f26d0000
[ 5.189576] EIP: 0060:[<c17dd9d2>] EFLAGS: 00010246 CPU: 3
[ 5.189587] EIP is at xhci_suspend+0x22/0x2e0
[ 5.189594] EAX: 00000000 EBX: f2738a00 ECX: c17eb820 EDX: 00000001
[ 5.189602] ESI: f2721000 EDI: f3054064 EBP: f26d1da4 ESP: f26d1d84
[ 5.189611] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 5.189619] CR0: 8005003b CR2: 00000198 CR3: 020fa000 CR4: 001007f0
[ 5.189626] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 5.189632] DR6: ffff0ff0 DR7: 00000400
[ 5.189636] Stack:
[ 5.189668] f26b0b90 f26d1da4 c1af6a35 f2738a00 c1fdcba4 f3054064 f2721000 f3054064
[ 5.189697] f26d1db4 c17eb84d 00000000 f2738a00 f26d1dd0 c17c40d8 00000001 f26d1dd8
[ 5.189726] f3054064 f30540d4 c1b8fbc0 f26d1df4 c17c41b3 f26d1e0c c1255fbc 00000000
[ 5.189730] Call Trace:
[ 5.189752] [<c1af6a35>] ? sub_preempt_count+0x55/0xe0
[ 5.189773] [<c17eb84d>] xhci_pci_suspend+0x2d/0x40
[ 5.189792] [<c17c40d8>] suspend_common+0x98/0x150
[ 5.189810] [<c17c41b3>] hcd_pci_runtime_suspend+0x23/0x60
[ 5.189827] [<c1255fbc>] ? __queue_work+0x10c/0x2f0
[ 5.189842] [<c17b9b75>] ? usb_suspend_both+0x135/0x1a0
[ 5.189861] [<c155ea34>] pci_pm_runtime_suspend+0x54/0x110
[ 5.189875] [<c1af6a35>] ? sub_preempt_count+0x55/0xe0
[ 5.189894] [<c155e9e0>] ? pci_pm_runtime_resume+0xb0/0xb0
[ 5.189907] [<c16a84ff>] __rpm_callback+0x2f/0x80
[ 5.189922] [<c16a8578>] rpm_callback+0x28/0x80
[ 5.189937] [<c16a86d0>] rpm_suspend+0x100/0x5f0
[ 5.189956] [<c16a9b51>] __pm_runtime_suspend+0x41/0x80
[ 5.189972] [<c155e8a0>] ? pci_uevent+0x120/0x120
[ 5.189988] [<c155e8d8>] pci_pm_runtime_idle+0x38/0x50
[ 5.190002] [<c16a84ff>] __rpm_callback+0x2f/0x80
[ 5.190016] [<c16a8d41>] rpm_idle+0x101/0x230
[ 5.190032] [<c16a9cd2>] pm_runtime_work+0x92/0xb0
[ 5.190048] [<c125668d>] process_one_work+0x11d/0x3d0
[ 5.190062] [<c1af6b3d>] ? add_preempt_count+0x7d/0xf0
[ 5.190076] [<c1af6a35>] ? sub_preempt_count+0x55/0xe0
[ 5.190093] [<c1256d19>] worker_thread+0xf9/0x320
[ 5.190109] [<c1af3676>] ? _raw_spin_unlock_irqrestore+0x26/0x50
[ 5.190126] [<c1256c20>] ? rescuer_thread+0x2b0/0x2b0
[ 5.190140] [<c125c7f4>] kthread+0x94/0xa0
[ 5.190154] [<c1af6a35>] ? sub_preempt_count+0x55/0xe0
[ 5.190177] [<c1af9f37>] ret_from_kernel_thread+0x1b/0x28
[ 5.190190] [<c125c760>] ? kthread_create_on_node+0xc0/0xc0
[ 5.190377] Code: 00 00 8d bc 27 00 00 00 00 55 89 e5 57 56 53 83 ec 14 3e 8d 74 26 00 8b 18 89 c6 83 bb 98 01 00 00 04 0f 85 79 02 00 00 8b 40 04 <83> b8 98 01 00 00 04 0f 85 69 02 00 00 f0 80 a3 58 01 00 00 fb
[ 5.190396] EIP: [<c17dd9d2>] xhci_suspend+0x22/0x2e0 SS:ESP 0068:f26d1d84
[ 5.190401] CR2: 0000000000000198
[ 5.190440] ---[ end trace b828c61ee573eafd ]---
[ 5.196507] Kernel panic - not syncing: Fatal exception
[ 5.567347] Rebooting in 40 seconds..
[ 45.589915] ACPI MEMORY or I/O RESET_REG.

drivers/usb/host/xhci.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index f052272b25c8..97c1d58f7baf 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -853,7 +853,14 @@ int xhci_suspend(struct xhci_hcd *xhci)
struct usb_hcd *hcd = xhci_to_hcd(xhci);
u32 command;

- if (hcd->state != HC_STATE_SUSPENDED ||
+ /*
+ * Need to check if xhci->shared_hcd is null to avoid kernel panic when
+ * boot up and xhci did not allocate xhci->shared_hcd yet.
+ * Due to autosuspend_delay set to 0, after usb1 finishes probe, xhci
+ * runtime suspend will trigger immediately, but xhci->shared_hcd may
+ * not be initialized yet.
+ */
+ if (hcd->state != HC_STATE_SUSPENDED || !xhci->shared_hcd ||
xhci->shared_hcd->state != HC_STATE_SUSPENDED)
return -EINVAL;

--
1.8.4.2


2014-01-08 01:45:31

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

On Tue, Jan 07, 2014 at 05:44:26PM -0800, David Cohen wrote:
> From: jianqian <[email protected]>
>
> There is a possible kernel panic faced on xhci_suspend().
> Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
> hub finishes initialization, it will trigger runtime_suspend and then
> it will trigger xhci runtime suspend. But at that time, if
> xhci->shared_hcd is still doing initialization, it is possible to face
> null pointer kernel panic in xhci_suspend() function.
>
> This patch checks if xhci->shared_hcd is null to avoid panic.

That sounds like this is a race that should be fixed properly, not just
papered over, right?

thanks,

greg k-h

2014-01-08 01:46:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

On Tue, Jan 07, 2014 at 05:44:26PM -0800, David Cohen wrote:
> From: jianqian <[email protected]>
>
> There is a possible kernel panic faced on xhci_suspend().
> Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
> hub finishes initialization, it will trigger runtime_suspend and then
> it will trigger xhci runtime suspend. But at that time, if
> xhci->shared_hcd is still doing initialization, it is possible to face
> null pointer kernel panic in xhci_suspend() function.
>
> This patch checks if xhci->shared_hcd is null to avoid panic.
>
> Signed-off-by: jianqian <[email protected]>
> Signed-off-by: David Cohen <[email protected]>
> ---
>
> This is the kernel panic. The bug was discovered on current LTS kernel 3.10, as
> showed on logs. But the problem does not seem to be fixed so far.
> Maybe we should consider apply it on kernel >= 3.10?

How do you trigger this? I've never seen anyone report this problem
before, is there something different in the hardware you are using that
enables this to be triggered easier?

thanks,

greg k-h

2014-01-08 03:49:20

by Tang, Jianqiang

[permalink] [raw]
Subject: RE: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

Hi,
1) I met this issue one time just boot up our Linux Platform(Kernel3.10) with XHCI driver, then kernel panic happen.

And this issue reported once by other internal team.

Nothing special of reproduce step and do not need special Hardware I think.

Just random issue which will happen when meet the timing condition.

2) This issue is introduced by this patch:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=596d789a211d134dc5f94d1e5957248c204ef850

which set all hub autosuspend delay to 0.

This causes race condition during XHCI driver initialization,

After USB2 hcd and USB2 root hub finish the initialization, USB2 root hub is functional and auto suspend right now, hence trigger XHCI runtime suspend flow;

At the same time, XHCI driver continue to initialize the USB3 hcd and assign to xhci->shared_hcd after finish the initialization;

Since xhci_suspend() use the xhci->shared_hcd, so there is race condition that when XHCI runtime suspend called, xhci->shared_hcd still NULL.

I think this patch is a fix solution since before XHCI finish the whole initialization, USB2 root hub triggered runtime suspend is mean less and do not need to handle.


Thanks!

-----Original Message-----
From: Greg KH [mailto:[email protected]]
Sent: Wednesday, January 08, 2014 9:47 AM
To: David Cohen
Cc: [email protected]; [email protected]; [email protected]; [email protected]; Tang, Jianqiang
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

On Tue, Jan 07, 2014 at 05:44:26PM -0800, David Cohen wrote:
> From: jianqian <[email protected]>
>
> There is a possible kernel panic faced on xhci_suspend().
> Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
> hub finishes initialization, it will trigger runtime_suspend and then
> it will trigger xhci runtime suspend. But at that time, if
> xhci->shared_hcd is still doing initialization, it is possible to face
> null pointer kernel panic in xhci_suspend() function.
>
> This patch checks if xhci->shared_hcd is null to avoid panic.
>
> Signed-off-by: jianqian <[email protected]>
> Signed-off-by: David Cohen <[email protected]>
> ---
>
> This is the kernel panic. The bug was discovered on current LTS kernel
> 3.10, as showed on logs. But the problem does not seem to be fixed so far.
> Maybe we should consider apply it on kernel >= 3.10?

How do you trigger this? I've never seen anyone report this problem before, is there something different in the hardware you are using that enables this to be triggered easier?

thanks,

greg k-h

2014-01-08 04:16:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()


A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Wed, Jan 08, 2014 at 03:49:07AM +0000, Tang, Jianqiang wrote:
> Hi,
> 1) I met this issue one time just boot up our Linux Platform(Kernel3.10) with XHCI driver, then kernel panic happen.
>
> And this issue reported once by other internal team.
>
> Nothing special of reproduce step and do not need special Hardware I think.
>
> Just random issue which will happen when meet the timing condition.
>
> 2) This issue is introduced by this patch:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=596d789a211d134dc5f94d1e5957248c204ef850
>
> which set all hub autosuspend delay to 0.

That patch was released in a kernel almost a full year ago, yet we have
never had a report of this happening before, so are you sure this patch
is the root cause?

> This causes race condition during XHCI driver initialization,
>
> After USB2 hcd and USB2 root hub finish the initialization, USB2 root hub is functional and auto suspend right now, hence trigger XHCI runtime suspend flow;
>
> At the same time, XHCI driver continue to initialize the USB3 hcd and assign to xhci->shared_hcd after finish the initialization;
>
> Since xhci_suspend() use the xhci->shared_hcd, so there is race condition that when XHCI runtime suspend called, xhci->shared_hcd still NULL.
>
> I think this patch is a fix solution since before XHCI finish the whole initialization, USB2 root hub triggered runtime suspend is mean less and do not need to handle.

With this patch applied, does the crash go away?

2014-01-08 05:41:06

by David Cohen

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

Hi Greg,

On Tue, Jan 07, 2014 at 08:16:20PM -0800, Greg KH wrote:
>
> A: No.
> Q: Should I include quotations after my reply?
>
> http://daringfireball.net/2007/07/on_top
>
> On Wed, Jan 08, 2014 at 03:49:07AM +0000, Tang, Jianqiang wrote:
> > Hi,
> > 1) I met this issue one time just boot up our Linux Platform(Kernel3.10) with XHCI driver, then kernel panic happen.
> >
> > And this issue reported once by other internal team.
> >
> > Nothing special of reproduce step and do not need special Hardware I think.
> >
> > Just random issue which will happen when meet the timing condition.
> >
> > 2) This issue is introduced by this patch:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=596d789a211d134dc5f94d1e5957248c204ef850
> >
> > which set all hub autosuspend delay to 0.
>
> That patch was released in a kernel almost a full year ago, yet we have
> never had a report of this happening before, so are you sure this patch
> is the root cause?

This bug happened in a platform with 1 usb3 host controller + 1 usb3 OTG
controller (Jianqiang, please correct me if I'm wrong).
How common is this configuration out there?

Br, David Cohen

>
> > This causes race condition during XHCI driver initialization,
> >
> > After USB2 hcd and USB2 root hub finish the initialization, USB2 root hub is functional and auto suspend right now, hence trigger XHCI runtime suspend flow;
> >
> > At the same time, XHCI driver continue to initialize the USB3 hcd and assign to xhci->shared_hcd after finish the initialization;
> >
> > Since xhci_suspend() use the xhci->shared_hcd, so there is race condition that when XHCI runtime suspend called, xhci->shared_hcd still NULL.
> >
> > I think this patch is a fix solution since before XHCI finish the whole initialization, USB2 root hub triggered runtime suspend is mean less and do not need to handle.
>
> With this patch applied, does the crash go away?

2014-01-08 15:48:09

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

On Tue, 7 Jan 2014, Greg KH wrote:

> On Tue, Jan 07, 2014 at 05:44:26PM -0800, David Cohen wrote:
> > From: jianqian <[email protected]>
> >
> > There is a possible kernel panic faced on xhci_suspend().
> > Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
> > hub finishes initialization, it will trigger runtime_suspend and then
> > it will trigger xhci runtime suspend. But at that time, if
> > xhci->shared_hcd is still doing initialization, it is possible to face
> > null pointer kernel panic in xhci_suspend() function.
> >
> > This patch checks if xhci->shared_hcd is null to avoid panic.
>
> That sounds like this is a race that should be fixed properly, not just
> papered over, right?

That was my reaction too. The best way to solve the problem is to
prevent the USB-2 root hub from suspending until after the USB-3 root
hub has been registered.

Alan Stern

2014-01-08 19:48:38

by David Cohen

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

On Wed, Jan 08, 2014 at 10:48:06AM -0500, Alan Stern wrote:
> On Tue, 7 Jan 2014, Greg KH wrote:
>
> > On Tue, Jan 07, 2014 at 05:44:26PM -0800, David Cohen wrote:
> > > From: jianqian <[email protected]>
> > >
> > > There is a possible kernel panic faced on xhci_suspend().
> > > Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
> > > hub finishes initialization, it will trigger runtime_suspend and then
> > > it will trigger xhci runtime suspend. But at that time, if
> > > xhci->shared_hcd is still doing initialization, it is possible to face
> > > null pointer kernel panic in xhci_suspend() function.
> > >
> > > This patch checks if xhci->shared_hcd is null to avoid panic.
> >
> > That sounds like this is a race that should be fixed properly, not just
> > papered over, right?
>
> That was my reaction too. The best way to solve the problem is to
> prevent the USB-2 root hub from suspending until after the USB-3 root
> hub has been registered.

That makes sense. Thanks for the feedback.
I'll check for a new approach.

Br, David Cohen

>
> Alan Stern

2014-02-21 13:20:14

by Mathias Nyman

[permalink] [raw]
Subject: Re: [RFC/PATCH] usb/xhci: avoid kernel panic on xhci_suspend()

On 01/08/2014 09:53 PM, David Cohen wrote:
> On Wed, Jan 08, 2014 at 10:48:06AM -0500, Alan Stern wrote:
>> On Tue, 7 Jan 2014, Greg KH wrote:
>>
>>> On Tue, Jan 07, 2014 at 05:44:26PM -0800, David Cohen wrote:
>>>> From: jianqian <[email protected]>
>>>>
>>>> There is a possible kernel panic faced on xhci_suspend().
>>>> Due to kernel modified the hub autosupend_delay to 0s, after usb1 root
>>>> hub finishes initialization, it will trigger runtime_suspend and then
>>>> it will trigger xhci runtime suspend. But at that time, if
>>>> xhci->shared_hcd is still doing initialization, it is possible to face
>>>> null pointer kernel panic in xhci_suspend() function.
>>>>
>>>> This patch checks if xhci->shared_hcd is null to avoid panic.
>>>
>>> That sounds like this is a race that should be fixed properly, not just
>>> papered over, right?
>>
>> That was my reaction too. The best way to solve the problem is to
>> prevent the USB-2 root hub from suspending until after the USB-3 root
>> hub has been registered.
>
> That makes sense. Thanks for the feedback.
> I'll check for a new approach.
>

Could you check if this patch works for you?:

http://marc.info/?l=linux-usb&m=139298822514995&w=2

I'm not able to reproduce the original issue myself

-Mathias