2020-07-20 08:40:36

by Michal Hocko

Subject: kworker/0:3+pm hogging CPU

Hi,
I am not sure when this started happening, but at least 5.6, 5.7 and
5.8-rc6 all exhibit this problem. Please note that this is happening
on a laptop which is not completely healthy. The BIOS complains that the
battery cannot be charged because the power supply cannot be recognized.
This is something I have had for a long time without noticing the
kworker/0:3+pm problem, so I do not think they are related, but who knows.

Anyway, kworker/0:3+pm is hogging CPU. /proc/<pid>/stack is very often
empty, so I suspect it just schedules very often and performs only a small
amount of work. Here are a few backtraces I've managed to catch:

[<0>] preempt_schedule_thunk+0x16/0x18
[<0>] usb_hcd_submit_urb+0x686/0x773
[<0>] usb_start_wait_urb+0x59/0xc8
[<0>] usb_control_msg+0xd4/0x10d
[<0>] get_port_status+0x51/0x6f
[<0>] hub_port_status+0x41/0xa3
[<0>] hub_activate+0x230/0x4ee
[<0>] hub_resume+0x20/0x9a
[<0>] usb_resume_interface.isra.0+0xb0/0xd0
[<0>] usb_suspend_both+0xf6/0x20c
[<0>] usb_runtime_suspend+0x32/0x62
[<0>] __rpm_callback+0x70/0xd4
[<0>] rpm_callback+0x55/0x6b
[<0>] rpm_suspend+0x2af/0x440
[<0>] __pm_runtime_suspend+0x48/0x62
[<0>] usb_runtime_idle+0x26/0x2d
[<0>] __rpm_callback+0x70/0xd4
[<0>] rpm_idle+0x179/0x1df
[<0>] pm_runtime_work+0x6b/0x81
[<0>] process_one_work+0x1bd/0x2c6
[<0>] worker_thread+0x19c/0x240
[<0>] kthread+0x11b/0x123
[<0>] ret_from_fork+0x22/0x30

[<0>] slab_pre_alloc_hook+0x15/0x46
[<0>] __kmalloc+0x148/0x15a
[<0>] usb_hcd_submit_urb+0x5c6/0x773
[<0>] usb_start_wait_urb+0x59/0xc8
[<0>] usb_control_msg+0xd4/0x10d
[<0>] get_port_status+0x51/0x6f
[<0>] hub_port_status+0x41/0xa3
[<0>] hub_activate+0x230/0x4ee
[<0>] hub_resume+0x20/0x9a
[<0>] usb_resume_interface.isra.0+0xb0/0xd0
[<0>] usb_suspend_both+0xf6/0x20c

[<0>] slab_pre_alloc_hook+0x15/0x46
[<0>] __kmalloc+0x148/0x15a
[<0>] usb_hcd_submit_urb+0x5c6/0x773
[<0>] usb_start_wait_urb+0x59/0xc8
[<0>] usb_control_msg+0xd4/0x10d
[<0>] set_port_feature+0x33/0x38
[<0>] hub_suspend+0x15c/0x16a
[<0>] usb_suspend_both+0x91/0x20c
[<0>] usb_runtime_suspend+0x32/0x62
[<0>] __rpm_callback+0x70/0xd4
[<0>] rpm_callback+0x55/0x6b
[<0>] rpm_suspend+0x2af/0x440
[<0>] __pm_runtime_suspend+0x48/0x62
[<0>] usb_runtime_idle+0x26/0x2d
[<0>] __rpm_callback+0x70/0xd4
[<0>] rpm_idle+0x179/0x1df
[<0>] pm_runtime_work+0x6b/0x81
[<0>] process_one_work+0x1bd/0x2c6
[<0>] worker_thread+0x19c/0x240
[<0>] kthread+0x11b/0x123
[<0>] ret_from_fork+0x22/0x30
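
A loop along these lines can catch such snapshots (a sketch; it assumes
the worker's PID has already been put into $pid):

# sample the kernel stack a few times a second; reads come back empty
# whenever the worker is actually running rather than blocked
while true; do sudo cat /proc/$pid/stack; sleep 0.2; done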

Is this something known or something I can give more information about?
From a very quick look into the code it sounds as if the system wanted
to suspend a USB device/controller but that keeps failing again and
again.
--
Michal Hocko
SUSE Labs


2020-07-20 14:00:31

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Mon, Jul 20, 2020 at 10:39:56AM +0200, Michal Hocko wrote:
> Hi,
> I am not sure when this started happening, but at least 5.6, 5.7 and
> 5.8-rc6 all exhibit this problem. Please note that this is happening
> on a laptop which is not completely healthy. The BIOS complains that the
> battery cannot be charged because the power supply cannot be recognized.
> This is something I have had for a long time without noticing the
> kworker/0:3+pm problem, so I do not think they are related, but who knows.
>
> Anyway, kworker/0:3+pm is hogging CPU. /proc/<pid>/stack is very often
> empty, so I suspect it just schedules very often and performs only a small
> amount of work. Here are a few backtraces I've managed to catch:
>
> [<0>] preempt_schedule_thunk+0x16/0x18
> [<0>] usb_hcd_submit_urb+0x686/0x773
> [<0>] usb_start_wait_urb+0x59/0xc8
> [<0>] usb_control_msg+0xd4/0x10d
> [<0>] get_port_status+0x51/0x6f
> [<0>] hub_port_status+0x41/0xa3
> [<0>] hub_activate+0x230/0x4ee
> [<0>] hub_resume+0x20/0x9a
> [<0>] usb_resume_interface.isra.0+0xb0/0xd0
> [<0>] usb_suspend_both+0xf6/0x20c
> [<0>] usb_runtime_suspend+0x32/0x62
> [<0>] __rpm_callback+0x70/0xd4
> [<0>] rpm_callback+0x55/0x6b
> [<0>] rpm_suspend+0x2af/0x440
> [<0>] __pm_runtime_suspend+0x48/0x62
> [<0>] usb_runtime_idle+0x26/0x2d
> [<0>] __rpm_callback+0x70/0xd4
> [<0>] rpm_idle+0x179/0x1df
> [<0>] pm_runtime_work+0x6b/0x81
> [<0>] process_one_work+0x1bd/0x2c6
> [<0>] worker_thread+0x19c/0x240
> [<0>] kthread+0x11b/0x123
> [<0>] ret_from_fork+0x22/0x30
>
> [<0>] slab_pre_alloc_hook+0x15/0x46
> [<0>] __kmalloc+0x148/0x15a
> [<0>] usb_hcd_submit_urb+0x5c6/0x773
> [<0>] usb_start_wait_urb+0x59/0xc8
> [<0>] usb_control_msg+0xd4/0x10d
> [<0>] get_port_status+0x51/0x6f
> [<0>] hub_port_status+0x41/0xa3
> [<0>] hub_activate+0x230/0x4ee
> [<0>] hub_resume+0x20/0x9a
> [<0>] usb_resume_interface.isra.0+0xb0/0xd0
> [<0>] usb_suspend_both+0xf6/0x20c
>
> [<0>] slab_pre_alloc_hook+0x15/0x46
> [<0>] __kmalloc+0x148/0x15a
> [<0>] usb_hcd_submit_urb+0x5c6/0x773
> [<0>] usb_start_wait_urb+0x59/0xc8
> [<0>] usb_control_msg+0xd4/0x10d
> [<0>] set_port_feature+0x33/0x38
> [<0>] hub_suspend+0x15c/0x16a
> [<0>] usb_suspend_both+0x91/0x20c
> [<0>] usb_runtime_suspend+0x32/0x62
> [<0>] __rpm_callback+0x70/0xd4
> [<0>] rpm_callback+0x55/0x6b
> [<0>] rpm_suspend+0x2af/0x440
> [<0>] __pm_runtime_suspend+0x48/0x62
> [<0>] usb_runtime_idle+0x26/0x2d
> [<0>] __rpm_callback+0x70/0xd4
> [<0>] rpm_idle+0x179/0x1df
> [<0>] pm_runtime_work+0x6b/0x81
> [<0>] process_one_work+0x1bd/0x2c6
> [<0>] worker_thread+0x19c/0x240
> [<0>] kthread+0x11b/0x123
> [<0>] ret_from_fork+0x22/0x30
>
> Is this something known or something I can give more information about?
> From a very quick look into the code it sounds as if the system wanted
> to suspend a USB device/controller but that keeps failing again and
> again.

Yes, that's what it looks like to me too. Or maybe the suspend succeeds
but then the device/controller immediately gets woken up again.

Can you provide the contents of /sys/kernel/debug/usb/devices and also a
usbmon trace showing a few rounds of this recurring activity? Perhaps a
section of the dmesg log with dynamic debugging enabled for the usbcore
module, as well.

Alan Stern

2020-07-20 14:35:32

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Mon 20-07-20 09:58:57, Alan Stern wrote:
[...]
> Can you provide the contents of /sys/kernel/debug/usb/devices and also a

attached.

> usbmon trace showing a few rounds of this recurring activity?

This is not my area so I will need some help here. I assume I should
look for debug/usb/usbmon, which contains quite a few files for me:
0s 0u 1s 1t 1u 2s 2t 2u
most of them produce data when I cat them.

> section of the dmesg log with dynamic debugging enabled for the usbcore
> module, as well.

Could you give me more detailed steps, please?
--
Michal Hocko
SUSE Labs


Attachments:
(No filename) (600.00 B)
usb.devices (3.01 kB)

2020-07-20 15:15:47

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Mon, Jul 20, 2020 at 04:32:13PM +0200, Michal Hocko wrote:
> On Mon 20-07-20 09:58:57, Alan Stern wrote:
> [...]
> > Can you provide the contents of /sys/kernel/debug/usb/devices and also a
>
> attached.

It looks like you've got just two devices, only one of which is in use,
on bus 1 (the non-SuperSpeed bus) and nothing on bus 2.

> > usbmon trace showing a few rounds of this recurring activity?
>
> This is not my area so I will need some help here. I assume I should
> look for debug/usb/usbmon, which contains quite a few files for me:
> 0s 0u 1s 1t 1u 2s 2t 2u
> most of them produce data when I cat them.

The interesting files are 1u (for bus 1) and 2u (for bus 2). At the
moment, though, we don't know which bus the troublesome
device/controller is on.
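
Capturing is just a matter of reading the file for a while, for example
(a sketch; it assumes debugfs is mounted at /sys/kernel/debug and that
bus 2 turns out to be the one to watch):

# record roughly 10 seconds of bus 2 traffic
sudo timeout 10 cat /sys/kernel/debug/usb/usbmon/2u > /tmp/usbmon-bus2.txt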

> > section of the dmesg log with dynamic debugging enabled for the usbcore
> > module, as well.
>
> > Could you give me more detailed steps, please?

Do:

sudo echo 'module usbcore =p' >/debug/dynamic_debug/control

Then wait long enough for some interesting messages to appear in the
kernel log (it should only take a few seconds if the worker thread is as
busy as you say) and collect the output from the dmesg command.

To turn dynamic debugging back off when you're finished, use the same
command with "=p" changed to "-p".
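
Spelled out as a full sequence (a sketch; note the redirection above runs
in your own shell rather than under sudo, so if it fails with "permission
denied", pipe through sudo tee instead -- debugfs is normally mounted at
/sys/kernel/debug):

# enable dynamic debug output for the usbcore module
echo 'module usbcore =p' | sudo tee /sys/kernel/debug/dynamic_debug/control
# let a few rounds of the recurring activity happen, then save the log
sleep 10
dmesg > /tmp/usbcore-debug.txt
# switch the debug output back off
echo 'module usbcore -p' | sudo tee /sys/kernel/debug/dynamic_debug/control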

Alan Stern

2020-07-20 16:35:58

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Mon 20-07-20 11:12:55, Alan Stern wrote:
[...]
> sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
>
> Then wait long enough for some interesting messages to appear in the
> kernel log (it should only take a few seconds if the worker thread is as
> busy as you say) and collect the output from the dmesg command.
>
> To turn dynamic debugging back off when you're finished, use the same
> command with "=p" changed to "-p".

[ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.396016] hub 2-0:1.0: hub_suspend
[ 95.396023] usb usb2: bus auto-suspend, wakeup 1
[ 95.396029] usb usb2: bus suspend fail, err -16
[ 95.396030] hub 2-0:1.0: hub_resume
[ 95.396094] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.396160] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.396166] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.396259] hub 2-0:1.0: hub_suspend
[ 95.396300] usb usb2: bus auto-suspend, wakeup 1
[ 95.396321] usb usb2: bus suspend fail, err -16
[ 95.396332] hub 2-0:1.0: hub_resume
[ 95.396436] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.396443] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.396449] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.396582] hub 2-0:1.0: hub_suspend
[ 95.396587] usb usb2: bus auto-suspend, wakeup 1
[ 95.396594] usb usb2: bus suspend fail, err -16
[ 95.396595] hub 2-0:1.0: hub_resume
[ 95.396718] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.396725] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.396731] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.396870] hub 2-0:1.0: hub_suspend
[ 95.396874] usb usb2: bus auto-suspend, wakeup 1
[ 95.396881] usb usb2: bus suspend fail, err -16
[ 95.396882] hub 2-0:1.0: hub_resume
[ 95.396911] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.396917] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.396975] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.397063] hub 2-0:1.0: hub_suspend
[ 95.397068] usb usb2: bus auto-suspend, wakeup 1
[ 95.397095] usb usb2: bus suspend fail, err -16
[ 95.397105] hub 2-0:1.0: hub_resume
[ 95.397200] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.397207] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.397213] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.397367] hub 2-0:1.0: hub_suspend
[ 95.397372] usb usb2: bus auto-suspend, wakeup 1
[ 95.397379] usb usb2: bus suspend fail, err -16
[ 95.397380] hub 2-0:1.0: hub_resume
[ 95.397412] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.397418] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.397470] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.397560] hub 2-0:1.0: hub_suspend
[ 95.397565] usb usb2: bus auto-suspend, wakeup 1
[ 95.397599] usb usb2: bus suspend fail, err -16
[ 95.397610] hub 2-0:1.0: hub_resume
[ 95.397740] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.397746] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.397753] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.397888] hub 2-0:1.0: hub_suspend
[ 95.397893] usb usb2: bus auto-suspend, wakeup 1
[ 95.397900] usb usb2: bus suspend fail, err -16
[ 95.397901] hub 2-0:1.0: hub_resume
[ 95.397989] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.398030] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.398036] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.398120] hub 2-0:1.0: hub_suspend
[ 95.398158] usb usb2: bus auto-suspend, wakeup 1
[ 95.398189] usb usb2: bus suspend fail, err -16
[ 95.398190] hub 2-0:1.0: hub_resume
[ 95.398231] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.398237] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.398243] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.398381] hub 2-0:1.0: hub_suspend
[ 95.398385] usb usb2: bus auto-suspend, wakeup 1
[ 95.398392] usb usb2: bus suspend fail, err -16
[ 95.398393] hub 2-0:1.0: hub_resume
[ 95.398479] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.398523] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.398529] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.398620] hub 2-0:1.0: hub_suspend
[ 95.398678] usb usb2: bus auto-suspend, wakeup 1
[ 95.398685] usb usb2: bus suspend fail, err -16
[ 95.398686] hub 2-0:1.0: hub_resume
[ 95.398717] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.398724] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.398730] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.398869] hub 2-0:1.0: hub_suspend
[ 95.398874] usb usb2: bus auto-suspend, wakeup 1
[ 95.398880] usb usb2: bus suspend fail, err -16
[ 95.398882] hub 2-0:1.0: hub_resume
[ 95.399004] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.399011] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.399016] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.399152] hub 2-0:1.0: hub_suspend
[ 95.399157] usb usb2: bus auto-suspend, wakeup 1
[ 95.399164] usb usb2: bus suspend fail, err -16
[ 95.399165] hub 2-0:1.0: hub_resume
[ 95.399194] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.399201] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.399241] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.399346] hub 2-0:1.0: hub_suspend
[ 95.399350] usb usb2: bus auto-suspend, wakeup 1
[ 95.399357] usb usb2: bus suspend fail, err -16
[ 95.399358] hub 2-0:1.0: hub_resume
[ 95.399481] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.399488] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.399493] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.399630] hub 2-0:1.0: hub_suspend
[ 95.399634] usb usb2: bus auto-suspend, wakeup 1
[ 95.399641] usb usb2: bus suspend fail, err -16
[ 95.399642] hub 2-0:1.0: hub_resume
[ 95.399671] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.399707] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.399750] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.399934] hub 2-0:1.0: hub_suspend
[ 95.399985] usb usb2: bus auto-suspend, wakeup 1
[ 95.399992] usb usb2: bus suspend fail, err -16
[ 95.399995] hub 2-0:1.0: hub_resume
[ 95.400027] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.400034] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.400093] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.400228] hub 2-0:1.0: hub_suspend
[ 95.400233] usb usb2: bus auto-suspend, wakeup 1
[ 95.400267] usb usb2: bus suspend fail, err -16
[ 95.400276] hub 2-0:1.0: hub_resume
[ 95.400363] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.400370] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.400377] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.400514] hub 2-0:1.0: hub_suspend
[ 95.400519] usb usb2: bus auto-suspend, wakeup 1
[ 95.400526] usb usb2: bus suspend fail, err -16
[ 95.400527] hub 2-0:1.0: hub_resume
[ 95.400558] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
[ 95.400565] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
[ 95.400571] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
[ 95.400709] hub 2-0:1.0: hub_suspend
[ 95.400714] usb usb2: bus auto-suspend, wakeup 1
[ 95.400721] usb usb2: bus suspend fail, err -16
[ 95.400722] hub 2-0:1.0: hub_resume
--
Michal Hocko
SUSE Labs

2020-07-20 17:38:44

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Mon, Jul 20, 2020 at 06:33:55PM +0200, Michal Hocko wrote:
> On Mon 20-07-20 11:12:55, Alan Stern wrote:
> [...]
> > sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
> >
> > Then wait long enough for some interesting messages to appear in the
> > kernel log (it should only take a few seconds if the worker thread is as
> > busy as you say) and collect the output from the dmesg command.
> >
> > To turn dynamic debugging back off when you're finished, use the same
> > command with "=p" changed to "-p".
>
> [ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
> [ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
> [ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
> [ 95.396016] hub 2-0:1.0: hub_suspend
> [ 95.396023] usb usb2: bus auto-suspend, wakeup 1
> [ 95.396029] usb usb2: bus suspend fail, err -16

Okay, that's the problem. You may get more specific information about
the suspend failure if you turn on dynamic debugging for the xhci-hcd
module (same command as above but with xhci-hcd instead of usbcore).

If you want a quick way to prevent this from happening, you can do:

sudo echo on >/sys/bus/usb/devices/usb2/power/control

That won't actually fix anything, but it will stop the system from
automatically trying to suspend the usb2 root hub.
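
To make that setting persist across reboots, a udev rule along these
lines should work (a sketch; the file name is arbitrary and the rule
simply matches the usb2 root hub by its kernel name):

# /etc/udev/rules.d/50-usb2-no-autosuspend.rules
ACTION=="add", SUBSYSTEM=="usb", KERNEL=="usb2", ATTR{power/control}="on"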

Alan Stern

2020-07-20 17:48:20

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Mon 20-07-20 13:38:07, Alan Stern wrote:
> On Mon, Jul 20, 2020 at 06:33:55PM +0200, Michal Hocko wrote:
> > On Mon 20-07-20 11:12:55, Alan Stern wrote:
> > [...]
> > > sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
> > >
> > > Then wait long enough for some interesting messages to appear in the
> > > kernel log (it should only take a few seconds if the worker thread is as
> > > busy as you say) and collect the output from the dmesg command.
> > >
> > > To turn dynamic debugging back off when you're finished, use the same
> > > command with "=p" changed to "-p".
> >
> > [ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
> > [ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
> > [ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
> > [ 95.396016] hub 2-0:1.0: hub_suspend
> > [ 95.396023] usb usb2: bus auto-suspend, wakeup 1
> > [ 95.396029] usb usb2: bus suspend fail, err -16
>
> Okay, that's the problem. You may get more specific information about
> the suspend failure if you turn on dynamic debugging for the xhci-hcd
> module (same command as above but with xhci-hcd instead of usbcore).

Would that help to pinpoint the problem?

> If you want a quick way to prevent this from happening, you can do:
>
> sudo echo on >/sys/bus/usb/devices/usb2/power/control
>
> That won't actually fix anything, but it will stop the system from
> automatically trying to suspend usb2 root hub.

Yes, that helped to give the machine some relief.

--
Michal Hocko
SUSE Labs

2020-07-20 17:49:16

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Mon, Jul 20, 2020 at 07:45:30PM +0200, Michal Hocko wrote:
> On Mon 20-07-20 13:38:07, Alan Stern wrote:
> > On Mon, Jul 20, 2020 at 06:33:55PM +0200, Michal Hocko wrote:
> > > On Mon 20-07-20 11:12:55, Alan Stern wrote:
> > > [...]
> > > > sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
> > > >
> > > > Then wait long enough for some interesting messages to appear in the
> > > > kernel log (it should only take a few seconds if the worker thread is as
> > > > busy as you say) and collect the output from the dmesg command.
> > > >
> > > > To turn dynamic debugging back off when you're finished, use the same
> > > > command with "=p" changed to "-p".
> > >
> > > [ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
> > > [ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
> > > [ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
> > > [ 95.396016] hub 2-0:1.0: hub_suspend
> > > [ 95.396023] usb usb2: bus auto-suspend, wakeup 1
> > > [ 95.396029] usb usb2: bus suspend fail, err -16
> >
> > Okay, that's the problem. You may get more specific information about
> > the suspend failure if you turn on dynamic debugging for the xhci-hcd
> > module (same command as above but with xhci-hcd instead of usbcore).
>
> Would that help to pinpoint the problem?

It might. I can't say for sure without seeing the output.

Alan Stern

2020-07-20 18:16:37

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Mon 20-07-20 13:48:12, Alan Stern wrote:
> On Mon, Jul 20, 2020 at 07:45:30PM +0200, Michal Hocko wrote:
> > On Mon 20-07-20 13:38:07, Alan Stern wrote:
> > > On Mon, Jul 20, 2020 at 06:33:55PM +0200, Michal Hocko wrote:
> > > > On Mon 20-07-20 11:12:55, Alan Stern wrote:
> > > > [...]
> > > > > sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
> > > > >
> > > > > Then wait long enough for some interesting messages to appear in the
> > > > > kernel log (it should only take a few seconds if the worker thread is as
> > > > > busy as you say) and collect the output from the dmesg command.
> > > > >
> > > > > To turn dynamic debugging back off when you're finished, use the same
> > > > > command with "=p" changed to "-p".
> > > >
> > > > [ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
> > > > [ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
> > > > [ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
> > > > [ 95.396016] hub 2-0:1.0: hub_suspend
> > > > [ 95.396023] usb usb2: bus auto-suspend, wakeup 1
> > > > [ 95.396029] usb usb2: bus suspend fail, err -16
> > >
> > > Okay, that's the problem. You may get more specific information about
> > > the suspend failure if you turn on dynamic debugging for the xhci-hcd
> > > module (same command as above but with xhci-hcd instead of usbcore).
> >
> > Would that help to pinpoint the problem?
>
> It might. I can't say for sure without seeing the output.

echo 'module xhci-hcd =p' >/sys/kernel/debug/dynamic_debug/control

doesn't yield any output. I have tried both with and without usbcore
debug output enabled. In the former case there is no new output, and in
the latter I get the same output I have posted earlier.
--
Michal Hocko
SUSE Labs

2020-07-20 20:03:20

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Mon, Jul 20, 2020 at 08:16:05PM +0200, Michal Hocko wrote:
> On Mon 20-07-20 13:48:12, Alan Stern wrote:
> > On Mon, Jul 20, 2020 at 07:45:30PM +0200, Michal Hocko wrote:
> > > On Mon 20-07-20 13:38:07, Alan Stern wrote:
> > > > On Mon, Jul 20, 2020 at 06:33:55PM +0200, Michal Hocko wrote:
> > > > > On Mon 20-07-20 11:12:55, Alan Stern wrote:
> > > > > [...]
> > > > > > sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
> > > > > >
> > > > > > Then wait long enough for some interesting messages to appear in the
> > > > > > kernel log (it should only take a few seconds if the worker thread is as
> > > > > > busy as you say) and collect the output from the dmesg command.
> > > > > >
> > > > > > To turn dynamic debugging back off when you're finished, use the same
> > > > > > command with "=p" changed to "-p".
> > > > >
> > > > > [ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
> > > > > [ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
> > > > > [ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
> > > > > [ 95.396016] hub 2-0:1.0: hub_suspend
> > > > > [ 95.396023] usb usb2: bus auto-suspend, wakeup 1
> > > > > [ 95.396029] usb usb2: bus suspend fail, err -16
> > > >
> > > > Okay, that's the problem. You may get more specific information about
> > > > the suspend failure if you turn on dynamic debugging for the xhci-hcd
> > > > module (same command as above but with xhci-hcd instead of usbcore).
> > >
> > > Would that help to pinpoint the problem?
> >
> > It might. I can't say for sure without seeing the output.
>
> echo 'module xhci-hcd =p' >/sys/kernel/debug/dynamic_debug/control
>
> doesn't yield any output. I have tried both with and without usbcore
> debug output enabled. In the former case there is no new output, and in
> the latter I get the same output I have posted earlier.

Sorry, my mistake. The module name needs to be "xhci_hcd" with an '_'
character, not a '-' character -- the same as what shows up in the lsmod
output.
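
That is (piping through sudo tee so the write itself runs privileged):

echo 'module xhci_hcd =p' | sudo tee /sys/kernel/debug/dynamic_debug/control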

You could also try collecting a usbmon trace from the .../2u file. It does
seem that bus 2 is where the problem occurs.

Alan Stern

2020-07-21 06:02:40

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Mon 20-07-20 16:02:43, Alan Stern wrote:
> On Mon, Jul 20, 2020 at 08:16:05PM +0200, Michal Hocko wrote:
> > On Mon 20-07-20 13:48:12, Alan Stern wrote:
> > > On Mon, Jul 20, 2020 at 07:45:30PM +0200, Michal Hocko wrote:
> > > > On Mon 20-07-20 13:38:07, Alan Stern wrote:
> > > > > On Mon, Jul 20, 2020 at 06:33:55PM +0200, Michal Hocko wrote:
> > > > > > On Mon 20-07-20 11:12:55, Alan Stern wrote:
> > > > > > [...]
> > > > > > > sudo echo 'module usbcore =p' >/debug/dynamic_debug/control
> > > > > > >
> > > > > > > Then wait long enough for some interesting messages to appear in the
> > > > > > > kernel log (it should only take a few seconds if the worker thread is as
> > > > > > > busy as you say) and collect the output from the dmesg command.
> > > > > > >
> > > > > > > To turn dynamic debugging back off when you're finished, use the same
> > > > > > > command with "=p" changed to "-p".
> > > > > >
> > > > > > [ 95.395815] hub 2-0:1.0: state 7 ports 6 chg 0006 evt 0000
> > > > > > [ 95.395824] usb usb2-port1: status 0088, change 0000, 5.0 Gb/s
> > > > > > [ 95.395838] usb usb2-port2: status 0088, change 0000, 5.0 Gb/s
> > > > > > [ 95.396016] hub 2-0:1.0: hub_suspend
> > > > > > [ 95.396023] usb usb2: bus auto-suspend, wakeup 1
> > > > > > [ 95.396029] usb usb2: bus suspend fail, err -16
> > > > >
> > > > > Okay, that's the problem. You may get more specific information about
> > > > > the suspend failure if you turn on dynamic debugging for the xhci-hcd
> > > > > module (same command as above but with xhci-hcd instead of usbcore).
> > > >
> > > > Would that help to pinpoint the problem?
> > >
> > > It might. I can't say for sure without seeing the output.
> >
> > echo 'module xhci-hcd =p' >/sys/kernel/debug/dynamic_debug/control
> >
> > doesn't yield any output. I have tried both with and without usbcore
> > debug output enabled. In the former case there is no new output, and in
> > the latter I get the same output I have posted earlier.
>
> Sorry, my mistake. The module name needs to be "xhci_hcd" with an '_'
> character, not a '-' character -- the same as what shows up in the lsmod
> output.


[14766.973734] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
[14766.973738] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
[14766.973742] xhci_hcd 0000:00:14.0: Get port status 2-3 read: 0xe0002a0, return 0x2a0
[14766.973746] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0xe0002a0, return 0x2a0
[14766.973750] xhci_hcd 0000:00:14.0: Get port status 2-5 read: 0xe0002a0, return 0x2a0
[14766.973754] xhci_hcd 0000:00:14.0: Get port status 2-6 read: 0xe0002a0, return 0x2a0
[14766.973759] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
[14766.973763] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
[14766.973771] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 0 status = 0xe000088
[14766.973780] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 1 status = 0xe000088
[14766.973789] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 2 status = 0xe0002a0
[14766.973798] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 3 status = 0xe0002a0
[14766.973807] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 4 status = 0xe0002a0
[14766.973816] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 5 status = 0xe0002a0
[14766.973830] xhci_hcd 0000:00:14.0: Bus suspend bailout, port over-current detected

Repeating again and again. The last message suggests a HW problem? But
why does the kernel try the same thing over and over?

> You could also try collecting a usbmon trace from the .../2u file. It does
> seem that bus 2 is where the problem occurs.

a random sample

ffff8882447743c0 2551256361 S Ii:2:001:1 -115:2048 4 <
ffff888040808cc0 2551256367 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551256373 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551256377 S Co:2:001:0 s 23 03 001b 0701 0000 0
ffff888040808cc0 2551256384 C Co:2:001:0 0 0
ffff888040808cc0 2551256432 C Co:2:001:0 0 0
ffff888040808cc0 2551256482 C Co:2:001:0 0 0
ffff888040808cc0 2551256531 C Co:2:001:0 0 0
ffff888040808cc0 2551256535 S Co:2:001:0 s 23 03 001b 0705 0000 0
ffff888040808cc0 2551256542 C Co:2:001:0 0 0
ffff888040808cc0 2551256551 C Co:2:001:0 0 0
ffff8882447743c0 2551256555 C Ii:2:001:1 -2:2048 0
ffff888040808cc0 2551256562 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff888040808cc0 2551256566 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551256568 S Ci:2:001:0 s a3 00 0000 0002 0004 4 <
ffff888040808cc0 2551256573 S Ci:2:001:0 s a3 00 0000 0003 0004 4 <
ffff888040808cc0 2551256579 S Ci:2:001:0 s a3 00 0000 0004 0004 4 <
ffff888040808cc0 2551256584 S Ci:2:001:0 s a3 00 0000 0005 0004 4 <
ffff888040808cc0 2551256625 S Ci:2:001:0 s a3 00 0000 0006 0004 4 <
ffff888040808cc0 2551256686 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff888040808cc0 2551256689 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551256735 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551256745 C Co:2:001:0 0 0
ffff888040808cc0 2551256748 S Co:2:001:0 s 23 03 001b 0702 0000 0
ffff888040808cc0 2551256755 C Co:2:001:0 0 0
ffff888040808cc0 2551256757 S Co:2:001:0 s 23 03 001b 0703 0000 0
ffff888040808cc0 2551256764 C Co:2:001:0 0 0
ffff888040808cc0 2551256802 C Co:2:001:0 0 0
ffff888040808cc0 2551256843 C Co:2:001:0 0 0
ffff888040808cc0 2551256871 C Co:2:001:0 0 0
ffff8882447743c0 2551256914 C Ii:2:001:1 -2:2048 0
ffff888040808cc0 2551256922 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff888040808cc0 2551256925 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551256927 S Ci:2:001:0 s a3 00 0000 0002 0004 4 <
ffff888040808cc0 2551256933 S Ci:2:001:0 s a3 00 0000 0003 0004 4 <
ffff888040808cc0 2551256936 C Ci:2:001:0 0 4 = a0020000
ffff888040808cc0 2551256942 C Ci:2:001:0 0 4 = a0020000
ffff8882447743c0 2551257018 S Ii:2:001:1 -115:2048 4 <
ffff888040808cc0 2551257070 S Ci:2:001:0 s a3 00 0000 0002 0004 4 <
ffff888040808cc0 2551257106 C Co:2:001:0 0 0
ffff888040808cc0 2551257115 C Co:2:001:0 0 0
ffff888040808cc0 2551257125 C Co:2:001:0 0 0
ffff888040808cc0 2551257288 C Co:2:001:0 0 0
ffff888040808cc0 2551257290 S Co:2:001:0 s 23 03 001b 0706 0000 0
ffff888040808cc0 2551257297 C Co:2:001:0 0 0
ffff8882447743c0 2551257300 C Ii:2:001:1 -2:2048 0
ffff888040808cc0 2551257329 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff888040808cc0 2551257367 S Ci:2:001:0 s a3 00 0000 0002 0004 4 <
ffff888040808cc0 2551257408 S Ci:2:001:0 s a3 00 0000 0004 0004 4 <
ffff888040808cc0 2551257461 C Ci:2:001:0 0 4 = a0020000
ffff8882447743c0 2551257762 S Ii:2:001:1 -115:2048 4 <
ffff888040808cc0 2551257815 S Co:2:001:0 s 23 03 001b 0701 0000 0
ffff888040808cc0 2551257823 C Co:2:001:0 0 0
ffff888040808cc0 2551257839 C Co:2:001:0 0 0
ffff888040808cc0 2551257877 C Co:2:001:0 0 0
ffff888040808cc0 2551257947 C Co:2:001:0 0 0
ffff888040808cc0 2551257996 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff888040808cc0 2551258005 S Ci:2:001:0 s a3 00 0000 0003 0004 4 <
ffff888040808cc0 2551258017 C Ci:2:001:0 0 4 = a0020000
ffff888040808cc0 2551258057 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
ffff888040808cc0 2551258140 C Co:2:001:0 0 0
ffff888040808cc0 2551258179 C Co:2:001:0 0 0
ffff888040808cc0 2551258257 C Co:2:001:0 0 0
ffff888040808cc0 2551258265 C Co:2:001:0 0 0
ffff888040808cc0 2551258273 C Co:2:001:0 0 0
ffff888040808cc0 2551258286 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551258294 C Ci:2:001:0 0 4 = a0020000
ffff888040808cc0 2551258391 C Ci:2:001:0 0 4 = a0020000
ffff888040808cc0 2551258448 C Ci:2:001:0 0 4 = 88000000
ffff888040808cc0 2551258462 C Co:2:001:0 0 0
ffff888040808cc0 2551258470 C Co:2:001:0 0 0
ffff888040808cc0 2551258486 C Co:2:001:0 0 0
--
Michal Hocko
SUSE Labs

2020-07-21 14:34:19

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Tue, Jul 21, 2020 at 07:59:17AM +0200, Michal Hocko wrote:
> > Sorry, my mistake. The module name needs to be "xhci_hcd" with an '_'
> > character, not a '-' character -- the same as what shows up in the lsmod
> > output.
>
>
> [14766.973734] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
> [14766.973738] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
> [14766.973742] xhci_hcd 0000:00:14.0: Get port status 2-3 read: 0xe0002a0, return 0x2a0
> [14766.973746] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0xe0002a0, return 0x2a0
> [14766.973750] xhci_hcd 0000:00:14.0: Get port status 2-5 read: 0xe0002a0, return 0x2a0
> [14766.973754] xhci_hcd 0000:00:14.0: Get port status 2-6 read: 0xe0002a0, return 0x2a0
> [14766.973759] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
> [14766.973763] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88

According to the xHCI specification, those 02a0 values are normal and
the 0088 values indicate the port is disabled and has an over-current
condition. I don't know about the e000 bits in the upper part of the
word; according to my copy of the spec those bits should be 0.

If your machine has only two physical SuperSpeed (USB-3) ports then
perhaps the other four ports are internally wired in a way that creates
a permanent over-current indication.
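
For the record, the two status values can be picked apart with shell
arithmetic (a sketch; masks per the USB 3.0 hub-class port status layout:
bit 3 = over-current, bits 5-8 = link state, bit 9 = SuperSpeed port power):

for s in 0x088 0x2a0; do
  printf '%s: over-current=%d link-state=0x%03x power=%d\n' \
    "$s" $(( (s & 0x008) != 0 )) $(( s & 0x1e0 )) $(( (s & 0x200) != 0 ))
done
# 0x088 decodes to over-current set, link state 0x080 (SS.Disabled), power off
# 0x2a0 decodes to no over-current, link state 0x0a0 (Rx.Detect), power on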

> [14766.973771] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 0 status = 0xe000088
> [14766.973780] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 1 status = 0xe000088
> [14766.973789] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 2 status = 0xe0002a0
> [14766.973798] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 3 status = 0xe0002a0
> [14766.973807] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 4 status = 0xe0002a0
> [14766.973816] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 5 status = 0xe0002a0
> [14766.973830] xhci_hcd 0000:00:14.0: Bus suspend bailout, port over-current detected
>
> Repeating again and again. The last message suggests a HW problem? But
> why does the kernel try the same thing over and over?

Because over-current is supposed to be a transient condition that goes
away quickly. It means there's a short circuit or something similar.

> > You could also try collecting a usbmon trace from the .../2u file. It does
> > seem that bus 2 is where the problem occurs.
>
> a random sample

...
> ffff888040808cc0 2551256922 S Ci:2:001:0 s a3 00 0000 0001 0004 4 <
> ffff888040808cc0 2551256925 C Ci:2:001:0 0 4 = 88000000
> ffff888040808cc0 2551256927 S Ci:2:001:0 s a3 00 0000 0002 0004 4 <
> ffff888040808cc0 2551256933 S Ci:2:001:0 s a3 00 0000 0003 0004 4 <
> ffff888040808cc0 2551256936 C Ci:2:001:0 0 4 = a0020000
> ffff888040808cc0 2551256942 C Ci:2:001:0 0 4 = a0020000
...

The usbmon output has a lot of gaps, but here we can see some of the
port status data contains 0x0088 (four bytes little-endian) and some
contains 0x02a0 -- the same as what the debugging log says.

So yes, this looks like a hardware design error. Turning off
autosuspend by writing to the sysfs power/control file is probably the
best way to handle the problem.

Alan Stern

2020-07-21 16:01:24

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Tue 21-07-20 10:33:25, Alan Stern wrote:
[...]

Thanks a lot for your analysis. The laptop is slowly dying, so this could
be related.

> So yes, this looks like a hardware design error. Turning off
> autosuspend by writing to the sysfs power/control file is probably the
> best way to handle the problem.

I can use that workaround. But it seems that this also prevents me from
suspending the machine to RAM, which fails with
PM: Some devices failed to suspend, or early wake event detected

:/

If there is any workaround for that then I would be really happy, but
considering the overall state of the machine I suspect this is not the
case.

Thanks again for your great help!
--
Michal Hocko
SUSE Labs

2020-07-21 16:16:00

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Tue, Jul 21, 2020 at 06:00:25PM +0200, Michal Hocko wrote:
> On Tue 21-07-20 10:33:25, Alan Stern wrote:
> [...]
>
> Thanks a lot for your analysis. The laptop is slowly dying, so this could
> be related.
>
> > So yes, this looks like a hardware design error. Turning off
> > autosuspend by writing to the sysfs power/control file is probably the
> > best way to handle the problem.
>
> I can use that workaround. But it seems that this also prevents me from
> suspending the machine to RAM, which fails with
> PM: Some devices failed to suspend, or early wake event detected
>
> :/
>
> If there is any workaround for that then I would be really happy, but
> considering the overall state of the machine I suspect this is not the
> case.

As far as I know there is no workaround.

Of course, you could build your own personal kernel with the
over-current test removed from the driver. But keeping that up-to-date
would be a nuisance.

> Thanks again for your great help!

You're welcome.

Alan Stern

2020-08-29 09:51:17

by Salvatore Bonaccorso

Subject: Re: kworker/0:3+pm hogging CPU

Hi Alan,

I'm following up on this thread because a user in Debian (Dirk, Cc'ed)
has encountered the same or a similar issue:

On Tue, Jul 21, 2020 at 10:33:25AM -0400, Alan Stern wrote:
> On Tue, Jul 21, 2020 at 07:59:17AM +0200, Michal Hocko wrote:
> > > Sorry, my mistake. The module name needs to be "xhci_hcd" with an '_'
> > > character, not a '-' character -- the same as what shows up in the lsmod
> > > output.
> >
> >
> > [14766.973734] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
> > [14766.973738] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
> > [14766.973742] xhci_hcd 0000:00:14.0: Get port status 2-3 read: 0xe0002a0, return 0x2a0
> > [14766.973746] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0xe0002a0, return 0x2a0
> > [14766.973750] xhci_hcd 0000:00:14.0: Get port status 2-5 read: 0xe0002a0, return 0x2a0
> > [14766.973754] xhci_hcd 0000:00:14.0: Get port status 2-6 read: 0xe0002a0, return 0x2a0
> > [14766.973759] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
> > [14766.973763] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
>
> According to the xHCI specification, those 02a0 values are normal and
> the 0088 values indicate the port is disabled and has an over-current
> condition. I don't know about the e000 bits in the upper part of the
> word; according to my copy of the spec those bits should be 0.
>
> If your machine has only two physical SuperSpeed (USB-3) ports then
> perhaps the other four ports are internally wired in a way that creates
> a permanent over-current indication.
>
> > [14766.973771] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 0 status = 0xe000088
> > [14766.973780] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 1 status = 0xe000088
> > [14766.973789] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 2 status = 0xe0002a0
> > [14766.973798] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 3 status = 0xe0002a0
> > [14766.973807] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 4 status = 0xe0002a0
> > [14766.973816] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 5 status = 0xe0002a0
> > [14766.973830] xhci_hcd 0000:00:14.0: Bus suspend bailout, port over-current detected
> >
> > Repeating again and again. The last message suggests a HW problem? But
> > why does the kernel try the same thing over and over?
>
> Because over-current is supposed to be a transient condition that goes
> away quickly. It means there's a short circuit or something similar.

Dirk experienced the same issue, and enabling dynamic debugging showed
a similar pattern. His dmesg excerpt is attached.

The Debian report is at https://bugs.debian.org/966703

What could be tracked down is that the issue has been exposed since
e9fb08d617bf ("xhci: prevent bus suspend if a roothub port detected a
over-current condition"), which was applied in 5.7-rc3 and backported
to several stable releases (v5.6.8, v5.4.36 and v4.19.119).

Dirk found additionally:

> I just found out that if none of the two USB ports is connected, there
> are two kworker processes with permanently high CPU load; if one USB
> port is connected and the other not, there is one such kworker process;
> and if both USB ports are connected, there is no kworker process with
> high CPU load.
> I think this supports your suspicion that these kworker processes are
> connected with the overcurrent condition for both USB ports that I also
> see in the dmesg output.

Reverting the above commit covers up the problem again. But I'm not
experienced enough here to say whether this is a HW issue or whether the
kernel should handle the situation differently. Is there anything else
Dirk can provide?

Regards,
Salvatore


Attachments:
(No filename) (3.80 kB)
dmesg.txt.gz (27.82 kB)

2020-08-29 16:02:41

by Alan Stern

Subject: Re: kworker/0:3+pm hogging CPU

On Sat, Aug 29, 2020 at 11:50:03AM +0200, Salvatore Bonaccorso wrote:
> Hi Alan,
>
> I'm following up on this thread because a user in Debian (Dirk, Cc'ed)
> has encountered the same or a similar issue:
>
> On Tue, Jul 21, 2020 at 10:33:25AM -0400, Alan Stern wrote:
> > On Tue, Jul 21, 2020 at 07:59:17AM +0200, Michal Hocko wrote:
> > > > Sorry, my mistake. The module name needs to be "xhci_hcd" with an '_'
> > > > character, not a '-' character -- the same as what shows up in the lsmod
> > > > output.
> > >
> > >
> > > [14766.973734] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
> > > [14766.973738] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
> > > [14766.973742] xhci_hcd 0000:00:14.0: Get port status 2-3 read: 0xe0002a0, return 0x2a0
> > > [14766.973746] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0xe0002a0, return 0x2a0
> > > [14766.973750] xhci_hcd 0000:00:14.0: Get port status 2-5 read: 0xe0002a0, return 0x2a0
> > > [14766.973754] xhci_hcd 0000:00:14.0: Get port status 2-6 read: 0xe0002a0, return 0x2a0
> > > [14766.973759] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
> > > [14766.973763] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
> >
> > According to the xHCI specification, those 02a0 values are normal and
> > the 0088 values indicate the port is disabled and has an over-current
> > condition. I don't know about the e000 bits in the upper part of the
> > word; according to my copy of the spec those bits should be 0.
> >
> > If your machine has only two physical SuperSpeed (USB-3) ports then
> > perhaps the other four ports are internally wired in a way that creates
> > a permanent over-current indication.
> >
> > > [14766.973771] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 0 status = 0xe000088
> > > [14766.973780] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 1 status = 0xe000088
> > > [14766.973789] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 2 status = 0xe0002a0
> > > [14766.973798] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 3 status = 0xe0002a0
> > > [14766.973807] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 4 status = 0xe0002a0
> > > [14766.973816] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 5 status = 0xe0002a0
> > > [14766.973830] xhci_hcd 0000:00:14.0: Bus suspend bailout, port over-current detected
> > >
> > > Repeating again and again. The last message suggests a HW problem? But
> > > why does the kernel try the same thing over and over?
> >
> > Because over-current is supposed to be a transient condition that goes
> > away quickly. It means there's a short circuit or something similar.
>
> Dirk experienced the same issue, and enabling dynamic debugging showed
> a similar pattern. His dmesg excerpt is attached.
>
> The Debian report is at https://bugs.debian.org/966703
>
> What could be tracked down is that the issue has been exposed since
> e9fb08d617bf ("xhci: prevent bus suspend if a roothub port detected a
> over-current condition"), which was applied in 5.7-rc3 and backported
> to several stable releases (v5.6.8, v5.4.36 and v4.19.119).
>
> Dirk found additionally:
>
> > I just found out that if none of the two USB ports is connected, there
> > are two kworker processes with permanently high CPU load; if one USB
> > port is connected and the other not, there is one such kworker process;
> > and if both USB ports are connected, there is no kworker process with
> > high CPU load.
> > I think this supports your suspicion that these kworker processes are
> > connected with the overcurrent condition for both USB ports that I also
> > see in the dmesg output.
>
> Reverting the above commit covers up the problem again. But I'm not
> experienced enough here to say whether this is a HW issue or whether the
> kernel should handle the situation differently. Is there anything else
> Dirk can provide?

It is undoubtedly a hardware issue. The dmesg extract shows that ports
1-10, 1-11, and 2-5 (which is probably the same port as one of the
others) have overcurrent conditions; I'm guessing that these are the
ports which have external connections.

What were the devices Dirk plugged in that got rid of the kworker
processes? In particular, were they USB-2 or USB-3? (The dmesg log for
when the devices were first attached can answer these questions.)

As far as I know, there is no way for the kernel to work around this
problem. Preventing the controller from going into runtime suspend is
probably the best solution.

Perhaps Mathias (the xhci-hcd maintainer) will have more suggestions.

Alan Stern

2020-08-31 11:34:58

by Mathias Nyman

Subject: Re: kworker/0:3+pm hogging CPU

On 29.8.2020 18.59, Alan Stern wrote:
> On Sat, Aug 29, 2020 at 11:50:03AM +0200, Salvatore Bonaccorso wrote:
>> Hi Alan,
>>
>> I'm following up on this thread because a user in Debian (Dirk, Cc'ed)
>> has encountered the same or a similar issue:
>>
>> On Tue, Jul 21, 2020 at 10:33:25AM -0400, Alan Stern wrote:
>>> On Tue, Jul 21, 2020 at 07:59:17AM +0200, Michal Hocko wrote:
>>>>> Sorry, my mistake. The module name needs to be "xhci_hcd" with an '_'
>>>>> character, not a '-' character -- the same as what shows up in the lsmod
>>>>> output.
>>>>
>>>>
>>>> [14766.973734] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
>>>> [14766.973738] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
>>>> [14766.973742] xhci_hcd 0000:00:14.0: Get port status 2-3 read: 0xe0002a0, return 0x2a0
>>>> [14766.973746] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0xe0002a0, return 0x2a0
>>>> [14766.973750] xhci_hcd 0000:00:14.0: Get port status 2-5 read: 0xe0002a0, return 0x2a0
>>>> [14766.973754] xhci_hcd 0000:00:14.0: Get port status 2-6 read: 0xe0002a0, return 0x2a0
>>>> [14766.973759] xhci_hcd 0000:00:14.0: Get port status 2-1 read: 0xe000088, return 0x88
>>>> [14766.973763] xhci_hcd 0000:00:14.0: Get port status 2-2 read: 0xe000088, return 0x88
>>>
>>> According to the xHCI specification, those 02a0 values are normal and
>>> the 0088 values indicate the port is disabled and has an over-current
>>> condition. I don't know about the e000 bits in the upper part of the
>>> word; according to my copy of the spec those bits should be 0.

That's a 0x0e000088 where the 0e00 bits are the wake bits. Leading zeroes are not shown.

>>>
>>> If your machine has only two physical SuperSpeed (USB-3) ports then
>>> perhaps the other four ports are internally wired in a way that creates
>>> a permanent over-current indication.
>>>
>>>> [14766.973771] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 0 status = 0xe000088
>>>> [14766.973780] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 1 status = 0xe000088
>>>> [14766.973789] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 2 status = 0xe0002a0
>>>> [14766.973798] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 3 status = 0xe0002a0
>>>> [14766.973807] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 4 status = 0xe0002a0
>>>> [14766.973816] xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 5 status = 0xe0002a0
>>>> [14766.973830] xhci_hcd 0000:00:14.0: Bus suspend bailout, port over-current detected
>>>>
>>>> Repeating again and again. The last message suggests a HW problem? But
>>>> why does the kernel try the same thing over and over?
>>>
>>> Because over-current is supposed to be a transient condition that goes
>>> away quickly. It means there's a short circuit or something similar.
>>
>> Dirk experienced the same issue, and enabling dynamic debugging showed
>> a similar pattern. His dmesg excerpt is attached.
>>
>> The Debian report is at https://bugs.debian.org/966703
>>
>> What could be tracked down is that the issue has been exposed since
>> e9fb08d617bf ("xhci: prevent bus suspend if a roothub port detected a
>> over-current condition"), which was applied in 5.7-rc3 and backported
>> to several stable releases (v5.6.8, v5.4.36 and v4.19.119).
>>
>> Dirk found additionally:
>>
>>> I just found out that if none of the two USB ports is connected, there
>>> are two kworker processes with permanently high CPU load; if one USB
>>> port is connected and the other not, there is one such kworker process;
>>> and if both USB ports are connected, there is no kworker process with
>>> high CPU load.
>>> I think this supports your suspicion that these kworker processes are
>>> connected with the overcurrent condition for both USB ports that I also
>>> see in the dmesg output.
>>
>> Reverting the above commit covers up the problem again. But I'm not
>> experienced enough here to say whether this is a HW issue or whether the
>> kernel should handle the situation differently. Is there anything else
>> Dirk can provide?
>
> It is undoubtedly a hardware issue. The dmesg extract shows that ports
> 1-10, 1-11, and 2-5 (which is probably the same port as one of the
> others) have overcurrent conditions; I'm guessing that these are the
> ports which have external connections.
>
> What were the devices Dirk plugged in that got rid of the kworker
> processes? In particular, were they USB-2 or USB-3? (The dmesg log for
> when the devices were first attached can answer these questions.)
>
> As far as I know, there is no way for the kernel to work around this
> problem. Preventing the controller from going into runtime suspend is
> probably the best solution.
>
> Perhaps Mathias (the xhci-hcd maintainer) will have more suggestions.

In the original case the over-current condition was always quickly
resolved and returning -EBUSY did the trick.

xhci specs say that the over-current active bit shall be cleared by hardware
once the over-current condition is no longer present, so there's not much the
driver can do. (xhci 5.4.8 - Port status and control register)

I can't come up with any good solution to this right now.
Only bad ideas such as:
a. Add a sleep to the over-current case; this doesn't solve anything
other than the ~100% CPU hogging part of the problem.
b. After some retry limit of returning -EBUSY, return success and just
hope for the best that the xHC won't hang in this case.

Not sure how much additional complex code is worth adding because of a
couple of cases that seem to be broken hardware. If we get more cases, or can
point to some specific setup with a broken design, we can create a quirk for it.

-Mathias

2020-09-01 16:46:36

by Michal Hocko

Subject: Re: kworker/0:3+pm hogging CPU

On Mon 31-08-20 14:37:10, Mathias Nyman wrote:
[...]
> I can't come up with any good solution to this right now.
> Only bad ideas such as:
> a. Add a sleep to the over-current case; this doesn't solve anything
> other than the ~100% CPU hogging part of the problem.

This sounds like the better option from the user-space point of view. I do
not have any insight into what kind of other side effects this might have,
so I didn't dare to try that on my piece of (broken) HW. I do not see
the problem all the time and I plan to replace it soon anyway.

Considering that tweaking the power management helps, maybe that could be
done automagically after many consecutive failures.

Just my 2c
--
Michal Hocko
SUSE Labs

2020-09-04 16:01:09

by Dirk Kostrewa

Subject: Re: kworker/0:3+pm hogging CPU

Hi,

meanwhile, I convinced Dell that I have a hardware issue (and not a
Linux issue), and Dell has replaced the mainboard of my laptop. After
that, both the USB over-current kernel messages and the kworker
processes with permanently high CPU load are gone. So, this was indeed a
hardware issue!

Many thanks for your feedback and help!

Best regards,

Dirk.

On 01.09.20 at 17:27, Michal Hocko wrote:

> On Mon 31-08-20 14:37:10, Mathias Nyman wrote:
> [...]
>> I can't come up with any good solution to this right now.
>> Only bad ideas such as:
>> a. Add a sleep to the over-current case; this doesn't solve anything
>> other than the ~100% CPU hogging part of the problem.
> This sounds like the better option from the user-space point of view. I do
> not have any insight into what kind of other side effects this might have,
> so I didn't dare to try that on my piece of (broken) HW. I do not see
> the problem all the time and I plan to replace it soon anyway.
>
> Considering that tweaking the power management helps, maybe that could be
> done automagically after many consecutive failures.
>
> Just my 2c