2024-03-04 16:41:26

by Alan Stern

[permalink] [raw]
Subject: Re: [Bug] INFO: task hung in hub_activate

On Mon, Mar 04, 2024 at 08:10:02PM +0800, Sam Sun wrote:
> Dear developers and maintainers,
>
> We encountered a task hung in function hub_activate(). It was reported
> before by Syzbot several years ago
> (https://groups.google.com/g/syzkaller-lts-bugs/c/_komEgHj03Y/m/rbcVKyLXBwAJ),
> but no repro at that time. We have a C repro this time and kernel
> config is attached to this email. The bug report is listed below.

Never mind the rest of the kernel log; I figured out what's going on.
Here are the important parts:

> ppid:8106 flags:0x00000006
> Call Trace:
> <TASK>
> context_switch kernel/sched/core.c:5376 [inline]
> __schedule+0xcea/0x59e0 kernel/sched/core.c:6688
> __schedule_loop kernel/sched/core.c:6763 [inline]
> schedule+0xe9/0x270 kernel/sched/core.c:6778
> schedule_preempt_disabled+0x13/0x20 kernel/sched/core.c:6835
> __mutex_lock_common kernel/locking/mutex.c:679 [inline]
> __mutex_lock+0x509/0x940 kernel/locking/mutex.c:747
> device_lock include/linux/device.h:992 [inline]
> usb_deauthorize_interface+0x4d/0x130 drivers/usb/core/message.c:1789
> interface_authorized_store+0xaf/0x110 drivers/usb/core/sysfs.c:1178
> dev_attr_store+0x54/0x80 drivers/base/core.c:2366

usb_deauthorize_interface() starts by calling device_lock() on the
usb_interface's parent usb_device.

> ppid:8109 flags:0x00004006
> Call Trace:
> <TASK>
> context_switch kernel/sched/core.c:5376 [inline]
> __schedule+0xcea/0x59e0 kernel/sched/core.c:6688
> __schedule_loop kernel/sched/core.c:6763 [inline]
> schedule+0xe9/0x270 kernel/sched/core.c:6778
> kernfs_drain+0x36c/0x550 fs/kernfs/dir.c:505
> __kernfs_remove+0x280/0x650 fs/kernfs/dir.c:1465
> kernfs_remove_by_name_ns+0xb4/0x130 fs/kernfs/dir.c:1673
> kernfs_remove_by_name include/linux/kernfs.h:623 [inline]
> remove_files+0x96/0x1c0 fs/sysfs/group.c:28
> sysfs_remove_group+0x8b/0x180 fs/sysfs/group.c:292
> sysfs_remove_groups fs/sysfs/group.c:316 [inline]
> sysfs_remove_groups+0x60/0xa0 fs/sysfs/group.c:308
> device_remove_groups drivers/base/core.c:2734 [inline]
> device_remove_attrs+0x192/0x290 drivers/base/core.c:2909
> device_del+0x391/0xa30 drivers/base/core.c:3813
> usb_disable_device+0x360/0x7b0 drivers/usb/core/message.c:1416
> usb_set_configuration+0x1243/0x1c40 drivers/usb/core/message.c:2063
> usb_deauthorize_device+0xe4/0x110 drivers/usb/core/hub.c:2638
> authorized_store+0x122/0x140 drivers/usb/core/sysfs.c:747
> dev_attr_store+0x54/0x80 drivers/base/core.c:2366

Among other things, usb_disable_device() calls device_del() for the
usb_device's child interfaces.

For brevity, let A be the parent usb_device and let B be the child
usb_interface. Then in broad terms, we have:

CPU 0 CPU 1
----------------------------- ----------------------------
usb_deauthorize_device(A)
device_lock(A) usb_deauthorize_interface(B)
usb_set_configuration(A, -1) device_lock(A)
usb_disable_device(B)
device_del(B)
sysfs_remove_group(B, intf_attrs)

The problem now is:

1. The kernfs core (kernfs_drain) on CPU 0 can't remove the
intf_attrs sysfs attribute group while CPU 1 is in the middle
of running a callback routine for one of the attribute files
in that group.

2. The callback routine on CPU 1 can't grab A's lock while CPU 0
is holding it.

Result: deadlock.

This seems to be the only case where an interface sysfs callback
routine tries to acquire the parent device's lock. That lock is
needed here because when an interface is deauthorized, the kernel has
to unbind the driver for that interface -- and binding or unbinding a
USB interface driver requires that the parent device's lock be held.

Three ideas stand out. First, the device_lock() call should be
interruptible, because it is called when a user process writes to the
"authorized" attribute file. But that alone won't fix the problem.

Second, we could avoid the deadlock by adding a timeout to this
device_lock() call. But we probably don't want a deauthorize
operation to fail because of a timeout from a contested lock.

Third, this must be a generic problem. It will occur any time a sysfs
attribute callback tries to lock its device while another process is
trying to unregister that device.

We faced this sort of problem some years ago when we were worrying
about "suicidal" attributes -- ones which would unregister their own
devices. I don't remember what the fix was or how it worked. But we
need something like it here.

Greg and Tejun, any ideas? Is it possible somehow for an attribute file
to be removed while its callback is still running?

Alan Stern