+ [email protected]
+ [email protected]
+ [email protected]
and changed subject
On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> > Hi all,
> >
> > over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> > called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> > Examples are [1] [2] [3]. A typical backtrace is:
> >
> > <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> > <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> > <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > <6>[ 3490.516680] PGD 0 P4D 0
> > <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> > <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> > <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> > <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> > <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> > <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> > <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> > <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> > <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> > <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> > <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> > <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> > <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> > <4>[ 3490.516745] Call Trace:
> > <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> > <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> > <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> > <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> > <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> > <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> > <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> > <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> > <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> > <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> > <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> > <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> > <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> > <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> > <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> > <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> > <4>[ 3490.516858] do_syscall_64+0x54/0xde
> >
> > I have been able to reproduce the problem on a Chromebook by strategically placing
> > msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> > problem boils down to lack of protection against device removal in usb_set_interface()
> > [and/or possibly other callers of usb_ifnum_to_if()].
> >
> > Sequence of events is roughly as follows:
> >
> > - usb_set_interface() is called and proceeds to some point, possibly to
> > mutex_lock(hcd->bandwidth_mutex);
> > - Device removal event is detected, and usb_disable_device() is called
>
> At this point all interface drivers get unbound (their disconnect
> routines are called).
>
> > - usb_disable_device() starts removing actconfig data. It has removed
> > and cleared dev->actconfig->interface[i], but not dev->actconfig
> > - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> > usb_ifnum_to_if()
> > - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> > dev->actconfig->interface[i] is NULL
> > - crash
> >
> > Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> > in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> > still be racy since there is still no lock against device removal. I have not tried
> > calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
>
> As far as I know, protecting against these races is the responsibility
> of the USB interface drivers. They must make sure that their disconnect
> routines block until all outstanding calls to usb_set_interface return
> (in fact, until all outstanding device accesses have finished).
>
> For instance, in the log extract you showed, it's obvious that the
> uvc_start_streaming routine was running after the disconnect routine had
> returned, which looks like a bug in itself: Once the disconnect routine
> returns, the driver is not supposed to try to access the device at all
> because some other driver may now be bound to it.
>
> We can't just call usb_lock_device from within usb_set_interface,
> because usb_set_interface is often called with that lock already held.
>
I had a closer look into the uvcvideo driver and compared it to other usb
drivers, including drivers in drivers/media/usb/ which connect to the video
subsystem.
The usbvideo driver lacks protection against calls to uvc_disconnect() while
calls into file operations are ongoing. This is pretty widespread, and not
even limited to file operations (for example, there is a worker which is only
canceled in uvc_delete, not in ucv_disconnect). The existing protection only
ensures that no file operations are started after the call to ucv_disconnect,
but that is insufficient.
Other drivers do have that protection and make sure that no usb operations
can happen after the disconnect call.
The only remedy I can see is to rework the usbvideo driver and add the
necessary protections. At first glance, it looks like this may be a
substantial amount of work. I'd sign up for that, but before I start,
I would like to get input from the usbvideo community. Is such an effort
already going on ? If yes, how can I help ? If not, is the problem
understood and accepted ? Are there any ideas on how to solve it ?
Thanks,
Guenter
Hi Guenter,
CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
file operations and disconnect in V4L2.
On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
> + [email protected]
> + [email protected]
> + [email protected]
>
> and changed subject
>
> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> > > Hi all,
> > >
> > > over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> > > called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> > > Examples are [1] [2] [3]. A typical backtrace is:
> > >
> > > <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> > > <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> > > <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > <6>[ 3490.516680] PGD 0 P4D 0
> > > <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> > > <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> > > <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> > > <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> > > <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> > > <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> > > <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> > > <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> > > <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> > > <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> > > <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> > > <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> > > <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> > > <4>[ 3490.516745] Call Trace:
> > > <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> > > <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> > > <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> > > <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> > > <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> > > <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> > > <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> > > <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> > > <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> > > <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> > > <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> > > <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> > > <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> > > <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> > > <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> > > <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> > > <4>[ 3490.516858] do_syscall_64+0x54/0xde
> > >
> > > I have been able to reproduce the problem on a Chromebook by strategically placing
> > > msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> > > problem boils down to lack of protection against device removal in usb_set_interface()
> > > [and/or possibly other callers of usb_ifnum_to_if()].
> > >
> > > Sequence of events is roughly as follows:
> > >
> > > - usb_set_interface() is called and proceeds to some point, possibly to
> > > mutex_lock(hcd->bandwidth_mutex);
> > > - Device removal event is detected, and usb_disable_device() is called
> >
> > At this point all interface drivers get unbound (their disconnect
> > routines are called).
> >
> > > - usb_disable_device() starts removing actconfig data. It has removed
> > > and cleared dev->actconfig->interface[i], but not dev->actconfig
> > > - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> > > usb_ifnum_to_if()
> > > - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> > > dev->actconfig->interface[i] is NULL
> > > - crash
> > >
> > > Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> > > in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> > > still be racy since there is still no lock against device removal. I have not tried
> > > calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
> >
> > As far as I know, protecting against these races is the responsibility
> > of the USB interface drivers. They must make sure that their disconnect
> > routines block until all outstanding calls to usb_set_interface return
> > (in fact, until all outstanding device accesses have finished).
> >
> > For instance, in the log extract you showed, it's obvious that the
> > uvc_start_streaming routine was running after the disconnect routine had
> > returned, which looks like a bug in itself: Once the disconnect routine
> > returns, the driver is not supposed to try to access the device at all
> > because some other driver may now be bound to it.
> >
> > We can't just call usb_lock_device from within usb_set_interface,
> > because usb_set_interface is often called with that lock already held.
> >
> I had a closer look into the uvcvideo driver and compared it to other usb
> drivers, including drivers in drivers/media/usb/ which connect to the video
> subsystem.
>
> The usbvideo driver lacks protection against calls to uvc_disconnect() while
Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
have been called usbvideo if the former hadn't already been in use.
> calls into file operations are ongoing. This is pretty widespread, and not
> even limited to file operations (for example, there is a worker which is only
> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
> ensures that no file operations are started after the call to ucv_disconnect,
> but that is insufficient.
>
> Other drivers do have that protection and make sure that no usb operations
> can happen after the disconnect call.
>
> The only remedy I can see is to rework the usbvideo driver and add the
> necessary protections. At first glance, it looks like this may be a
> substantial amount of work. I'd sign up for that, but before I start,
> I would like to get input from the usbvideo community. Is such an effort
> already going on ? If yes, how can I help ? If not, is the problem
> understood and accepted ? Are there any ideas on how to solve it ?
This is something that has been discussed before, and needs to be solved
in the V4L2 framework itself, not in individual drivers. Not only would
this avoid rolling out the same code manually everywhere (in different
incorrect ways, as races are difficult to solve and implementations are
more often wrong than right), but it will also avoid similar issues for
non-USB devices.
It shouldn't take more than two flags (to track user-space operations in
progress and disconnection), a spinlock and a wait queue entry. I'm not
sure if someone has already given it a try, and don't recall why this
hasn't been done yet, as it should be fairly straightforward.
On the UVC side, the work queue probably has to be flushed in
uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
Please make sure to look for potential race conditions between the URB
completion handler and the .disconnect() handler (they shouldn't be any,
but I haven't checked lately myself).
--
Regards,
Laurent Pinchart
On 8/16/20 5:18 AM, Laurent Pinchart wrote:
> Hi Guenter,
>
> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
> file operations and disconnect in V4L2.
>
> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
>> + [email protected]
>> + [email protected]
>> + [email protected]
>>
>> and changed subject
>>
>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
>>>> Hi all,
>>>>
>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
>>>> Examples are [1] [2] [3]. A typical backtrace is:
>>>>
>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>>>> <6>[ 3490.516680] PGD 0 P4D 0
>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
>>>> <4>[ 3490.516745] Call Trace:
>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
>>>>
>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
>>>> problem boils down to lack of protection against device removal in usb_set_interface()
>>>> [and/or possibly other callers of usb_ifnum_to_if()].
>>>>
>>>> Sequence of events is roughly as follows:
>>>>
>>>> - usb_set_interface() is called and proceeds to some point, possibly to
>>>> mutex_lock(hcd->bandwidth_mutex);
>>>> - Device removal event is detected, and usb_disable_device() is called
>>>
>>> At this point all interface drivers get unbound (their disconnect
>>> routines are called).
>>>
>>>> - usb_disable_device() starts removing actconfig data. It has removed
>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
>>>> usb_ifnum_to_if()
>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
>>>> dev->actconfig->interface[i] is NULL
>>>> - crash
>>>>
>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
>>>> still be racy since there is still no lock against device removal. I have not tried
>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
>>>
>>> As far as I know, protecting against these races is the responsibility
>>> of the USB interface drivers. They must make sure that their disconnect
>>> routines block until all outstanding calls to usb_set_interface return
>>> (in fact, until all outstanding device accesses have finished).
>>>
>>> For instance, in the log extract you showed, it's obvious that the
>>> uvc_start_streaming routine was running after the disconnect routine had
>>> returned, which looks like a bug in itself: Once the disconnect routine
>>> returns, the driver is not supposed to try to access the device at all
>>> because some other driver may now be bound to it.
>>>
>>> We can't just call usb_lock_device from within usb_set_interface,
>>> because usb_set_interface is often called with that lock already held.
>>>
>> I had a closer look into the uvcvideo driver and compared it to other usb
>> drivers, including drivers in drivers/media/usb/ which connect to the video
>> subsystem.
>>
>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
>
> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
> have been called usbvideo if the former hadn't already been in use.
>
Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
>> calls into file operations are ongoing. This is pretty widespread, and not
>> even limited to file operations (for example, there is a worker which is only
>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
>> ensures that no file operations are started after the call to ucv_disconnect,
>> but that is insufficient.
>>
>> Other drivers do have that protection and make sure that no usb operations
>> can happen after the disconnect call.
>>
>> The only remedy I can see is to rework the usbvideo driver and add the
>> necessary protections. At first glance, it looks like this may be a
>> substantial amount of work. I'd sign up for that, but before I start,
>> I would like to get input from the usbvideo community. Is such an effort
>> already going on ? If yes, how can I help ? If not, is the problem
>> understood and accepted ? Are there any ideas on how to solve it ?
>
> This is something that has been discussed before, and needs to be solved
> in the V4L2 framework itself, not in individual drivers. Not only would
> this avoid rolling out the same code manually everywhere (in different
> incorrect ways, as races are difficult to solve and implementations are
> more often wrong than right), but it will also avoid similar issues for
> non-USB devices.
>
You mean code that ensures that no user-space v4l2 operation is in progress
after video_device_unregister / v4l2_device_unregister return ? I agree,
that would simplify the necessary changes on the uvc side.
I actually came from the other side - I assumed that there is a reason
that is not already the case, and that the problem therefore has to be
resolved on the driver side.
So I guess the next question is: Is this already being addressed on the
v4l2 side ?
> It shouldn't take more than two flags (to track user-space operations in
> progress and disconnection), a spinlock and a wait queue entry. I'm not
> sure if someone has already given it a try, and don't recall why this
> hasn't been done yet, as it should be fairly straightforward.
>
> On the UVC side, the work queue probably has to be flushed in
> uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
> Please make sure to look for potential race conditions between the URB
> completion handler and the .disconnect() handler (they shouldn't be any,
> but I haven't checked lately myself).
>
My current solution for this problem is to call uvc_ctrl_cleanup_device()
from uvc_disconnect(), after uvc_unregister_video(). An alternative might
be to add a uvc_ctrl_stop_device() function which would just cancel the
worker.
Thanks,
Guenter
Hi Guenter,
On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
> > Hi Guenter,
> >
> > CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
> > file operations and disconnect in V4L2.
> >
> > On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
> >> + [email protected]
> >> + [email protected]
> >> + [email protected]
> >>
> >> and changed subject
> >>
> >> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> >>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> >>>> Hi all,
> >>>>
> >>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> >>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> >>>> Examples are [1] [2] [3]. A typical backtrace is:
> >>>>
> >>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> >>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> >>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> >>>> <6>[ 3490.516680] PGD 0 P4D 0
> >>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> >>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> >>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> >>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> >>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> >>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> >>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> >>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> >>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> >>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> >>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> >>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> >>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> >>>> <4>[ 3490.516745] Call Trace:
> >>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> >>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> >>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> >>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> >>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> >>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> >>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> >>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> >>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> >>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> >>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> >>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> >>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> >>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> >>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> >>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> >>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
> >>>>
> >>>> I have been able to reproduce the problem on a Chromebook by strategically placing
> >>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> >>>> problem boils down to lack of protection against device removal in usb_set_interface()
> >>>> [and/or possibly other callers of usb_ifnum_to_if()].
> >>>>
> >>>> Sequence of events is roughly as follows:
> >>>>
> >>>> - usb_set_interface() is called and proceeds to some point, possibly to
> >>>> mutex_lock(hcd->bandwidth_mutex);
> >>>> - Device removal event is detected, and usb_disable_device() is called
> >>>
> >>> At this point all interface drivers get unbound (their disconnect
> >>> routines are called).
> >>>
> >>>> - usb_disable_device() starts removing actconfig data. It has removed
> >>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
> >>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> >>>> usb_ifnum_to_if()
> >>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> >>>> dev->actconfig->interface[i] is NULL
> >>>> - crash
> >>>>
> >>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> >>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> >>>> still be racy since there is still no lock against device removal. I have not tried
> >>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
> >>>
> >>> As far as I know, protecting against these races is the responsibility
> >>> of the USB interface drivers. They must make sure that their disconnect
> >>> routines block until all outstanding calls to usb_set_interface return
> >>> (in fact, until all outstanding device accesses have finished).
> >>>
> >>> For instance, in the log extract you showed, it's obvious that the
> >>> uvc_start_streaming routine was running after the disconnect routine had
> >>> returned, which looks like a bug in itself: Once the disconnect routine
> >>> returns, the driver is not supposed to try to access the device at all
> >>> because some other driver may now be bound to it.
> >>>
> >>> We can't just call usb_lock_device from within usb_set_interface,
> >>> because usb_set_interface is often called with that lock already held.
> >>>
> >> I had a closer look into the uvcvideo driver and compared it to other usb
> >> drivers, including drivers in drivers/media/usb/ which connect to the video
> >> subsystem.
> >>
> >> The usbvideo driver lacks protection against calls to uvc_disconnect() while
> >
> > Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
> > have been called usbvideo if the former hadn't already been in use.
>
> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
No worries.
> >> calls into file operations are ongoing. This is pretty widespread, and not
> >> even limited to file operations (for example, there is a worker which is only
> >> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
> >> ensures that no file operations are started after the call to ucv_disconnect,
> >> but that is insufficient.
> >>
> >> Other drivers do have that protection and make sure that no usb operations
> >> can happen after the disconnect call.
> >>
> >> The only remedy I can see is to rework the usbvideo driver and add the
> >> necessary protections. At first glance, it looks like this may be a
> >> substantial amount of work. I'd sign up for that, but before I start,
> >> I would like to get input from the usbvideo community. Is such an effort
> >> already going on ? If yes, how can I help ? If not, is the problem
> >> understood and accepted ? Are there any ideas on how to solve it ?
> >
> > This is something that has been discussed before, and needs to be solved
> > in the V4L2 framework itself, not in individual drivers. Not only would
> > this avoid rolling out the same code manually everywhere (in different
> > incorrect ways, as races are difficult to solve and implementations are
> > more often wrong than right), but it will also avoid similar issues for
> > non-USB devices.
>
> You mean code that ensures that no user-space v4l2 operation is in progress
> after video_device_unregister / v4l2_device_unregister return ? I agree,
> that would simplify the necessary changes on the uvc side.
I was thinking about adding a new function to be called from the
disconnect handler to implement the wait on end of userspace access, but
video_device_unregister() seems an even better idea.
v4l2_device_unregister() is probably not very useful as v4l2_device
isn't exposed to userspace, only video_device is (and v4l2_subdev and
media_device, but that's a different story, although probably still an
issue for the latter in the UVC driver).
We also have a v4l2_device_disconnect() function which is supposed to
handle hot-pluggable device disconnection, but it's fairly useless (I'd
even say harmful as it gives the illusion that hotplugging is correctly
handled, while in reality the media subsystem is plagged by hot-unplug
issues :-S).
> I actually came from the other side - I assumed that there is a reason
> that is not already the case, and that the problem therefore has to be
> resolved on the driver side.
>
> So I guess the next question is: Is this already being addressed on the
> v4l2 side ?
I'm not aware of anyone working on this.
> > It shouldn't take more than two flags (to track user-space operations in
> > progress and disconnection), a spinlock and a wait queue entry. I'm not
> > sure if someone has already given it a try, and don't recall why this
> > hasn't been done yet, as it should be fairly straightforward.
> >
> > On the UVC side, the work queue probably has to be flushed in
> > uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
> > Please make sure to look for potential race conditions between the URB
> > completion handler and the .disconnect() handler (they shouldn't be any,
> > but I haven't checked lately myself).
>
> My current solution for this problem is to call uvc_ctrl_cleanup_device()
> from uvc_disconnect(), after uvc_unregister_video().
I'd rather avoid that, as the cleanup functions in the UVC driver are
generally meant to free memory when the last user disappears. While no
new userspace operation will be started after disconnection once the
above fix will be in place, there's one operation we can't avoid: the
file release. This will access some of the memory allocated by the
driver, and while the current implementation probably doesn't access in
.release() any memory freed by uvc_ctrl_cleanup_device(), I think it's a
good practice to only shut down the userspace API in .disconnect(), and
free memory when the last reference is released.
> An alternative might
> be to add a uvc_ctrl_stop_device() function which would just cancel the
> worker.
I think that would be best. Should stream->async_wq (in uvc_video.c) be
similarly flushed ? The driver does so in stream->async_wq(), called
from uvc_video_stop_transfer(), itself called from
uvc_video_stop_streaming() (among other places, that are either error
paths or system suspend handling). The call stack goes to
uvc_stop_streaming(), and, through the videobuf2 helpers, to
vb2_queue_release() called by uvc_queue_release() itself called by
uvc_v4l2_release() (in the non-disconnect case,
uvc_video_stop_streaming() will be called through videobuf2 by
uvc_queue_streamoff(), in response to a VIDIOC_STREAMOFF ioctl). We thus
flush the workqueue too late, and also access the device in
uvc_video_stop_streaming() long after .disconnect() returns.
I think uvc_video_stop_streaming() could be called in uvc_disconnect()
after uvc_unregister_video().
--
Regards,
Laurent Pinchart
On 17/08/2020 01:51, Laurent Pinchart wrote:
> Hi Guenter,
>
> On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
>> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
>>> Hi Guenter,
>>>
>>> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
>>> file operations and disconnect in V4L2.
>>>
>>> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
>>>> + [email protected]
>>>> + [email protected]
>>>> + [email protected]
>>>>
>>>> and changed subject
>>>>
>>>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
>>>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
>>>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
>>>>>> Examples are [1] [2] [3]. A typical backtrace is:
>>>>>>
>>>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
>>>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
>>>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>>>>>> <6>[ 3490.516680] PGD 0 P4D 0
>>>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
>>>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
>>>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
>>>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
>>>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
>>>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
>>>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
>>>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
>>>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
>>>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
>>>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
>>>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
>>>>>> <4>[ 3490.516745] Call Trace:
>>>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
>>>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
>>>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
>>>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
>>>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
>>>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
>>>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
>>>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
>>>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
>>>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
>>>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
>>>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
>>>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
>>>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
>>>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
>>>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
>>>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
>>>>>>
>>>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
>>>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
>>>>>> problem boils down to lack of protection against device removal in usb_set_interface()
>>>>>> [and/or possibly other callers of usb_ifnum_to_if()].
>>>>>>
>>>>>> Sequence of events is roughly as follows:
>>>>>>
>>>>>> - usb_set_interface() is called and proceeds to some point, possibly to
>>>>>> mutex_lock(hcd->bandwidth_mutex);
>>>>>> - Device removal event is detected, and usb_disable_device() is called
>>>>>
>>>>> At this point all interface drivers get unbound (their disconnect
>>>>> routines are called).
>>>>>
>>>>>> - usb_disable_device() starts removing actconfig data. It has removed
>>>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
>>>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
>>>>>> usb_ifnum_to_if()
>>>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
>>>>>> dev->actconfig->interface[i] is NULL
>>>>>> - crash
>>>>>>
>>>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
>>>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
>>>>>> still be racy since there is still no lock against device removal. I have not tried
>>>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
>>>>>
>>>>> As far as I know, protecting against these races is the responsibility
>>>>> of the USB interface drivers. They must make sure that their disconnect
>>>>> routines block until all outstanding calls to usb_set_interface return
>>>>> (in fact, until all outstanding device accesses have finished).
>>>>>
>>>>> For instance, in the log extract you showed, it's obvious that the
>>>>> uvc_start_streaming routine was running after the disconnect routine had
>>>>> returned, which looks like a bug in itself: Once the disconnect routine
>>>>> returns, the driver is not supposed to try to access the device at all
>>>>> because some other driver may now be bound to it.
>>>>>
>>>>> We can't just call usb_lock_device from within usb_set_interface,
>>>>> because usb_set_interface is often called with that lock already held.
>>>>>
>>>> I had a closer look into the uvcvideo driver and compared it to other usb
>>>> drivers, including drivers in drivers/media/usb/ which connect to the video
>>>> subsystem.
>>>>
>>>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
>>>
>>> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
>>> have been called usbvideo if the former hadn't already been in use.
>>
>> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
>
> No worries.
>
>>>> calls into file operations are ongoing. This is pretty widespread, and not
>>>> even limited to file operations (for example, there is a worker which is only
>>>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
>>>> ensures that no file operations are started after the call to ucv_disconnect,
>>>> but that is insufficient.
>>>>
>>>> Other drivers do have that protection and make sure that no usb operations
>>>> can happen after the disconnect call.
>>>>
>>>> The only remedy I can see is to rework the usbvideo driver and add the
>>>> necessary protections. At first glance, it looks like this may be a
>>>> substantial amount of work. I'd sign up for that, but before I start,
>>>> I would like to get input from the usbvideo community. Is such an effort
>>>> already going on ? If yes, how can I help ? If not, is the problem
>>>> understood and accepted ? Are there any ideas on how to solve it ?
>>>
>>> This is something that has been discussed before, and needs to be solved
>>> in the V4L2 framework itself, not in individual drivers. Not only would
>>> this avoid rolling out the same code manually everywhere (in different
>>> incorrect ways, as races are difficult to solve and implementations are
>>> more often wrong than right), but it will also avoid similar issues for
>>> non-USB devices.
>>
>> You mean code that ensures that no user-space v4l2 operation is in progress
>> after video_device_unregister / v4l2_device_unregister return ? I agree,
>> that would simplify the necessary changes on the uvc side.
>
> I was thinking about adding a new function to be called from the
> disconnect handler to implement the wait on end of userspace access, but
> video_device_unregister() seems an even better idea.
> v4l2_device_unregister() is probably not very useful as v4l2_device
> isn't exposed to userspace, only video_device is (and v4l2_subdev and
> media_device, but that's a different story, although probably still an
> issue for the latter in the UVC driver).
Actually, all that is needed is to take the ioctl serialization lock in the disconnect
function.
See last paragraph in 1.4.1 here:
https://hverkuil.home.xs4all.nl/spec/driver-api/v4l2-dev.html
Since uvc uses its own lock, you need to take that one.
>
> We also have a v4l2_device_disconnect() function which is supposed to
> handle hot-pluggable device disconnection, but it's fairly useless (I'd
> even say harmful as it gives the illusion that hotplugging is correctly
> handled, while in reality the media subsystem is plagged by hot-unplug
> issues :-S).
The v4l2_device_disconnect() is there to remove a v4l2_dev reference to
the device that is about to be removed when the disconnect() exists.
Otherwise v4l2_dev->dev would point to a missing device.
However, I wonder if it is still needed: commit 236c5441d703 from 2011 added
code to take a reference to v4l2_dev->dev in v4l2_device_register(). This
should prevent the device from disappearing until v4l2_device_unregister() is
called. I suspect that v4l2_device_disconnect() can be removed completely, and
instead v4l2_device_unregister() just calls put_device(v4l2_dev->dev).
I don't like v4l2_device_disconnect() either, so if this works, then that would
be a nice simplification.
Regards,
Hans
>
>> I actually came from the other side - I assumed that there is a reason
>> that is not already the case, and that the problem therefore has to be
>> resolved on the driver side.
>>
>> So I guess the next question is: Is this already being addressed on the
>> v4l2 side ?
>
> I'm not aware of anyone working on this.
>
>>> It shouldn't take more than two flags (to track user-space operations in
>>> progress and disconnection), a spinlock and a wait queue entry. I'm not
>>> sure if someone has already given it a try, and don't recall why this
>>> hasn't been done yet, as it should be fairly straightforward.
>>>
>>> On the UVC side, the work queue probably has to be flushed in
>>> uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
>>> Please make sure to look for potential race conditions between the URB
>>> completion handler and the .disconnect() handler (they shouldn't be any,
>>> but I haven't checked lately myself).
>>
>> My current solution for this problem is to call uvc_ctrl_cleanup_device()
>> from uvc_disconnect(), after uvc_unregister_video().
>
> I'd rather avoid that, as the cleanup functions in the UVC driver are
> generally meant to free memory when the last user disappears. While no
> new userspace operation will be started after disconnection once the
> above fix will be in place, there's one operation we can't avoid: the
> file release. This will access some of the memory allocated by the
> driver, and while the current implementation probably doesn't access in
> .release() any memory freed by uvc_ctrl_cleanup_device(), I think it's a
> good practice to only shut down the userspace API in .disconnect(), and
> free memory when the last reference is released.
>
>> An alternative might
>> be to add a uvc_ctrl_stop_device() function which would just cancel the
>> worker.
>
> I think that would be best. Should stream->async_wq (in uvc_video.c) be
> similarly flushed ? The driver does so in stream->async_wq(), called
> from uvc_video_stop_transfer(), itself called from
> uvc_video_stop_streaming() (among other places, that are either error
> paths or system suspend handling). The call stack goes to
> uvc_stop_streaming(), and, through the videobuf2 helpers, to
> vb2_queue_release() called by uvc_queue_release() itself called by
> uvc_v4l2_release() (in the non-disconnect case,
> uvc_video_stop_streaming() will be called through videobuf2 by
> uvc_queue_streamoff(), in response to a VIDIOC_STREAMOFF ioctl). We thus
> flush the workqueue too late, and also access the device in
> uvc_video_stop_streaming() long after .disconnect() returns.
>
> I think uvc_video_stop_streaming() could be called in uvc_disconnect()
> after uvc_unregister_video().
>
Hi Hans,
On Mon, Aug 17, 2020 at 01:00:49PM +0200, Hans Verkuil wrote:
> On 17/08/2020 01:51, Laurent Pinchart wrote:
> > On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
> >> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
> >>> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
> >>> file operations and disconnect in V4L2.
> >>>
> >>> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
> >>>> + [email protected]
> >>>> + [email protected]
> >>>> + [email protected]
> >>>>
> >>>> and changed subject
> >>>>
> >>>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> >>>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> >>>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> >>>>>> Examples are [1] [2] [3]. A typical backtrace is:
> >>>>>>
> >>>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> >>>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> >>>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> >>>>>> <6>[ 3490.516680] PGD 0 P4D 0
> >>>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> >>>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> >>>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> >>>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> >>>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> >>>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> >>>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> >>>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> >>>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> >>>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> >>>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> >>>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> >>>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> >>>>>> <4>[ 3490.516745] Call Trace:
> >>>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> >>>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> >>>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> >>>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> >>>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> >>>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> >>>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> >>>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> >>>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> >>>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> >>>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> >>>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> >>>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> >>>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> >>>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> >>>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> >>>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
> >>>>>>
> >>>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
> >>>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> >>>>>> problem boils down to lack of protection against device removal in usb_set_interface()
> >>>>>> [and/or possibly other callers of usb_ifnum_to_if()].
> >>>>>>
> >>>>>> Sequence of events is roughly as follows:
> >>>>>>
> >>>>>> - usb_set_interface() is called and proceeds to some point, possibly to
> >>>>>> mutex_lock(hcd->bandwidth_mutex);
> >>>>>> - Device removal event is detected, and usb_disable_device() is called
> >>>>>
> >>>>> At this point all interface drivers get unbound (their disconnect
> >>>>> routines are called).
> >>>>>
> >>>>>> - usb_disable_device() starts removing actconfig data. It has removed
> >>>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
> >>>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> >>>>>> usb_ifnum_to_if()
> >>>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> >>>>>> dev->actconfig->interface[i] is NULL
> >>>>>> - crash
> >>>>>>
> >>>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> >>>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> >>>>>> still be racy since there is still no lock against device removal. I have not tried
> >>>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
> >>>>>
> >>>>> As far as I know, protecting against these races is the responsibility
> >>>>> of the USB interface drivers. They must make sure that their disconnect
> >>>>> routines block until all outstanding calls to usb_set_interface return
> >>>>> (in fact, until all outstanding device accesses have finished).
> >>>>>
> >>>>> For instance, in the log extract you showed, it's obvious that the
> >>>>> uvc_start_streaming routine was running after the disconnect routine had
> >>>>> returned, which looks like a bug in itself: Once the disconnect routine
> >>>>> returns, the driver is not supposed to try to access the device at all
> >>>>> because some other driver may now be bound to it.
> >>>>>
> >>>>> We can't just call usb_lock_device from within usb_set_interface,
> >>>>> because usb_set_interface is often called with that lock already held.
> >>>>>
> >>>> I had a closer look into the uvcvideo driver and compared it to other usb
> >>>> drivers, including drivers in drivers/media/usb/ which connect to the video
> >>>> subsystem.
> >>>>
> >>>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
> >>>
> >>> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
> >>> have been called usbvideo if the former hadn't already been in use.
> >>
> >> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
> >
> > No worries.
> >
> >>>> calls into file operations are ongoing. This is pretty widespread, and not
> >>>> even limited to file operations (for example, there is a worker which is only
> >>>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
> >>>> ensures that no file operations are started after the call to ucv_disconnect,
> >>>> but that is insufficient.
> >>>>
> >>>> Other drivers do have that protection and make sure that no usb operations
> >>>> can happen after the disconnect call.
> >>>>
> >>>> The only remedy I can see is to rework the usbvideo driver and add the
> >>>> necessary protections. At first glance, it looks like this may be a
> >>>> substantial amount of work. I'd sign up for that, but before I start,
> >>>> I would like to get input from the usbvideo community. Is such an effort
> >>>> already going on ? If yes, how can I help ? If not, is the problem
> >>>> understood and accepted ? Are there any ideas on how to solve it ?
> >>>
> >>> This is something that has been discussed before, and needs to be solved
> >>> in the V4L2 framework itself, not in individual drivers. Not only would
> >>> this avoid rolling out the same code manually everywhere (in different
> >>> incorrect ways, as races are difficult to solve and implementations are
> >>> more often wrong than right), but it will also avoid similar issues for
> >>> non-USB devices.
> >>
> >> You mean code that ensures that no user-space v4l2 operation is in progress
> >> after video_device_unregister / v4l2_device_unregister return ? I agree,
> >> that would simplify the necessary changes on the uvc side.
> >
> > I was thinking about adding a new function to be called from the
> > disconnect handler to implement the wait on end of userspace access, but
> > video_device_unregister() seems an even better idea.
> > v4l2_device_unregister() is probably not very useful as v4l2_device
> > isn't exposed to userspace, only video_device is (and v4l2_subdev and
> > media_device, but that's a different story, although probably still an
> > issue for the latter in the UVC driver).
>
> Actually, all that is needed is to take the ioctl serialization lock in the disconnect
> function.
It's not just ioctls though, the other file operations also need to be
handled (read, write, mmap).
> See last paragraph in 1.4.1 here:
>
> https://hverkuil.home.xs4all.nl/spec/driver-api/v4l2-dev.html
>
> Since uvc uses its own lock, you need to take that one.
Drivers that use their own lock do so to avoid serializing all ioctls.
This means that different ioctls may be covered by different locks
(possibly with part of some ioctls running without locking). I don't
think we can just dismiss the issue saying those drivers need to
implement the disconnection manually. It would be much better to
integrate handling of userspace access with video_device_unregister()
like proposed above, as that will work for all drivers in a transparent
way. It would also be fairly simple to implement in the V4L2 core.
> > We also have a v4l2_device_disconnect() function which is supposed to
> > handle hot-pluggable device disconnection, but it's fairly useless (I'd
> > even say harmful as it gives the illusion that hotplugging is correctly
> > handled, while in reality the media subsystem is plagged by hot-unplug
> > issues :-S).
>
> The v4l2_device_disconnect() is there to remove a v4l2_dev reference to
> the device that is about to be removed when the disconnect() exists.
> Otherwise v4l2_dev->dev would point to a missing device.
>
> However, I wonder if it is still needed: commit 236c5441d703 from 2011 added
> code to take a reference to v4l2_dev->dev in v4l2_device_register(). This
> should prevent the device from disappearing until v4l2_device_unregister() is
> called. I suspect that v4l2_device_disconnect() can be removed completely, and
> instead v4l2_device_unregister() just calls put_device(v4l2_dev->dev).
>
> I don't like v4l2_device_disconnect() either, so if this works, then that would
> be a nice simplification.
>
> >> I actually came from the other side - I assumed that there is a reason
> >> that is not already the case, and that the problem therefore has to be
> >> resolved on the driver side.
> >>
> >> So I guess the next question is: Is this already being addressed on the
> >> v4l2 side ?
> >
> > I'm not aware of anyone working on this.
> >
> >>> It shouldn't take more than two flags (to track user-space operations in
> >>> progress and disconnection), a spinlock and a wait queue entry. I'm not
> >>> sure if someone has already given it a try, and don't recall why this
> >>> hasn't been done yet, as it should be fairly straightforward.
> >>>
> >>> On the UVC side, the work queue probably has to be flushed in
> >>> uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
> >>> Please make sure to look for potential race conditions between the URB
> >>> completion handler and the .disconnect() handler (they shouldn't be any,
> >>> but I haven't checked lately myself).
> >>
> >> My current solution for this problem is to call uvc_ctrl_cleanup_device()
> >> from uvc_disconnect(), after uvc_unregister_video().
> >
> > I'd rather avoid that, as the cleanup functions in the UVC driver are
> > generally meant to free memory when the last user disappears. While no
> > new userspace operation will be started after disconnection once the
> > above fix will be in place, there's one operation we can't avoid: the
> > file release. This will access some of the memory allocated by the
> > driver, and while the current implementation probably doesn't access in
> > .release() any memory freed by uvc_ctrl_cleanup_device(), I think it's a
> > good practice to only shut down the userspace API in .disconnect(), and
> > free memory when the last reference is released.
> >
> >> An alternative might
> >> be to add a uvc_ctrl_stop_device() function which would just cancel the
> >> worker.
> >
> > I think that would be best. Should stream->async_wq (in uvc_video.c) be
> > similarly flushed ? The driver does so in stream->async_wq(), called
> > from uvc_video_stop_transfer(), itself called from
> > uvc_video_stop_streaming() (among other places, that are either error
> > paths or system suspend handling). The call stack goes to
> > uvc_stop_streaming(), and, through the videobuf2 helpers, to
> > vb2_queue_release() called by uvc_queue_release() itself called by
> > uvc_v4l2_release() (in the non-disconnect case,
> > uvc_video_stop_streaming() will be called through videobuf2 by
> > uvc_queue_streamoff(), in response to a VIDIOC_STREAMOFF ioctl). We thus
> > flush the workqueue too late, and also access the device in
> > uvc_video_stop_streaming() long after .disconnect() returns.
> >
> > I think uvc_video_stop_streaming() could be called in uvc_disconnect()
> > after uvc_unregister_video().
--
Regards,
Laurent Pinchart
On 19/08/2020 03:30, Laurent Pinchart wrote:
> Hi Hans,
>
> On Mon, Aug 17, 2020 at 01:00:49PM +0200, Hans Verkuil wrote:
>> On 17/08/2020 01:51, Laurent Pinchart wrote:
>>> On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
>>>> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
>>>>> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
>>>>> file operations and disconnect in V4L2.
>>>>>
>>>>> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
>>>>>> + [email protected]
>>>>>> + [email protected]
>>>>>> + [email protected]
>>>>>>
>>>>>> and changed subject
>>>>>>
>>>>>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
>>>>>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
>>>>>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
>>>>>>>> Examples are [1] [2] [3]. A typical backtrace is:
>>>>>>>>
>>>>>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
>>>>>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
>>>>>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>>>>>>>> <6>[ 3490.516680] PGD 0 P4D 0
>>>>>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
>>>>>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
>>>>>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
>>>>>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
>>>>>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
>>>>>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
>>>>>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
>>>>>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
>>>>>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
>>>>>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
>>>>>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
>>>>>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
>>>>>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
>>>>>>>> <4>[ 3490.516745] Call Trace:
>>>>>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
>>>>>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
>>>>>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
>>>>>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
>>>>>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
>>>>>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
>>>>>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
>>>>>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
>>>>>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
>>>>>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
>>>>>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
>>>>>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
>>>>>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
>>>>>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
>>>>>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
>>>>>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
>>>>>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
>>>>>>>>
>>>>>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
>>>>>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
>>>>>>>> problem boils down to lack of protection against device removal in usb_set_interface()
>>>>>>>> [and/or possibly other callers of usb_ifnum_to_if()].
>>>>>>>>
>>>>>>>> Sequence of events is roughly as follows:
>>>>>>>>
>>>>>>>> - usb_set_interface() is called and proceeds to some point, possibly to
>>>>>>>> mutex_lock(hcd->bandwidth_mutex);
>>>>>>>> - Device removal event is detected, and usb_disable_device() is called
>>>>>>>
>>>>>>> At this point all interface drivers get unbound (their disconnect
>>>>>>> routines are called).
>>>>>>>
>>>>>>>> - usb_disable_device() starts removing actconfig data. It has removed
>>>>>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
>>>>>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
>>>>>>>> usb_ifnum_to_if()
>>>>>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
>>>>>>>> dev->actconfig->interface[i] is NULL
>>>>>>>> - crash
>>>>>>>>
>>>>>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
>>>>>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
>>>>>>>> still be racy since there is still no lock against device removal. I have not tried
>>>>>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
>>>>>>>
>>>>>>> As far as I know, protecting against these races is the responsibility
>>>>>>> of the USB interface drivers. They must make sure that their disconnect
>>>>>>> routines block until all outstanding calls to usb_set_interface return
>>>>>>> (in fact, until all outstanding device accesses have finished).
>>>>>>>
>>>>>>> For instance, in the log extract you showed, it's obvious that the
>>>>>>> uvc_start_streaming routine was running after the disconnect routine had
>>>>>>> returned, which looks like a bug in itself: Once the disconnect routine
>>>>>>> returns, the driver is not supposed to try to access the device at all
>>>>>>> because some other driver may now be bound to it.
>>>>>>>
>>>>>>> We can't just call usb_lock_device from within usb_set_interface,
>>>>>>> because usb_set_interface is often called with that lock already held.
>>>>>>>
>>>>>> I had a closer look into the uvcvideo driver and compared it to other usb
>>>>>> drivers, including drivers in drivers/media/usb/ which connect to the video
>>>>>> subsystem.
>>>>>>
>>>>>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
>>>>>
>>>>> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
>>>>> have been called usbvideo if the former hadn't already been in use.
>>>>
>>>> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
>>>
>>> No worries.
>>>
>>>>>> calls into file operations are ongoing. This is pretty widespread, and not
>>>>>> even limited to file operations (for example, there is a worker which is only
>>>>>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
>>>>>> ensures that no file operations are started after the call to ucv_disconnect,
>>>>>> but that is insufficient.
>>>>>>
>>>>>> Other drivers do have that protection and make sure that no usb operations
>>>>>> can happen after the disconnect call.
>>>>>>
>>>>>> The only remedy I can see is to rework the usbvideo driver and add the
>>>>>> necessary protections. At first glance, it looks like this may be a
>>>>>> substantial amount of work. I'd sign up for that, but before I start,
>>>>>> I would like to get input from the usbvideo community. Is such an effort
>>>>>> already going on ? If yes, how can I help ? If not, is the problem
>>>>>> understood and accepted ? Are there any ideas on how to solve it ?
>>>>>
>>>>> This is something that has been discussed before, and needs to be solved
>>>>> in the V4L2 framework itself, not in individual drivers. Not only would
>>>>> this avoid rolling out the same code manually everywhere (in different
>>>>> incorrect ways, as races are difficult to solve and implementations are
>>>>> more often wrong than right), but it will also avoid similar issues for
>>>>> non-USB devices.
>>>>
>>>> You mean code that ensures that no user-space v4l2 operation is in progress
>>>> after video_device_unregister / v4l2_device_unregister return ? I agree,
>>>> that would simplify the necessary changes on the uvc side.
>>>
>>> I was thinking about adding a new function to be called from the
>>> disconnect handler to implement the wait on end of userspace access, but
>>> video_device_unregister() seems an even better idea.
>>> v4l2_device_unregister() is probably not very useful as v4l2_device
>>> isn't exposed to userspace, only video_device is (and v4l2_subdev and
>>> media_device, but that's a different story, although probably still an
>>> issue for the latter in the UVC driver).
>>
>> Actually, all that is needed is to take the ioctl serialization lock in the disconnect
>> function.
>
> It's not just ioctls though, the other file operations also need to be
> handled (read, write, mmap).
Correct. And AFAIK all vb2-based drivers do take that serialization lock in
the file ops.
>
>> See last paragraph in 1.4.1 here:
>>
>> https://hverkuil.home.xs4all.nl/spec/driver-api/v4l2-dev.html
>>
>> Since uvc uses its own lock, you need to take that one.
>
> Drivers that use their own lock do so to avoid serializing all ioctls.
Let's agree to disagree :-)
In my experience it is just too hard to keep track of locking and with
very little advantages. You are the only developer that I know of that
insists on doing your own locking. Luckily you are very good at your
job, but everyone else uses the v4l2/vb2 core locking.
> This means that different ioctls may be covered by different locks
> (possibly with part of some ioctls running without locking). I don't
> think we can just dismiss the issue saying those drivers need to
> implement the disconnection manually. It would be much better to
> integrate handling of userspace access with video_device_unregister()
> like proposed above, as that will work for all drivers in a transparent
> way. It would also be fairly simple to implement in the V4L2 core.
I'm not really sure what you want. Should video_unregister_device()
take the core lock (i.e. vdev->lock)?
Regards,
Hans
>
>>> We also have a v4l2_device_disconnect() function which is supposed to
>>> handle hot-pluggable device disconnection, but it's fairly useless (I'd
>>> even say harmful as it gives the illusion that hotplugging is correctly
>>> handled, while in reality the media subsystem is plagged by hot-unplug
>>> issues :-S).
>>
>> The v4l2_device_disconnect() is there to remove a v4l2_dev reference to
>> the device that is about to be removed when the disconnect() exists.
>> Otherwise v4l2_dev->dev would point to a missing device.
>>
>> However, I wonder if it is still needed: commit 236c5441d703 from 2011 added
>> code to take a reference to v4l2_dev->dev in v4l2_device_register(). This
>> should prevent the device from disappearing until v4l2_device_unregister() is
>> called. I suspect that v4l2_device_disconnect() can be removed completely, and
>> instead v4l2_device_unregister() just calls put_device(v4l2_dev->dev).
>>
>> I don't like v4l2_device_disconnect() either, so if this works, then that would
>> be a nice simplification.
>>
>>>> I actually came from the other side - I assumed that there is a reason
>>>> that is not already the case, and that the problem therefore has to be
>>>> resolved on the driver side.
>>>>
>>>> So I guess the next question is: Is this already being addressed on the
>>>> v4l2 side ?
>>>
>>> I'm not aware of anyone working on this.
>>>
>>>>> It shouldn't take more than two flags (to track user-space operations in
>>>>> progress and disconnection), a spinlock and a wait queue entry. I'm not
>>>>> sure if someone has already given it a try, and don't recall why this
>>>>> hasn't been done yet, as it should be fairly straightforward.
>>>>>
>>>>> On the UVC side, the work queue probably has to be flushed in
>>>>> uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
>>>>> Please make sure to look for potential race conditions between the URB
>>>>> completion handler and the .disconnect() handler (they shouldn't be any,
>>>>> but I haven't checked lately myself).
>>>>
>>>> My current solution for this problem is to call uvc_ctrl_cleanup_device()
>>>> from uvc_disconnect(), after uvc_unregister_video().
>>>
>>> I'd rather avoid that, as the cleanup functions in the UVC driver are
>>> generally meant to free memory when the last user disappears. While no
>>> new userspace operation will be started after disconnection once the
>>> above fix will be in place, there's one operation we can't avoid: the
>>> file release. This will access some of the memory allocated by the
>>> driver, and while the current implementation probably doesn't access in
>>> .release() any memory freed by uvc_ctrl_cleanup_device(), I think it's a
>>> good practice to only shut down the userspace API in .disconnect(), and
>>> free memory when the last reference is released.
>>>
>>>> An alternative might
>>>> be to add a uvc_ctrl_stop_device() function which would just cancel the
>>>> worker.
>>>
>>> I think that would be best. Should stream->async_wq (in uvc_video.c) be
>>> similarly flushed ? The driver does so in stream->async_wq(), called
>>> from uvc_video_stop_transfer(), itself called from
>>> uvc_video_stop_streaming() (among other places, that are either error
>>> paths or system suspend handling). The call stack goes to
>>> uvc_stop_streaming(), and, through the videobuf2 helpers, to
>>> vb2_queue_release() called by uvc_queue_release() itself called by
>>> uvc_v4l2_release() (in the non-disconnect case,
>>> uvc_video_stop_streaming() will be called through videobuf2 by
>>> uvc_queue_streamoff(), in response to a VIDIOC_STREAMOFF ioctl). We thus
>>> flush the workqueue too late, and also access the device in
>>> uvc_video_stop_streaming() long after .disconnect() returns.
>>>
>>> I think uvc_video_stop_streaming() could be called in uvc_disconnect()
>>> after uvc_unregister_video().
>
Hi Hans,
On Wed, Aug 19, 2020 at 09:27:05AM +0200, Hans Verkuil wrote:
> On 19/08/2020 03:30, Laurent Pinchart wrote:
> > On Mon, Aug 17, 2020 at 01:00:49PM +0200, Hans Verkuil wrote:
> >> On 17/08/2020 01:51, Laurent Pinchart wrote:
> >>> On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
> >>>> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
> >>>>> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
> >>>>> file operations and disconnect in V4L2.
> >>>>>
> >>>>> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
> >>>>>> + [email protected]
> >>>>>> + [email protected]
> >>>>>> + [email protected]
> >>>>>>
> >>>>>> and changed subject
> >>>>>>
> >>>>>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> >>>>>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> >>>>>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> >>>>>>>> Examples are [1] [2] [3]. A typical backtrace is:
> >>>>>>>>
> >>>>>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> >>>>>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> >>>>>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> >>>>>>>> <6>[ 3490.516680] PGD 0 P4D 0
> >>>>>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> >>>>>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> >>>>>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> >>>>>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> >>>>>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> >>>>>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> >>>>>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> >>>>>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> >>>>>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> >>>>>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> >>>>>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> >>>>>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> >>>>>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> >>>>>>>> <4>[ 3490.516745] Call Trace:
> >>>>>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> >>>>>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> >>>>>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> >>>>>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> >>>>>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> >>>>>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> >>>>>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> >>>>>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> >>>>>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> >>>>>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> >>>>>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> >>>>>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> >>>>>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> >>>>>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> >>>>>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> >>>>>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> >>>>>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
> >>>>>>>>
> >>>>>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
> >>>>>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> >>>>>>>> problem boils down to lack of protection against device removal in usb_set_interface()
> >>>>>>>> [and/or possibly other callers of usb_ifnum_to_if()].
> >>>>>>>>
> >>>>>>>> Sequence of events is roughly as follows:
> >>>>>>>>
> >>>>>>>> - usb_set_interface() is called and proceeds to some point, possibly to
> >>>>>>>> mutex_lock(hcd->bandwidth_mutex);
> >>>>>>>> - Device removal event is detected, and usb_disable_device() is called
> >>>>>>>
> >>>>>>> At this point all interface drivers get unbound (their disconnect
> >>>>>>> routines are called).
> >>>>>>>
> >>>>>>>> - usb_disable_device() starts removing actconfig data. It has removed
> >>>>>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
> >>>>>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> >>>>>>>> usb_ifnum_to_if()
> >>>>>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> >>>>>>>> dev->actconfig->interface[i] is NULL
> >>>>>>>> - crash
> >>>>>>>>
> >>>>>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> >>>>>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> >>>>>>>> still be racy since there is still no lock against device removal. I have not tried
> >>>>>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
> >>>>>>>
> >>>>>>> As far as I know, protecting against these races is the responsibility
> >>>>>>> of the USB interface drivers. They must make sure that their disconnect
> >>>>>>> routines block until all outstanding calls to usb_set_interface return
> >>>>>>> (in fact, until all outstanding device accesses have finished).
> >>>>>>>
> >>>>>>> For instance, in the log extract you showed, it's obvious that the
> >>>>>>> uvc_start_streaming routine was running after the disconnect routine had
> >>>>>>> returned, which looks like a bug in itself: Once the disconnect routine
> >>>>>>> returns, the driver is not supposed to try to access the device at all
> >>>>>>> because some other driver may now be bound to it.
> >>>>>>>
> >>>>>>> We can't just call usb_lock_device from within usb_set_interface,
> >>>>>>> because usb_set_interface is often called with that lock already held.
> >>>>>>>
> >>>>>> I had a closer look into the uvcvideo driver and compared it to other usb
> >>>>>> drivers, including drivers in drivers/media/usb/ which connect to the video
> >>>>>> subsystem.
> >>>>>>
> >>>>>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
> >>>>>
> >>>>> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
> >>>>> have been called usbvideo if the former hadn't already been in use.
> >>>>
> >>>> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
> >>>
> >>> No worries.
> >>>
> >>>>>> calls into file operations are ongoing. This is pretty widespread, and not
> >>>>>> even limited to file operations (for example, there is a worker which is only
> >>>>>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
> >>>>>> ensures that no file operations are started after the call to ucv_disconnect,
> >>>>>> but that is insufficient.
> >>>>>>
> >>>>>> Other drivers do have that protection and make sure that no usb operations
> >>>>>> can happen after the disconnect call.
> >>>>>>
> >>>>>> The only remedy I can see is to rework the usbvideo driver and add the
> >>>>>> necessary protections. At first glance, it looks like this may be a
> >>>>>> substantial amount of work. I'd sign up for that, but before I start,
> >>>>>> I would like to get input from the usbvideo community. Is such an effort
> >>>>>> already going on ? If yes, how can I help ? If not, is the problem
> >>>>>> understood and accepted ? Are there any ideas on how to solve it ?
> >>>>>
> >>>>> This is something that has been discussed before, and needs to be solved
> >>>>> in the V4L2 framework itself, not in individual drivers. Not only would
> >>>>> this avoid rolling out the same code manually everywhere (in different
> >>>>> incorrect ways, as races are difficult to solve and implementations are
> >>>>> more often wrong than right), but it will also avoid similar issues for
> >>>>> non-USB devices.
> >>>>
> >>>> You mean code that ensures that no user-space v4l2 operation is in progress
> >>>> after video_device_unregister / v4l2_device_unregister return ? I agree,
> >>>> that would simplify the necessary changes on the uvc side.
> >>>
> >>> I was thinking about adding a new function to be called from the
> >>> disconnect handler to implement the wait on end of userspace access, but
> >>> video_device_unregister() seems an even better idea.
> >>> v4l2_device_unregister() is probably not very useful as v4l2_device
> >>> isn't exposed to userspace, only video_device is (and v4l2_subdev and
> >>> media_device, but that's a different story, although probably still an
> >>> issue for the latter in the UVC driver).
> >>
> >> Actually, all that is needed is to take the ioctl serialization lock in the disconnect
> >> function.
> >
> > It's not just ioctls though, the other file operations also need to be
> > handled (read, write, mmap).
>
> Correct. And AFAIK all vb2-based drivers do take that serialization lock in
> the file ops.
>
> >> See last paragraph in 1.4.1 here:
> >>
> >> https://hverkuil.home.xs4all.nl/spec/driver-api/v4l2-dev.html
> >>
> >> Since uvc uses its own lock, you need to take that one.
> >
> > Drivers that use their own lock do so to avoid serializing all ioctls.
>
> Let's agree to disagree :-)
>
> In my experience it is just too hard to keep track of locking and with
> very little advantages. You are the only developer that I know of that
> insists on doing your own locking. Luckily you are very good at your
> job, but everyone else uses the v4l2/vb2 core locking.
>
> > This means that different ioctls may be covered by different locks
> > (possibly with part of some ioctls running without locking). I don't
> > think we can just dismiss the issue saying those drivers need to
> > implement the disconnection manually. It would be much better to
> > integrate handling of userspace access with video_device_unregister()
> > like proposed above, as that will work for all drivers in a transparent
> > way. It would also be fairly simple to implement in the V4L2 core.
>
> I'm not really sure what you want. Should video_unregister_device()
> take the core lock (i.e. vdev->lock)?
struct video_device {
...
wait_queue_head_t uapi_wait;
bool uapi_call_in_progress;
bool unregister_in_progress;
...
};
static inline int video_device_uapi_call_enter(struct video_device *vdev)
{
int ret = 0;
spin_lock(&vdev->uapi_wait.lock);
if (likely(!vdev->unregister_in_progress)) {
vdev->uapi_call_in_progress = true;
} else {
ret = -ENOTCONN;
}
spin_unlock(&vdev->uapi_wait.lock);
return ret;
}
static inline void video_device_uapi_call_exit(struct video_device *vdev)
{
spin_lock(&vdev->uapi_wait.lock);
vdev->uapi_call_in_progress = false;
wake_up_locked(&vdev->uapi_wait);
spin_unlock(&vdev->uapi_wait.lock);
}
void video_unregister_device(struct video_device *vdev)
{
...
...
}
void video_unregister_device(struct video_device *vdev)
{
/* Check if vdev was ever registered at all */
if (!vdev || !video_is_registered(vdev))
return;
+ spin_lock(&vdev->uapi_wait.lock);
+ vdev->unregister_in_progress = true;
+ if (vdev->uapi_call_in_progress)
+ wait_event_interruptible_locked(vdev->uapi_wait,
+ !vdev->uapi_call_in_progress);
+ spin_unlock(&vdev->uapi_wait.lock);
+
mutex_lock(&videodev_lock);
/* This must be in a critical section to prevent a race with v4l2_open.
* Once this bit has been cleared video_get may never be called again.
*/
clear_bit(V4L2_FL_REGISTERED, &vdev->flags);
mutex_unlock(&videodev_lock);
device_unregister(&vdev->dev);
}
static long v4l2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
struct video_device *vdev = video_devdata(filp);
- int ret = -ENODEV;
+ int ret;
+ ret = video_device_uapi_call_enter(vdev);
+ if (ret)
+ return ret;
+
if (vdev->fops->unlocked_ioctl) {
if (video_is_registered(vdev))
ret = vdev->fops->unlocked_ioctl(filp, cmd, arg);
+ else
+ ret = -ENODEV;
} else
ret = -ENOTTY;
+ video_device_uapi_call_exit(vdev);
+
return ret;
}
and similarly for the other fops. A bit more work is needed to merge
unregister_in_progress with the V4L2_FL_REGISTERED bit.
> >>> We also have a v4l2_device_disconnect() function which is supposed to
> >>> handle hot-pluggable device disconnection, but it's fairly useless (I'd
> >>> even say harmful as it gives the illusion that hotplugging is correctly
> >>> handled, while in reality the media subsystem is plagged by hot-unplug
> >>> issues :-S).
> >>
> >> The v4l2_device_disconnect() is there to remove a v4l2_dev reference to
> >> the device that is about to be removed when the disconnect() exists.
> >> Otherwise v4l2_dev->dev would point to a missing device.
> >>
> >> However, I wonder if it is still needed: commit 236c5441d703 from 2011 added
> >> code to take a reference to v4l2_dev->dev in v4l2_device_register(). This
> >> should prevent the device from disappearing until v4l2_device_unregister() is
> >> called. I suspect that v4l2_device_disconnect() can be removed completely, and
> >> instead v4l2_device_unregister() just calls put_device(v4l2_dev->dev).
> >>
> >> I don't like v4l2_device_disconnect() either, so if this works, then that would
> >> be a nice simplification.
> >>
> >>>> I actually came from the other side - I assumed that there is a reason
> >>>> that is not already the case, and that the problem therefore has to be
> >>>> resolved on the driver side.
> >>>>
> >>>> So I guess the next question is: Is this already being addressed on the
> >>>> v4l2 side ?
> >>>
> >>> I'm not aware of anyone working on this.
> >>>
> >>>>> It shouldn't take more than two flags (to track user-space operations in
> >>>>> progress and disconnection), a spinlock and a wait queue entry. I'm not
> >>>>> sure if someone has already given it a try, and don't recall why this
> >>>>> hasn't been done yet, as it should be fairly straightforward.
> >>>>>
> >>>>> On the UVC side, the work queue probably has to be flushed in
> >>>>> uvc_disconnect(). I'd keep the destroy call in uvc_delete() though.
> >>>>> Please make sure to look for potential race conditions between the URB
> >>>>> completion handler and the .disconnect() handler (they shouldn't be any,
> >>>>> but I haven't checked lately myself).
> >>>>
> >>>> My current solution for this problem is to call uvc_ctrl_cleanup_device()
> >>>> from uvc_disconnect(), after uvc_unregister_video().
> >>>
> >>> I'd rather avoid that, as the cleanup functions in the UVC driver are
> >>> generally meant to free memory when the last user disappears. While no
> >>> new userspace operation will be started after disconnection once the
> >>> above fix will be in place, there's one operation we can't avoid: the
> >>> file release. This will access some of the memory allocated by the
> >>> driver, and while the current implementation probably doesn't access in
> >>> .release() any memory freed by uvc_ctrl_cleanup_device(), I think it's a
> >>> good practice to only shut down the userspace API in .disconnect(), and
> >>> free memory when the last reference is released.
> >>>
> >>>> An alternative might
> >>>> be to add a uvc_ctrl_stop_device() function which would just cancel the
> >>>> worker.
> >>>
> >>> I think that would be best. Should stream->async_wq (in uvc_video.c) be
> >>> similarly flushed ? The driver does so in stream->async_wq(), called
> >>> from uvc_video_stop_transfer(), itself called from
> >>> uvc_video_stop_streaming() (among other places, that are either error
> >>> paths or system suspend handling). The call stack goes to
> >>> uvc_stop_streaming(), and, through the videobuf2 helpers, to
> >>> vb2_queue_release() called by uvc_queue_release() itself called by
> >>> uvc_v4l2_release() (in the non-disconnect case,
> >>> uvc_video_stop_streaming() will be called through videobuf2 by
> >>> uvc_queue_streamoff(), in response to a VIDIOC_STREAMOFF ioctl). We thus
> >>> flush the workqueue too late, and also access the device in
> >>> uvc_video_stop_streaming() long after .disconnect() returns.
> >>>
> >>> I think uvc_video_stop_streaming() could be called in uvc_disconnect()
> >>> after uvc_unregister_video().
--
Regards,
Laurent Pinchart
On Wed, Aug 19, 2020 at 04:30:02AM +0300, Laurent Pinchart wrote:
> Hi Hans,
>
> On Mon, Aug 17, 2020 at 01:00:49PM +0200, Hans Verkuil wrote:
> > On 17/08/2020 01:51, Laurent Pinchart wrote:
> > > On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
> > >> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
> > >>> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
> > >>> file operations and disconnect in V4L2.
> > >>>
> > >>> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
> > >>>> + [email protected]
> > >>>> + [email protected]
> > >>>> + [email protected]
> > >>>>
> > >>>> and changed subject
> > >>>>
> > >>>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> > >>>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> > >>>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> > >>>>>> Examples are [1] [2] [3]. A typical backtrace is:
> > >>>>>>
> > >>>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> > >>>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> > >>>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > >>>>>> <6>[ 3490.516680] PGD 0 P4D 0
> > >>>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> > >>>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> > >>>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> > >>>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> > >>>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> > >>>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> > >>>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> > >>>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> > >>>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> > >>>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> > >>>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> > >>>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> > >>>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> > >>>>>> <4>[ 3490.516745] Call Trace:
> > >>>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> > >>>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> > >>>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> > >>>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> > >>>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> > >>>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> > >>>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> > >>>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> > >>>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> > >>>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> > >>>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> > >>>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> > >>>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> > >>>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> > >>>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> > >>>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> > >>>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
> > >>>>>>
> > >>>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
> > >>>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> > >>>>>> problem boils down to lack of protection against device removal in usb_set_interface()
> > >>>>>> [and/or possibly other callers of usb_ifnum_to_if()].
> > >>>>>>
> > >>>>>> Sequence of events is roughly as follows:
> > >>>>>>
> > >>>>>> - usb_set_interface() is called and proceeds to some point, possibly to
> > >>>>>> mutex_lock(hcd->bandwidth_mutex);
> > >>>>>> - Device removal event is detected, and usb_disable_device() is called
> > >>>>>
> > >>>>> At this point all interface drivers get unbound (their disconnect
> > >>>>> routines are called).
> > >>>>>
> > >>>>>> - usb_disable_device() starts removing actconfig data. It has removed
> > >>>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
> > >>>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> > >>>>>> usb_ifnum_to_if()
> > >>>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> > >>>>>> dev->actconfig->interface[i] is NULL
> > >>>>>> - crash
> > >>>>>>
> > >>>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> > >>>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> > >>>>>> still be racy since there is still no lock against device removal. I have not tried
> > >>>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
> > >>>>>
> > >>>>> As far as I know, protecting against these races is the responsibility
> > >>>>> of the USB interface drivers. They must make sure that their disconnect
> > >>>>> routines block until all outstanding calls to usb_set_interface return
> > >>>>> (in fact, until all outstanding device accesses have finished).
> > >>>>>
> > >>>>> For instance, in the log extract you showed, it's obvious that the
> > >>>>> uvc_start_streaming routine was running after the disconnect routine had
> > >>>>> returned, which looks like a bug in itself: Once the disconnect routine
> > >>>>> returns, the driver is not supposed to try to access the device at all
> > >>>>> because some other driver may now be bound to it.
> > >>>>>
> > >>>>> We can't just call usb_lock_device from within usb_set_interface,
> > >>>>> because usb_set_interface is often called with that lock already held.
> > >>>>>
> > >>>> I had a closer look into the uvcvideo driver and compared it to other usb
> > >>>> drivers, including drivers in drivers/media/usb/ which connect to the video
> > >>>> subsystem.
> > >>>>
> > >>>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
> > >>>
> > >>> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
> > >>> have been called usbvideo if the former hadn't already been in use.
> > >>
> > >> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
> > >
> > > No worries.
> > >
> > >>>> calls into file operations are ongoing. This is pretty widespread, and not
> > >>>> even limited to file operations (for example, there is a worker which is only
> > >>>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
> > >>>> ensures that no file operations are started after the call to ucv_disconnect,
> > >>>> but that is insufficient.
> > >>>>
> > >>>> Other drivers do have that protection and make sure that no usb operations
> > >>>> can happen after the disconnect call.
> > >>>>
> > >>>> The only remedy I can see is to rework the usbvideo driver and add the
> > >>>> necessary protections. At first glance, it looks like this may be a
> > >>>> substantial amount of work. I'd sign up for that, but before I start,
> > >>>> I would like to get input from the usbvideo community. Is such an effort
> > >>>> already going on ? If yes, how can I help ? If not, is the problem
> > >>>> understood and accepted ? Are there any ideas on how to solve it ?
> > >>>
> > >>> This is something that has been discussed before, and needs to be solved
> > >>> in the V4L2 framework itself, not in individual drivers. Not only would
> > >>> this avoid rolling out the same code manually everywhere (in different
> > >>> incorrect ways, as races are difficult to solve and implementations are
> > >>> more often wrong than right), but it will also avoid similar issues for
> > >>> non-USB devices.
> > >>
> > >> You mean code that ensures that no user-space v4l2 operation is in progress
> > >> after video_device_unregister / v4l2_device_unregister return ? I agree,
> > >> that would simplify the necessary changes on the uvc side.
> > >
> > > I was thinking about adding a new function to be called from the
> > > disconnect handler to implement the wait on end of userspace access, but
> > > video_device_unregister() seems an even better idea.
> > > v4l2_device_unregister() is probably not very useful as v4l2_device
> > > isn't exposed to userspace, only video_device is (and v4l2_subdev and
> > > media_device, but that's a different story, although probably still an
> > > issue for the latter in the UVC driver).
> >
> > Actually, all that is needed is to take the ioctl serialization lock in the disconnect
> > function.
>
> It's not just ioctls though, the other file operations also need to be
> handled (read, write, mmap).
>
> > See last paragraph in 1.4.1 here:
> >
> > https://hverkuil.home.xs4all.nl/spec/driver-api/v4l2-dev.html
> >
> > Since uvc uses its own lock, you need to take that one.
>
> Drivers that use their own lock do so to avoid serializing all ioctls.
> This means that different ioctls may be covered by different locks
> (possibly with part of some ioctls running without locking). I don't
> think we can just dismiss the issue saying those drivers need to
> implement the disconnection manually. It would be much better to
> integrate handling of userspace access with video_device_unregister()
> like proposed above, as that will work for all drivers in a transparent
> way. It would also be fairly simple to implement in the V4L2 core.
>
I don't think it is going to be that simple. Here is a traceback I was able
to collect after playing with msleep() in various locations. This is
_after_ uvc_disconnect() has returned, but before usb_disable_device()
set dev->actconfig to NULL.
[ 9619.003726] config->interface[0] is NULL
[ 9619.009114] WARNING: CPU: 0 PID: 2757 at drivers/usb/core/usb.c:285 usb_ifnum_to_if+0x81/0x85
^^^ added log message and WARN() to prevent crash
[ 9619.018647] Modules linked in: snd_seq_dummy snd_seq bridge stp llc veth tun nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp xfrm6_mode_tunnel xfrm6_mode_transport xfrm4_mode_tunnel xfrm4_mode_transport esp6 ah6 ip6t_REJECT ip6t_ipv6header wacom rfcomm cmac algif_hash algif_skcipher af_alg uinput snd_soc_sst_bxt_da7219_max98357a snd_soc_hdac_hdmi snd_soc_dmic snd_soc_skl_ssp_clk snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core snd_intel_nhlt snd_hda_core snd_soc_max98357a acpi_als kfifo_buf industrialio snd_soc_da7219 ipt_MASQUERADE fuse snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device uvcvideo videobuf2_v4l2 videobuf2_common videobuf2_vmalloc videobuf2_memops iwlmvm iwl7000_mac80211 lzo_rle lzo_compress
[ 9619.097545] zram r8152 mii iwlwifi cfg80211 btusb btrtl btintel btbcm bluetooth ecdh_generic joydev
[ 9619.107762] CPU: 0 PID: 2757 Comm: inotify_reader Not tainted 4.19.139 #20
[ 9619.115443] Hardware name: Google Phaser/Phaser, BIOS Google_Phaser.10952.0.0 08/09/2018
[ 9619.124490] RIP: 0010:usb_ifnum_to_if+0x81/0x85
[ 9619.129542] Code: 02 44 39 f1 74 0a 48 ff c6 48 39 c6 72 df 31 db 48 89 d8 5b 41 5e 41 5f 5d c3 31 db 48 c7 c7 08 f8 cd a8 31 c0 e8 dd 1b 9f ff <0f> 0b eb e2 0f 1f 44 00 00 55 48 89 e5 44 8b 47 10 31 c0 45 85 c0
[ 9619.150530] RSP: 0018:ffff9ee20141fa58 EFLAGS: 00010246
[ 9619.156368] RAX: 438a0e5a525f1800 RBX: 0000000000000000 RCX: 0000000000000000
[ 9619.164339] RDX: ffff975477a1de90 RSI: ffff975477a153d0 RDI: ffff975477a153d0
[ 9619.172309] RBP: ffff9ee20141fa70 R08: 000000000000002c R09: 002daec189138d78
[ 9619.180279] R10: 0000001000000000 R11: ffffffffa7da42e6 R12: ffff975459594800
[ 9619.188241] R13: 0000000000000001 R14: 0000000000000001 R15: ffff975465376400
[ 9619.196212] FS: 00007ddebffd6700(0000) GS:ffff975477a00000(0000) knlGS:0000000000000000
[ 9619.205251] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9619.211670] CR2: 0000000012c83000 CR3: 00000001bbaf8000 CR4: 0000000000340ef0
[ 9619.219641] Call Trace:
[ 9619.222378] usb_set_interface+0x3b/0x2f3
[ 9619.226862] uvc_video_stop_streaming+0x38/0x81 [uvcvideo]
[ 9619.232983] uvc_stop_streaming+0x21/0x49 [uvcvideo]
[ 9619.238531] __vb2_queue_cancel+0x33/0x249 [videobuf2_common]
[ 9619.244951] vb2_core_queue_release+0x1c/0x45 [videobuf2_common]
[ 9619.251662] uvc_queue_release+0x26/0x32 [uvcvideo]
[ 9619.257114] uvc_v4l2_release+0x41/0xdd [uvcvideo]
[ 9619.262469] v4l2_release+0x99/0xed
[ 9619.266367] __fput+0xf0/0x1ea
[ 9619.269778] task_work_run+0x7f/0xa7
[ 9619.273769] do_exit+0x1d1/0x6eb
[ 9619.277375] do_group_exit+0x9d/0xac
[ 9619.281367] get_signal+0x135/0x482
[ 9619.285262] do_signal+0x4a/0x63c
[ 9619.288965] exit_to_usermode_loop+0x86/0xa5
[ 9619.293732] do_syscall_64+0x171/0x269
[ 9619.297919] ? __do_page_fault+0x272/0x474
[ 9619.302495] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9619.308156] RIP: 0033:0x7ddec156dc26
[ 9619.312161] Code: Bad RIP value.
[ 9619.315763] RSP: 002b:00007ddebffd5c10 EFLAGS: 00000293 ORIG_RAX: 0000000000000017
[ 9619.324221] RAX: fffffffffffffdfe RBX: 000000000000000a RCX: 00007ddec156dc26
[ 9619.332194] RDX: 0000000000000000 RSI: 00007ddebffd5d28 RDI: 000000000000000a
[ 9619.340166] RBP: 00007ddebffd5c50 R08: 0000000000000000 R09: 0000000000000000
[ 9619.348156] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ddebffd5d28
[ 9619.356127] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
usb_set_interface() should not be called anymore after uvc_disconnect(),
or at east I think so (is that documented anywhere ?).
Yet, that obviously happens, and it happens completely outside lock
control. And this is just one instance where I was actually able
to see the problem. I am quite sure that there are more.
> > > We also have a v4l2_device_disconnect() function which is supposed to
> > > handle hot-pluggable device disconnection, but it's fairly useless (I'd
> > > even say harmful as it gives the illusion that hotplugging is correctly
> > > handled, while in reality the media subsystem is plagged by hot-unplug
> > > issues :-S).
> >
> > The v4l2_device_disconnect() is there to remove a v4l2_dev reference to
> > the device that is about to be removed when the disconnect() exists.
> > Otherwise v4l2_dev->dev would point to a missing device.
> >
> > However, I wonder if it is still needed: commit 236c5441d703 from 2011 added
> > code to take a reference to v4l2_dev->dev in v4l2_device_register(). This
> > should prevent the device from disappearing until v4l2_device_unregister() is
> > called. I suspect that v4l2_device_disconnect() can be removed completely, and
> > instead v4l2_device_unregister() just calls put_device(v4l2_dev->dev).
> >
> > I don't like v4l2_device_disconnect() either, so if this works, then that would
> > be a nice simplification.
> >
Maybe I am missing something, but v4l2_device_unregister() already calls
v4l2_device_disconnect(). With that, I don't see a difference to just calling
v4l2_device_unregister().
Anyway, I'll keep digging.
Thanks,
Guenter
On Wed, Aug 19, 2020 at 04:08:51PM -0700, Guenter Roeck wrote:
> usb_set_interface() should not be called anymore after uvc_disconnect(),
> or at east I think so (is that documented anywhere ?).
It may be documented somewhere, but basically it goes without saying.
A main feature of the device model design is that drivers get bound to
devices by having their probe routine called, and they get unbound by
having their disconnect routine called. It should go without saying
that once a driver is unbound from a device, it must not communicate
with that device any more.
It might be nice if this requirement could be enforced (say in the USB
core), but doing so is impractical. It would require every I/O request
to include some sort of cookie proving that the caller is authorized to
make the request. That's not how the kernel works; it trusts drivers
to generally do the right thing without constant checking.
Alan Stern
Hi Guenter,
On Wed, Aug 19, 2020 at 04:08:51PM -0700, Guenter Roeck wrote:
> On Wed, Aug 19, 2020 at 04:30:02AM +0300, Laurent Pinchart wrote:
> > On Mon, Aug 17, 2020 at 01:00:49PM +0200, Hans Verkuil wrote:
> > > On 17/08/2020 01:51, Laurent Pinchart wrote:
> > > > On Sun, Aug 16, 2020 at 08:54:18AM -0700, Guenter Roeck wrote:
> > > >> On 8/16/20 5:18 AM, Laurent Pinchart wrote:
> > > >>> CC'ing Hans Verkuil and Sakari Ailus for the discussion about handling
> > > >>> file operations and disconnect in V4L2.
> > > >>>
> > > >>> On Sat, Aug 15, 2020 at 05:33:15PM -0700, Guenter Roeck wrote:
> > > >>>> + [email protected]
> > > >>>> + [email protected]
> > > >>>> + [email protected]
> > > >>>>
> > > >>>> and changed subject
> > > >>>>
> > > >>>> On Fri, Aug 14, 2020 at 10:07:39PM -0400, Alan Stern wrote:
> > > >>>>> On Fri, Aug 14, 2020 at 04:07:03PM -0700, Guenter Roeck wrote:
> > > >>>>>> Hi all,
> > > >>>>>>
> > > >>>>>> over time, there have been a number of reports of crashes in usb_ifnum_to_if(),
> > > >>>>>> called from usb_hcd_alloc_bandwidth, which is in turn called from usb_set_interface().
> > > >>>>>> Examples are [1] [2] [3]. A typical backtrace is:
> > > >>>>>>
> > > >>>>>> <3>[ 3489.445468] intel_sst_acpi 808622A8:00: sst: Busy wait failed, cant send this msg
> > > >>>>>> <6>[ 3490.507273] usb 1-4: USB disconnect, device number 3
> > > >>>>>> <1>[ 3490.516670] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > >>>>>> <6>[ 3490.516680] PGD 0 P4D 0
> > > >>>>>> <4>[ 3490.516687] Oops: 0000 [#1] PREEMPT SMP PTI
> > > >>>>>> <4>[ 3490.516693] CPU: 0 PID: 5633 Comm: V4L2CaptureThre Not tainted 4.19.113-08536-g5d29ca36db06 #1
> > > >>>>>> <4>[ 3490.516696] Hardware name: GOOGLE Edgar, BIOS Google_Edgar.7287.167.156 03/25/2019
> > > >>>>>> <4>[ 3490.516706] RIP: 0010:usb_ifnum_to_if+0x29/0x40
> > > >>>>>> <4>[ 3490.516710] Code: ee 0f 1f 44 00 00 55 48 89 e5 48 8b 8f f8 03 00 00 48 85 c9 74 27 44 0f b6 41 04 4d 85 c0 74 1d 31 ff 48 8b 84 f9 98 00 00 00 <48> 8b 10 0f b6 52 02 39 f2 74 0a 48 ff c7 4c 39 c7 72 e5 31 c0 5d
> > > >>>>>> <4>[ 3490.516714] RSP: 0018:ffffa46f42a47a80 EFLAGS: 00010246
> > > >>>>>> <4>[ 3490.516718] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff904a396c9000
> > > >>>>>> <4>[ 3490.516721] RDX: ffff904a39641320 RSI: 0000000000000001 RDI: 0000000000000000
> > > >>>>>> <4>[ 3490.516724] RBP: ffffa46f42a47a80 R08: 0000000000000002 R09: 0000000000000000
> > > >>>>>> <4>[ 3490.516727] R10: 0000000000009975 R11: 0000000000000009 R12: 0000000000000000
> > > >>>>>> <4>[ 3490.516731] R13: ffff904a396b3800 R14: ffff904a39e88000 R15: 0000000000000000
> > > >>>>>> <4>[ 3490.516735] FS: 00007f396448e700(0000) GS:ffff904a3ba00000(0000) knlGS:0000000000000000
> > > >>>>>> <4>[ 3490.516738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >>>>>> <4>[ 3490.516742] CR2: 0000000000000000 CR3: 000000016cb46000 CR4: 00000000001006f0
> > > >>>>>> <4>[ 3490.516745] Call Trace:
> > > >>>>>> <4>[ 3490.516756] usb_hcd_alloc_bandwidth+0x1ee/0x30f
> > > >>>>>> <4>[ 3490.516762] usb_set_interface+0x1a3/0x2b7
> > > >>>>>> <4>[ 3490.516773] uvc_video_start_transfer+0x29b/0x4b8 [uvcvideo]
> > > >>>>>> <4>[ 3490.516781] uvc_video_start_streaming+0x91/0xdd [uvcvideo]
> > > >>>>>> <4>[ 3490.516787] uvc_start_streaming+0x28/0x5d [uvcvideo]
> > > >>>>>> <4>[ 3490.516795] vb2_start_streaming+0x61/0x143 [videobuf2_common]
> > > >>>>>> <4>[ 3490.516801] vb2_core_streamon+0xf7/0x10f [videobuf2_common]
> > > >>>>>> <4>[ 3490.516807] uvc_queue_streamon+0x2e/0x41 [uvcvideo]
> > > >>>>>> <4>[ 3490.516814] uvc_ioctl_streamon+0x42/0x5c [uvcvideo]
> > > >>>>>> <4>[ 3490.516820] __video_do_ioctl+0x33d/0x42a
> > > >>>>>> <4>[ 3490.516826] video_usercopy+0x34e/0x5ff
> > > >>>>>> <4>[ 3490.516831] ? video_ioctl2+0x16/0x16
> > > >>>>>> <4>[ 3490.516837] v4l2_ioctl+0x46/0x53
> > > >>>>>> <4>[ 3490.516843] do_vfs_ioctl+0x50a/0x76f
> > > >>>>>> <4>[ 3490.516848] ksys_ioctl+0x58/0x83
> > > >>>>>> <4>[ 3490.516853] __x64_sys_ioctl+0x1a/0x1e
> > > >>>>>> <4>[ 3490.516858] do_syscall_64+0x54/0xde
> > > >>>>>>
> > > >>>>>> I have been able to reproduce the problem on a Chromebook by strategically placing
> > > >>>>>> msleep() calls into usb_set_interface() and usb_disable_device(). Ultimately, the
> > > >>>>>> problem boils down to lack of protection against device removal in usb_set_interface()
> > > >>>>>> [and/or possibly other callers of usb_ifnum_to_if()].
> > > >>>>>>
> > > >>>>>> Sequence of events is roughly as follows:
> > > >>>>>>
> > > >>>>>> - usb_set_interface() is called and proceeds to some point, possibly to
> > > >>>>>> mutex_lock(hcd->bandwidth_mutex);
> > > >>>>>> - Device removal event is detected, and usb_disable_device() is called
> > > >>>>>
> > > >>>>> At this point all interface drivers get unbound (their disconnect
> > > >>>>> routines are called).
> > > >>>>>
> > > >>>>>> - usb_disable_device() starts removing actconfig data. It has removed
> > > >>>>>> and cleared dev->actconfig->interface[i], but not dev->actconfig
> > > >>>>>> - usb_set_interface() calls usb_hcd_alloc_bandwidth(), which calls
> > > >>>>>> usb_ifnum_to_if()
> > > >>>>>> - In usb_ifnum_to_if(), dev->actconfig is not NULL, but
> > > >>>>>> dev->actconfig->interface[i] is NULL
> > > >>>>>> - crash
> > > >>>>>>
> > > >>>>>> Question is what we can do about this. Checking if dev->state != USB_STATE_NOTATTACHED
> > > >>>>>> in usb_ifnum_to_if() might be a possible approach, but strictly speaking it would
> > > >>>>>> still be racy since there is still no lock against device removal. I have not tried
> > > >>>>>> calling usb_lock_device() in usb_set_interface() - would that possibly be an option ?
> > > >>>>>
> > > >>>>> As far as I know, protecting against these races is the responsibility
> > > >>>>> of the USB interface drivers. They must make sure that their disconnect
> > > >>>>> routines block until all outstanding calls to usb_set_interface return
> > > >>>>> (in fact, until all outstanding device accesses have finished).
> > > >>>>>
> > > >>>>> For instance, in the log extract you showed, it's obvious that the
> > > >>>>> uvc_start_streaming routine was running after the disconnect routine had
> > > >>>>> returned, which looks like a bug in itself: Once the disconnect routine
> > > >>>>> returns, the driver is not supposed to try to access the device at all
> > > >>>>> because some other driver may now be bound to it.
> > > >>>>>
> > > >>>>> We can't just call usb_lock_device from within usb_set_interface,
> > > >>>>> because usb_set_interface is often called with that lock already held.
> > > >>>>>
> > > >>>> I had a closer look into the uvcvideo driver and compared it to other usb
> > > >>>> drivers, including drivers in drivers/media/usb/ which connect to the video
> > > >>>> subsystem.
> > > >>>>
> > > >>>> The usbvideo driver lacks protection against calls to uvc_disconnect() while
> > > >>>
> > > >>> Are you confusing usbvideo and uvcvideo ? Both exist, and uvcvideo would
> > > >>> have been called usbvideo if the former hadn't already been in use.
> > > >>
> > > >> Yes, sorry :-(. I am not sure how s/uvc/usb/ happened.
> > > >
> > > > No worries.
> > > >
> > > >>>> calls into file operations are ongoing. This is pretty widespread, and not
> > > >>>> even limited to file operations (for example, there is a worker which is only
> > > >>>> canceled in uvc_delete, not in ucv_disconnect). The existing protection only
> > > >>>> ensures that no file operations are started after the call to ucv_disconnect,
> > > >>>> but that is insufficient.
> > > >>>>
> > > >>>> Other drivers do have that protection and make sure that no usb operations
> > > >>>> can happen after the disconnect call.
> > > >>>>
> > > >>>> The only remedy I can see is to rework the usbvideo driver and add the
> > > >>>> necessary protections. At first glance, it looks like this may be a
> > > >>>> substantial amount of work. I'd sign up for that, but before I start,
> > > >>>> I would like to get input from the usbvideo community. Is such an effort
> > > >>>> already going on ? If yes, how can I help ? If not, is the problem
> > > >>>> understood and accepted ? Are there any ideas on how to solve it ?
> > > >>>
> > > >>> This is something that has been discussed before, and needs to be solved
> > > >>> in the V4L2 framework itself, not in individual drivers. Not only would
> > > >>> this avoid rolling out the same code manually everywhere (in different
> > > >>> incorrect ways, as races are difficult to solve and implementations are
> > > >>> more often wrong than right), but it will also avoid similar issues for
> > > >>> non-USB devices.
> > > >>
> > > >> You mean code that ensures that no user-space v4l2 operation is in progress
> > > >> after video_device_unregister / v4l2_device_unregister return ? I agree,
> > > >> that would simplify the necessary changes on the uvc side.
> > > >
> > > > I was thinking about adding a new function to be called from the
> > > > disconnect handler to implement the wait on end of userspace access, but
> > > > video_device_unregister() seems an even better idea.
> > > > v4l2_device_unregister() is probably not very useful as v4l2_device
> > > > isn't exposed to userspace, only video_device is (and v4l2_subdev and
> > > > media_device, but that's a different story, although probably still an
> > > > issue for the latter in the UVC driver).
> > >
> > > Actually, all that is needed is to take the ioctl serialization lock in the disconnect
> > > function.
> >
> > It's not just ioctls though, the other file operations also need to be
> > handled (read, write, mmap).
> >
> > > See last paragraph in 1.4.1 here:
> > >
> > > https://hverkuil.home.xs4all.nl/spec/driver-api/v4l2-dev.html
> > >
> > > Since uvc uses its own lock, you need to take that one.
> >
> > Drivers that use their own lock do so to avoid serializing all ioctls.
> > This means that different ioctls may be covered by different locks
> > (possibly with part of some ioctls running without locking). I don't
> > think we can just dismiss the issue saying those drivers need to
> > implement the disconnection manually. It would be much better to
> > integrate handling of userspace access with video_device_unregister()
> > like proposed above, as that will work for all drivers in a transparent
> > way. It would also be fairly simple to implement in the V4L2 core.
>
> I don't think it is going to be that simple. Here is a traceback I was able
> to collect after playing with msleep() in various locations. This is
> _after_ uvc_disconnect() has returned, but before usb_disable_device()
> set dev->actconfig to NULL.
>
> [ 9619.003726] config->interface[0] is NULL
> [ 9619.009114] WARNING: CPU: 0 PID: 2757 at drivers/usb/core/usb.c:285 usb_ifnum_to_if+0x81/0x85
> ^^^ added log message and WARN() to prevent crash
>
> [ 9619.018647] Modules linked in: snd_seq_dummy snd_seq bridge stp llc veth tun nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp xfrm6_mode_tunnel xfrm6_mode_transport xfrm4_mode_tunnel xfrm4_mode_transport esp6 ah6 ip6t_REJECT ip6t_ipv6header wacom rfcomm cmac algif_hash algif_skcipher af_alg uinput snd_soc_sst_bxt_da7219_max98357a snd_soc_hdac_hdmi snd_soc_dmic snd_soc_skl_ssp_clk snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core snd_intel_nhlt snd_hda_core snd_soc_max98357a acpi_als kfifo_buf industrialio snd_soc_da7219 ipt_MASQUERADE fuse snd_usb_audio snd_usbmidi_lib snd_hwdep snd_rawmidi snd_seq_device uvcvideo videobuf2_v4l2 videobuf2_common videobuf2_vmalloc videobuf2_memops iwlmvm iwl7000_mac80211 lzo_rle lzo_compress
> [ 9619.097545] zram r8152 mii iwlwifi cfg80211 btusb btrtl btintel btbcm bluetooth ecdh_generic joydev
> [ 9619.107762] CPU: 0 PID: 2757 Comm: inotify_reader Not tainted 4.19.139 #20
> [ 9619.115443] Hardware name: Google Phaser/Phaser, BIOS Google_Phaser.10952.0.0 08/09/2018
> [ 9619.124490] RIP: 0010:usb_ifnum_to_if+0x81/0x85
> [ 9619.129542] Code: 02 44 39 f1 74 0a 48 ff c6 48 39 c6 72 df 31 db 48 89 d8 5b 41 5e 41 5f 5d c3 31 db 48 c7 c7 08 f8 cd a8 31 c0 e8 dd 1b 9f ff <0f> 0b eb e2 0f 1f 44 00 00 55 48 89 e5 44 8b 47 10 31 c0 45 85 c0
> [ 9619.150530] RSP: 0018:ffff9ee20141fa58 EFLAGS: 00010246
> [ 9619.156368] RAX: 438a0e5a525f1800 RBX: 0000000000000000 RCX: 0000000000000000
> [ 9619.164339] RDX: ffff975477a1de90 RSI: ffff975477a153d0 RDI: ffff975477a153d0
> [ 9619.172309] RBP: ffff9ee20141fa70 R08: 000000000000002c R09: 002daec189138d78
> [ 9619.180279] R10: 0000001000000000 R11: ffffffffa7da42e6 R12: ffff975459594800
> [ 9619.188241] R13: 0000000000000001 R14: 0000000000000001 R15: ffff975465376400
> [ 9619.196212] FS: 00007ddebffd6700(0000) GS:ffff975477a00000(0000) knlGS:0000000000000000
> [ 9619.205251] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9619.211670] CR2: 0000000012c83000 CR3: 00000001bbaf8000 CR4: 0000000000340ef0
> [ 9619.219641] Call Trace:
> [ 9619.222378] usb_set_interface+0x3b/0x2f3
> [ 9619.226862] uvc_video_stop_streaming+0x38/0x81 [uvcvideo]
> [ 9619.232983] uvc_stop_streaming+0x21/0x49 [uvcvideo]
> [ 9619.238531] __vb2_queue_cancel+0x33/0x249 [videobuf2_common]
> [ 9619.244951] vb2_core_queue_release+0x1c/0x45 [videobuf2_common]
> [ 9619.251662] uvc_queue_release+0x26/0x32 [uvcvideo]
> [ 9619.257114] uvc_v4l2_release+0x41/0xdd [uvcvideo]
> [ 9619.262469] v4l2_release+0x99/0xed
> [ 9619.266367] __fput+0xf0/0x1ea
> [ 9619.269778] task_work_run+0x7f/0xa7
> [ 9619.273769] do_exit+0x1d1/0x6eb
> [ 9619.277375] do_group_exit+0x9d/0xac
> [ 9619.281367] get_signal+0x135/0x482
> [ 9619.285262] do_signal+0x4a/0x63c
> [ 9619.288965] exit_to_usermode_loop+0x86/0xa5
> [ 9619.293732] do_syscall_64+0x171/0x269
> [ 9619.297919] ? __do_page_fault+0x272/0x474
> [ 9619.302495] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9619.308156] RIP: 0033:0x7ddec156dc26
> [ 9619.312161] Code: Bad RIP value.
> [ 9619.315763] RSP: 002b:00007ddebffd5c10 EFLAGS: 00000293 ORIG_RAX: 0000000000000017
> [ 9619.324221] RAX: fffffffffffffdfe RBX: 000000000000000a RCX: 00007ddec156dc26
> [ 9619.332194] RDX: 0000000000000000 RSI: 00007ddebffd5d28 RDI: 000000000000000a
> [ 9619.340166] RBP: 00007ddebffd5c50 R08: 0000000000000000 R09: 0000000000000000
> [ 9619.348156] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ddebffd5d28
> [ 9619.356127] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> usb_set_interface() should not be called anymore after uvc_disconnect(),
> or at east I think so (is that documented anywhere ?).
> Yet, that obviously happens, and it happens completely outside lock
> control. And this is just one instance where I was actually able
> to see the problem. I am quite sure that there are more.
Let's hope there are not too many :-) As you can see from the stack
trace, this happens at .release() (a.k.a. last close()) time. This code
path is the only one that the V4L2 core can't protect fully for drivers.
The good news is that there's thus only one code path that drivers would
need to handle manually.
I think we could fix this one by calling uvc_queue_release() in
uvc_disconnect(), after unregistering the video devices. The
uvc_queue_release() call in the .release() path would then become a
no-op, as everything will be stopped already.
> > > > We also have a v4l2_device_disconnect() function which is supposed to
> > > > handle hot-pluggable device disconnection, but it's fairly useless (I'd
> > > > even say harmful as it gives the illusion that hotplugging is correctly
> > > > handled, while in reality the media subsystem is plagged by hot-unplug
> > > > issues :-S).
> > >
> > > The v4l2_device_disconnect() is there to remove a v4l2_dev reference to
> > > the device that is about to be removed when the disconnect() exists.
> > > Otherwise v4l2_dev->dev would point to a missing device.
> > >
> > > However, I wonder if it is still needed: commit 236c5441d703 from 2011 added
> > > code to take a reference to v4l2_dev->dev in v4l2_device_register(). This
> > > should prevent the device from disappearing until v4l2_device_unregister() is
> > > called. I suspect that v4l2_device_disconnect() can be removed completely, and
> > > instead v4l2_device_unregister() just calls put_device(v4l2_dev->dev).
> > >
> > > I don't like v4l2_device_disconnect() either, so if this works, then that would
> > > be a nice simplification.
>
> Maybe I am missing something, but v4l2_device_unregister() already calls
> v4l2_device_disconnect(). With that, I don't see a difference to just calling
> v4l2_device_unregister().
>
> Anyway, I'll keep digging.
--
Regards,
Laurent Pinchart
On 8/20/20 3:15 AM, Laurent Pinchart wrote:
[ ... ]
>> usb_set_interface() should not be called anymore after uvc_disconnect(),
>> or at east I think so (is that documented anywhere ?).
>> Yet, that obviously happens, and it happens completely outside lock
>> control. And this is just one instance where I was actually able
>> to see the problem. I am quite sure that there are more.
>
> Let's hope there are not too many :-) As you can see from the stack
> trace, this happens at .release() (a.k.a. last close()) time. This code
> path is the only one that the V4L2 core can't protect fully for drivers.
> The good news is that there's thus only one code path that drivers would
> need to handle manually.
>
> I think we could fix this one by calling uvc_queue_release() in
> uvc_disconnect(), after unregistering the video devices. The
> uvc_queue_release() call in the .release() path would then become a
> no-op, as everything will be stopped already.
>
I'll give that a try.
I plan to add some instrumentation into the code to check if there are any
other usb calls after uvc_disconnect(). We'll see if that was the only one,
or if I can catch more.
Thanks,
Guenter