LinuxLists.cc - usbip: somtimes stalls at kernel

2010-12-13 22:36:58

Subject: usbip: somtimes stalls at kernel_recvmsg()

Hi,

I'm working with usbip and I sometimes see a stall when I run
the "lsusb" command from the userspace. I added some debug messages
and it seems that the kernel_recvmsg() in
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/staging/usbip/usbip_common.c;h=210ef16bab8d271a52e5d36cd1994aad57ad99e1;hb=HEAD

This is the only place I could find where the TCP messages are arriving in
the usbip code.

What happens if a message does not arrive? Does it stall forever? If
yes, how can the kernel_recvmsg() call changed to handle some timeout?

Márton Németh

2010-12-16 07:31:41

by Németh Márton

[permalink] [raw]

Subject: Re: usbip: somtimes stalls at kernel_recvmsg()

Németh Márton wrote:
> I'm working with usbip and I sometimes see a stall when I run
> the "lsusb" command from the userspace. I added some debug messages
> and it seems that the kernel_recvmsg() in
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/staging/usbip/usbip_common.c;h=210ef16bab8d271a52e5d36cd1994aad57ad99e1;hb=HEAD
>
> This is the only place I could find where the TCP messages are arriving in
> the usbip code.
>
> What happens if a message does not arrive? Does it stall forever? If
> yes, how can the kernel_recvmsg() call changed to handle some timeout?

I found that the userspace manpage of recvmsg(2) ("man recvmsg") contains description
of the "flags" parameter. I suppose the parameters and behaviour of the userspace
recvmsg() is the same as the kernelspace kernel_recvmsg().

Márton Németh

2010-12-16 22:57:18

by Max Vozeler

[permalink] [raw]

Subject: Re: usbip: somtimes stalls at kernel_recvmsg()

Hi Németh,

On 16.12.2010 08:31, Németh Márton wrote:
> Németh Márton wrote:
>> I'm working with usbip and I sometimes see a stall when I run
>> the "lsusb" command from the userspace.

Does it eventually recover?

>> I added some debug messages
>> and it seems that the kernel_recvmsg() in
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/staging/usbip/usbip_common.c;h=210ef16bab8d271a52e5d36cd1994aad57ad99e1;hb=HEAD
>>
>> This is the only place I could find where the TCP messages are arriving in
>> the usbip code.
>>
>> What happens if a message does not arrive? Does it stall forever?

Yes, it will block until detached or until a TCP
timeout or error closes the connection.

The TCP timeout can take several minutes.

>> yes, how can the kernel_recvmsg() call changed to handle some timeout?
>
> I found that the userspace manpage of recvmsg(2) ("man recvmsg") contains description
> of the "flags" parameter. I suppose the parameters and behaviour of the userspace
> recvmsg() is the same as the kernelspace kernel_recvmsg().

I recently faced this problem, too.

The solution I arrived at was to set SO_RCVTIMEO
and SO_SNDTIMEO socket opts in the tools together
with the patch below.

The patch works around a lack of heart-beat in
the usbip protocol which would otherwise make idle
connections time out.

(It wont apply cleanly but hopefully conveys the idea.).

Extending the protocol for a heart-beat message
doesn't seem to be possible without breaking the
compatibility at the same time.

I was planning to submit it on the weekend along
with the tool changes to set the timeouts.

Max

---
From: Max Vozeler <[email protected]>
Date: Mon, 13 Dec 2010 00:39:14 +0100
Subject: [PATCH 1/3] vhci: allow SO_RCVTIMEO on the socket

In case of unanswered replies, a receive timeout is
considered a connection error and the device will be
shut down and removed.

This is a workaround for the lack of heart-beat
messages in the USBIP protocol. It allows userspace
to set a maximum latency for the connection.
---
vhci_rx.c | 21 ++++++++++++++++++++-
1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/vhci_rx.c b/vhci_rx.c
index bc16dc4..9a1fe80 100644
--- a/vhci_rx.c
+++ b/vhci_rx.c
@@ -197,6 +197,19 @@ static void vhci_recv_ret_unlink(struct vhci_device *vdev,
return;
}

+static int vhci_priv_tx_empty(struct vhci_device *vdev)
+{
+ int empty = 0;
+
+ spin_lock(&vdev->priv_lock);
+
+ empty = list_empty(&vdev->priv_rx);
+
+ spin_unlock(&vdev->priv_lock);
+
+ return empty;
+}
+
/* recv a pdu */
static void vhci_rx_pdu(struct usbip_device *ud)
{
@@ -216,8 +229,14 @@ static void vhci_rx_pdu(struct usbip_device *ud)
usbip_uinfo("connection reset by peer\n");
else if (ret == -ETIMEDOUT)
usbip_uinfo("connection timed out\n");
- else if (ret != -ERESTARTSYS)
+ else if (ret == -EAGAIN) {
+ /* connection was idle? */
+ if (vhci_priv_tx_empty(vdev))
+ return;
+ usbip_uinfo("connection timed out with pending urbs\n");
+ } else if (ret != -ERESTARTSYS)
usbip_uinfo("xmit failed %d\n", ret);
+
usbip_event_add(ud, VDEV_EVENT_ERROR_TCP);
return;
}
--
1.7.2.3

2010-12-17 05:45:56

by Németh Márton

[permalink] [raw]

Subject: Re: usbip: sometimes stalls at kernel_recvmsg()

Hello Max,

Max Vozeler wrote:
> Hi Németh,
>
> On 16.12.2010 08:31, Németh Márton wrote:
>> Németh Márton wrote:
>>> I'm working with usbip and I sometimes see a stall when I run
>>> the "lsusb" command from the userspace.
>
> Does it eventually recover?

No, it doesn't. After 120 seconds messages are printed in dmesg
that the "lsusb" process is blocked more than 120 seconds.

>>> I added some debug messages
>>> and it seems that the kernel_recvmsg() in
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/staging/usbip/usbip_common.c;h=210ef16bab8d271a52e5d36cd1994aad57ad99e1;hb=HEAD
>>>
>>> This is the only place I could find where the TCP messages are arriving in
>>> the usbip code.
>>>
>>> What happens if a message does not arrive? Does it stall forever?
>
> Yes, it will block until detached or until a TCP
> timeout or error closes the connection.
>
> The TCP timeout can take several minutes.

What I don't really understand is that how is it possible that
a packet from TCP communication is lost? TCP resends the lost
packets automatically. In my case I run both the usbip server
and client on the same machine using the host name "localhost".
So I assume that there might be a protocol handshake problem
here.

There are, however cases when the timeout is needed, e.g. when
the usbip server and client are physically disconnected so the
TCP packet resend is no more possible.

Márton Németh

2010-12-20 22:22:34

by Max Vozeler

[permalink] [raw]

Subject: Re: usbip: sometimes stalls at kernel_recvmsg()

Hi Németh,

On 17.12.2010 06:45, Németh Márton wrote:
>>> Németh Márton wrote:
>>>> I'm working with usbip and I sometimes see a stall when I run
>>>> the "lsusb" command from the userspace.
>>
>> Does it eventually recover?
>
> No, it doesn't. After 120 seconds messages are printed in dmesg
> that the "lsusb" process is blocked more than 120 seconds.

Can you describe the sequence of events which
happened before the hang?

Was the device detached before or during lsusb?

Perhaps try echo t > /proc/sysrq-trigger to see
where exactly lsusb gets stuck.

I found processes can get stuck in usb_kill_urb
if they tried to unlink an URB, but the unlink
request was not answered before detach.

Perhaps this is related. I am attaching a patch
which fixes that bug for me, perhaps you could
try if it makes a difference?

>>>> I added some debug messages
>>>> and it seems that the kernel_recvmsg() in
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/staging/usbip/usbip_common.c;h=210ef16bab8d271a52e5d36cd1994aad57ad99e1;hb=HEAD
>>>>
>>>> This is the only place I could find where the TCP messages are arriving in
>>>> the usbip code.
>>>>
>>>> What happens if a message does not arrive? Does it stall forever?
>>
>> Yes, it will block until detached or until a TCP
>> timeout or error closes the connection.
>>
>> The TCP timeout can take several minutes.
>
> What I don't really understand is that how is it possible that
> a packet from TCP communication is lost? TCP resends the lost
> packets automatically. In my case I run both the usbip server
> and client on the same machine using the host name "localhost".
> So I assume that there might be a protocol handshake problem
> here.

Agreed, it doesn't seem like a TCP error is
causing the hang in this case.

Max

--
Max Vozeler - Open Source Consulting and Development, http://vozeler.com
Post address: Max Vozeler, Lauteschlaegerstr. 8, 64289 Darmstadt, Germany
Phone: +49-6151-608186-0, Fax: +49-6151-608186-9, Mobile: +49-176-62450045
OpenPGP: 4096R/A3DBC803 - E37CFDA9A29DFF71E1D5 47A679F30022A3DBC803

Attachments:

0001-vhci-give-back-URBs-from-in-flight-unlink-requests.patch (1.89 kB)

2010-12-21 08:03:19

by Németh Márton

[permalink] [raw]

Subject: Re: usbip: sometimes stalls at kernel_recvmsg()

Hello Max,
Max Vozeler wrote:
> Hi Németh,
>
> On 17.12.2010 06:45, Németh Márton wrote:
>>>> Németh Márton wrote:
>>>>> I'm working with usbip and I sometimes see a stall when I run
>>>>> the "lsusb" command from the userspace.
>>> Does it eventually recover?
>> No, it doesn't. After 120 seconds messages are printed in dmesg
>> that the "lsusb" process is blocked more than 120 seconds.
>
> Can you describe the sequence of events which
> happened before the hang?

Boot EeePC 901. Attach hama AC-150 webcam. Execute the following
commands as root:

echo 0 >/sys/devices/platform/eeepc/camera
insmod /usr/src/linux/drivers/staging/usbip/usbip_common_mod.ko
insmod /usr/src/linux/drivers/staging/usbip/usbip.ko
usbip_bind_driver --list
usbip_bind_driver --usbip 3-2
usbipd

On a second xterm window execute the following commands:
insmod /usr/src/linux/drivers/staging/usbip/vhci-hcd.ko
lsusb
usbip --list localhost
usbip -a localhost 3-2
usbip -p
strace lsusb

Note that the hama AC-150 webcam has a snapshot button on it which
uses the interrupt endpoint. The interrupt URB is sent in when the
gspca_sonixj driver activates in order to catch the button events.
The hang maybe related, I don't know.

>
> Was the device detached before or during lsusb?
>
> Perhaps try echo t > /proc/sysrq-trigger to see
> where exactly lsusb gets stuck.
>
> I found processes can get stuck in usb_kill_urb
> if they tried to unlink an URB, but the unlink
> request was not answered before detach.
>
> Perhaps this is related. I am attaching a patch
> which fixes that bug for me, perhaps you could
> try if it makes a difference?

I'll try your patch later, thanks.

>
>>>>> I added some debug messages
>>>>> and it seems that the kernel_recvmsg() in
>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/staging/usbip/usbip_common.c;h=210ef16bab8d271a52e5d36cd1994aad57ad99e1;hb=HEAD
>>>>>
>>>>> This is the only place I could find where the TCP messages are arriving in
>>>>> the usbip code.
>>>>>
>>>>> What happens if a message does not arrive? Does it stall forever?
>>> Yes, it will block until detached or until a TCP
>>> timeout or error closes the connection.
>>>
>>> The TCP timeout can take several minutes.
>> What I don't really understand is that how is it possible that
>> a packet from TCP communication is lost? TCP resends the lost
>> packets automatically. In my case I run both the usbip server
>> and client on the same machine using the host name "localhost".
>> So I assume that there might be a protocol handshake problem
>> here.
>
> Agreed, it doesn't seem like a TCP error is
> causing the hang in this case.
>
> Max
>
>

2010-12-22 19:50:04

by Németh Márton

[permalink] [raw]

Subject: Re: usbip: sometimes stalls at kernel_recvmsg()

Németh Márton wrote:
> Hello Max,
> Max Vozeler wrote:
>> Hi Németh,
>>
>> On 17.12.2010 06:45, Németh Márton wrote:
>>>>> Németh Márton wrote:
>>>>>> I'm working with usbip and I sometimes see a stall when I run
>>>>>> the "lsusb" command from the userspace.
>>>> Does it eventually recover?
>>> No, it doesn't. After 120 seconds messages are printed in dmesg
>>> that the "lsusb" process is blocked more than 120 seconds.
>> Can you describe the sequence of events which
>> happened before the hang?
>
> Boot EeePC 901. Attach hama AC-150 webcam. Execute the following
> commands as root:
>
> echo 0 >/sys/devices/platform/eeepc/camera
> insmod /usr/src/linux/drivers/staging/usbip/usbip_common_mod.ko
> insmod /usr/src/linux/drivers/staging/usbip/usbip.ko
> usbip_bind_driver --list
> usbip_bind_driver --usbip 3-2
> usbipd
>
> On a second xterm window execute the following commands:
> insmod /usr/src/linux/drivers/staging/usbip/vhci-hcd.ko
> lsusb
> usbip --list localhost
> usbip -a localhost 3-2
> usbip -p
> strace lsusb
>
> Note that the hama AC-150 webcam has a snapshot button on it which
> uses the interrupt endpoint. The interrupt URB is sent in when the
> gspca_sonixj driver activates in order to catch the button events.
> The hang maybe related, I don't know.
>
>> Was the device detached before or during lsusb?
>>
>> Perhaps try echo t > /proc/sysrq-trigger to see
>> where exactly lsusb gets stuck.
>>
>> I found processes can get stuck in usb_kill_urb
>> if they tried to unlink an URB, but the unlink
>> request was not answered before detach.
>>
>> Perhaps this is related. I am attaching a patch
>> which fixes that bug for me, perhaps you could
>> try if it makes a difference?
>
> I'll try your patch later, thanks.

I have tried your patch with a little modification to access
the symbol pickup_urb_and_free_priv, for details see the patch
later in this mail.

Unfortunately "strace lsusb" still freezes. The last line shown
by strace is:

open("/dev/bus/usb/006/001", O_RDWR

where 006 is the bus ID of the virtual USB bus, the 001 is the attached
virtual USB device.

At this time the list of virtual USB port status is as follows:

# usbip -p
8 ports available

Port 00: <Port Initializing>
Port 01: <Port Available>
Port 02: <Port Available>
Port 03: <Port Available>
Port 04: <Port Available>
Port 05: <Port Available>
Port 06: <Port Available>
Port 07: <Port Available>

Márton Németh

---
diff -upr linux-2.6.37-rc5-orig/drivers/staging/usbip/vhci_hcd.c linux-2.6.37-rc5/drivers/staging/usbip/vhci_hcd.c
--- linux-2.6.37-rc5-orig/drivers/staging/usbip/vhci_hcd.c 2010-12-14 20:17:16.000000000 +0100
+++ linux-2.6.37-rc5/drivers/staging/usbip/vhci_hcd.c 2010-12-22 20:12:21.000000000 +0100
@@ -805,6 +805,8 @@ static int vhci_urb_dequeue(struct usb_h
return 0;
}

+struct urb *pickup_urb_and_free_priv(struct vhci_device *vdev,
+ __u32 seqnum);

static void vhci_device_unlink_cleanup(struct vhci_device *vdev)
{
@@ -813,11 +815,34 @@ static void vhci_device_unlink_cleanup(s
spin_lock(&vdev->priv_lock);

list_for_each_entry_safe(unlink, tmp, &vdev->unlink_tx, list) {
+ usbip_uinfo("unlink cleanup tx %lu\n", unlink->unlink_seqnum);
list_del(&unlink->list);
kfree(unlink);
}

list_for_each_entry_safe(unlink, tmp, &vdev->unlink_rx, list) {
+ struct urb *urb;
+
+ /* give back URB of unanswered unlink request */
+ usbip_uinfo("unlink cleanup rx %lu\n", unlink->unlink_seqnum);
+
+ urb = pickup_urb_and_free_priv(vdev, unlink->unlink_seqnum);
+ if (!urb) {
+ usbip_uinfo("the urb (seqnum %lu) was already given back\n",
+ unlink->unlink_seqnum);
+ list_del(&unlink->list);
+ kfree(unlink);
+ continue;
+ }
+
+ urb->status = -ENODEV;
+
+ spin_lock(&the_controller->lock);
+ usb_hcd_unlink_urb_from_ep(vhci_to_hcd(the_controller), urb);
+ spin_unlock(&the_controller->lock);
+
+ usb_hcd_giveback_urb(vhci_to_hcd(the_controller), urb, urb->status);
+
list_del(&unlink->list);
kfree(unlink);
}
diff -upr linux-2.6.37-rc5-orig/drivers/staging/usbip/vhci_rx.c linux-2.6.37-rc5/drivers/staging/usbip/vhci_rx.c
--- linux-2.6.37-rc5-orig/drivers/staging/usbip/vhci_rx.c 2010-10-20 22:30:22.000000000 +0200
+++ linux-2.6.37-rc5/drivers/staging/usbip/vhci_rx.c 2010-12-22 20:12:05.000000000 +0100
@@ -24,7 +24,7 @@

/* get URB from transmitted urb queue */
-static struct urb *pickup_urb_and_free_priv(struct vhci_device *vdev,
+struct urb *pickup_urb_and_free_priv(struct vhci_device *vdev,
__u32 seqnum)
{
struct vhci_priv *priv, *tmp;