Hey everybody,
This series introduces datagrams, packet scheduling, and sk_buff usage
to virtio vsock.
The usage of struct sk_buff benefits users by a) preparing vsock to use
other related systems that require sk_buff, such as sockmap and qdisc,
b) supporting basic congestion control via sock_alloc_send_skb, and c)
reducing copying when delivering packets to TAP.
The socket layer no longer forces errors to -ENOMEM: userspace
typically expects -EAGAIN when the sk_sndbuf threshold is reached and
messages are sent with the MSG_DONTWAIT flag.
The datagram work is based on previous patches by Jiang Wang[1].
The introduction of datagrams creates a transport layer fairness issue
where datagrams may freely starve streams of queue access. This happens
because, unlike streams, datagrams lack the transactions necessary for
calculating credits and throttling.
Previous proposals introduced changes to the spec to add an additional
virtqueue pair for datagrams[1]. Although this solution works, using
Linux's qdisc for packet scheduling leverages already existing systems,
avoids the need to change the virtio specification, and gives additional
capabilities. The usage of SFQ or fq_codel, for example, may solve the
transport layer starvation problem. It is easy to imagine other use
cases as well. For example, services of varying importance may be
assigned different priorities, and qdisc will apply appropriate
priority-based scheduling. By default, the pfifo qdisc is used. The
qdisc may be bypassed, resuming legacy queuing, by simply setting the
virtio-vsock%d network device to state DOWN. This technique still
allows vsock to work with zero configuration.
In summary, this series introduces these major changes to vsock:
- virtio vsock supports datagrams
- virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
- Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
which applies the throttling threshold sk_sndbuf.
- The vsock socket layer supports returning errors other than -ENOMEM.
- This is used to return -EAGAIN when the sk_sndbuf threshold is
reached.
- virtio vsock uses a net_device, through which qdisc may be used.
- qdisc allows scheduling policies to be applied to vsock flows.
- Some qdiscs, like SFQ, may allow vsock to avoid transport layer
starvation. That is, they may prevent datagrams from crowding out
stream flows, so additional virtqueues are not needed for datagrams.
- The net_device and qdisc are bypassed by simply setting the
net_device state to DOWN.
[1]: https://lore.kernel.org/all/[email protected]/
Bobby Eshleman (5):
vsock: replace virtio_vsock_pkt with sk_buff
vsock: return errors other than -ENOMEM to socket
vsock: add netdev to vhost/virtio vsock
virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
virtio/vsock: add support for dgram
Jiang Wang (1):
vsock_test: add tests for vsock dgram
drivers/vhost/vsock.c | 238 ++++----
include/linux/virtio_vsock.h | 73 ++-
include/net/af_vsock.h | 2 +
include/uapi/linux/virtio_vsock.h | 2 +
net/vmw_vsock/af_vsock.c | 30 +-
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 237 +++++---
net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
net/vmw_vsock/vmci_transport.c | 9 +-
net/vmw_vsock/vsock_loopback.c | 51 +-
tools/testing/vsock/util.c | 105 ++++
tools/testing/vsock/util.h | 4 +
tools/testing/vsock/vsock_test.c | 195 ++++++
13 files changed, 1176 insertions(+), 543 deletions(-)
--
2.35.1
This patch adds datagram support to the virtio and vhost vsock transports.
Signed-off-by: Jiang Wang <[email protected]>
Signed-off-by: Bobby Eshleman <[email protected]>
---
drivers/vhost/vsock.c | 2 +-
include/net/af_vsock.h | 2 +
include/uapi/linux/virtio_vsock.h | 1 +
net/vmw_vsock/af_vsock.c | 26 +++-
net/vmw_vsock/virtio_transport.c | 2 +-
net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
6 files changed, 186 insertions(+), 20 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a5d1bdb786fe..3dc72a5647ca 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
int ret;
ret = vsock_core_register(&vhost_transport.transport,
- VSOCK_TRANSPORT_F_H2G);
+ VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
if (ret < 0)
return ret;
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 1c53c4c4d88f..37e55c81e4df 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -78,6 +78,8 @@ struct vsock_sock {
s64 vsock_stream_has_data(struct vsock_sock *vsk);
s64 vsock_stream_has_space(struct vsock_sock *vsk);
struct sock *vsock_create_connected(struct sock *parent);
+int vsock_bind_stream(struct vsock_sock *vsk,
+ struct sockaddr_vm *addr);
/**** TRANSPORT ****/
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 857df3a3a70d..0975b9c88292 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
enum virtio_vsock_type {
VIRTIO_VSOCK_TYPE_STREAM = 1,
VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
+ VIRTIO_VSOCK_TYPE_DGRAM = 3,
};
enum virtio_vsock_op {
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 1893f8aafa48..87e4ae1866d3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
return 0;
}
+int vsock_bind_stream(struct vsock_sock *vsk,
+ struct sockaddr_vm *addr)
+{
+ int retval;
+
+ spin_lock_bh(&vsock_table_lock);
+ retval = __vsock_bind_connectible(vsk, addr);
+ spin_unlock_bh(&vsock_table_lock);
+
+ return retval;
+}
+EXPORT_SYMBOL(vsock_bind_stream);
+
static int __vsock_bind_dgram(struct vsock_sock *vsk,
struct sockaddr_vm *addr)
{
@@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
}
if (features & VSOCK_TRANSPORT_F_DGRAM) {
- if (t_dgram) {
- err = -EBUSY;
- goto err_busy;
+ /* TODO: always choose the G2H variant over others; support nesting later */
+ if (features & VSOCK_TRANSPORT_F_G2H) {
+ if (t_dgram)
+ pr_warn("virtio_vsock: t_dgram already set\n");
+ t_dgram = t;
+ }
+
+ if (!t_dgram) {
+ t_dgram = t;
}
- t_dgram = t;
}
if (features & VSOCK_TRANSPORT_F_LOCAL) {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 073314312683..d4526ca462d2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
return -ENOMEM;
ret = vsock_core_register(&virtio_transport.transport,
- VSOCK_TRANSPORT_F_G2H);
+ VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
if (ret)
goto out_wq;
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index bdf16fff054f..aedb48728677 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
static u16 virtio_transport_get_type(struct sock *sk)
{
- if (sk->sk_type == SOCK_STREAM)
+ if (sk->sk_type == SOCK_DGRAM)
+ return VIRTIO_VSOCK_TYPE_DGRAM;
+ else if (sk->sk_type == SOCK_STREAM)
return VIRTIO_VSOCK_TYPE_STREAM;
else
return VIRTIO_VSOCK_TYPE_SEQPACKET;
@@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
vvs = vsk->trans;
/* we can send less than pkt_len bytes */
- if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
- pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+ if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+ if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
+ pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+ else
+ return 0;
+ }
- /* virtio_transport_get_credit might return less than pkt_len credit */
- pkt_len = virtio_transport_get_credit(vvs, pkt_len);
+ if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
+ /* virtio_transport_get_credit might return less than pkt_len credit */
+ pkt_len = virtio_transport_get_credit(vvs, pkt_len);
- /* Do not send zero length OP_RW pkt */
- if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
- return pkt_len;
+ /* Do not send zero length OP_RW pkt */
+ if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+ return pkt_len;
+ }
skb = virtio_transport_alloc_skb(info, pkt_len,
src_cid, src_port,
dst_cid, dst_port,
&err);
if (!skb) {
- virtio_transport_put_credit(vvs, pkt_len);
+ if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
+ virtio_transport_put_credit(vvs, pkt_len);
return err;
}
@@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
}
EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
+static ssize_t
+virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
+ struct msghdr *msg, size_t len)
+{
+ struct virtio_vsock_sock *vvs = vsk->trans;
+ struct sk_buff *skb;
+ size_t total = 0;
+ u32 free_space;
+ int err = -EFAULT;
+
+ spin_lock_bh(&vvs->rx_lock);
+ if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
+ skb = __skb_dequeue(&vvs->rx_queue);
+
+ total = len;
+ if (total > skb->len - vsock_metadata(skb)->off)
+ total = skb->len - vsock_metadata(skb)->off;
+ else if (total < skb->len - vsock_metadata(skb)->off)
+ msg->msg_flags |= MSG_TRUNC;
+
+ /* sk_lock is held by caller so no one else can dequeue.
+ * Unlock rx_lock since memcpy_to_msg() may sleep.
+ */
+ spin_unlock_bh(&vvs->rx_lock);
+
+ err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
+ if (err)
+ return err;
+
+ spin_lock_bh(&vvs->rx_lock);
+
+ virtio_transport_dec_rx_pkt(vvs, skb);
+ if (msg->msg_name) {
+ /* Provide the address of the sender before freeing the skb. */
+ DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
+
+ vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
+ le32_to_cpu(vsock_hdr(skb)->src_port));
+ msg->msg_namelen = sizeof(*vm_addr);
+ }
+
+ consume_skb(skb);
+ }
+
+ free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
+
+ spin_unlock_bh(&vvs->rx_lock);
+ return total;
+}
+
+static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
+{
+ return virtio_transport_stream_has_data(vsk);
+}
+
int
virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
struct msghdr *msg,
@@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t len, int flags)
{
- return -EOPNOTSUPP;
+ struct sock *sk;
+ int err = 0;
+ long timeout;
+
+ DEFINE_WAIT(wait);
+
+ sk = &vsk->sk;
+ err = 0;
+
+ if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
+ return -EOPNOTSUPP;
+
+ lock_sock(sk);
+
+ if (!len)
+ goto out;
+
+ timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+ while (1) {
+ s64 ready;
+
+ prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ ready = virtio_transport_dgram_has_data(vsk);
+
+ if (ready == 0) {
+ if (timeout == 0) {
+ err = -EAGAIN;
+ finish_wait(sk_sleep(sk), &wait);
+ break;
+ }
+
+ release_sock(sk);
+ timeout = schedule_timeout(timeout);
+ lock_sock(sk);
+
+ if (signal_pending(current)) {
+ err = sock_intr_errno(timeout);
+ finish_wait(sk_sleep(sk), &wait);
+ break;
+ } else if (timeout == 0) {
+ err = -EAGAIN;
+ finish_wait(sk_sleep(sk), &wait);
+ break;
+ }
+ } else {
+ finish_wait(sk_sleep(sk), &wait);
+
+ if (ready < 0) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
+ break;
+ }
+ }
+out:
+ release_sock(sk);
+ return err;
}
EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
@@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr)
{
- return -EOPNOTSUPP;
+ return vsock_bind_stream(vsk, addr);
}
EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
bool virtio_transport_dgram_allow(u32 cid, u32 port)
{
- return false;
+ return true;
}
EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
@@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t dgram_len)
{
- return -EOPNOTSUPP;
+ struct virtio_vsock_pkt_info info = {
+ .op = VIRTIO_VSOCK_OP_RW,
+ .msg = msg,
+ .pkt_len = dgram_len,
+ .vsk = vsk,
+ .remote_cid = remote_addr->svm_cid,
+ .remote_port = remote_addr->svm_port,
+ };
+
+ return virtio_transport_send_pkt_info(vsk, &info);
}
EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
@@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
int err = 0;
+ if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
+ virtio_transport_recv_enqueue(vsk, skb);
+ sk->sk_data_ready(sk);
+ return err;
+ }
+
switch (le16_to_cpu(hdr->op)) {
case VIRTIO_VSOCK_OP_RW:
virtio_transport_recv_enqueue(vsk, skb);
@@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
static bool virtio_transport_valid_type(u16 type)
{
return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
- (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
+ (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
+ (type == VIRTIO_VSOCK_TYPE_DGRAM);
}
/* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
@@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
goto free_pkt;
}
+ if (sk->sk_type == SOCK_DGRAM) {
+ virtio_transport_recv_connected(sk, skb);
+ goto out;
+ }
+
space_available = virtio_transport_space_update(sk, skb);
/* Update CID in case it has changed after a transport reset event */
@@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
break;
}
+out:
release_sock(sk);
/* Release refcnt obtained when we fetched this socket out of the
--
2.35.1
This commit adds a feature bit for virtio vsock to support datagrams.
Signed-off-by: Jiang Wang <[email protected]>
Signed-off-by: Bobby Eshleman <[email protected]>
---
drivers/vhost/vsock.c | 3 ++-
include/uapi/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport.c | 8 ++++++--
3 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index b20ddec2664b..a5d1bdb786fe 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -32,7 +32,8 @@
enum {
VHOST_VSOCK_FEATURES = VHOST_FEATURES |
(1ULL << VIRTIO_F_ACCESS_PLATFORM) |
- (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
+ (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
+ (1ULL << VIRTIO_VSOCK_F_DGRAM)
};
enum {
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..857df3a3a70d 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -40,6 +40,7 @@
/* The feature bitmap for virtio vsock */
#define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
+#define VIRTIO_VSOCK_F_DGRAM 2 /* SOCK_DGRAM supported */
struct virtio_vsock_config {
__le64 guest_cid;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c6212eb38d3c..073314312683 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
struct virtio_vsock {
struct virtio_device *vdev;
struct virtqueue *vqs[VSOCK_VQ_MAX];
+ bool has_dgram;
/* Virtqueue processing is deferred to a workqueue */
struct work_struct tx_work;
@@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
}
vsock->vdev = vdev;
-
vsock->rx_buf_nr = 0;
vsock->rx_buf_max_nr = 0;
atomic_set(&vsock->queued_replies, 0);
@@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
vsock->seqpacket_allow = true;
+ if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
+ vsock->has_dgram = true;
+
vdev->priv = vsock;
ret = virtio_vsock_vqs_init(vsock);
@@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
};
static unsigned int features[] = {
- VIRTIO_VSOCK_F_SEQPACKET
+ VIRTIO_VSOCK_F_SEQPACKET,
+ VIRTIO_VSOCK_F_DGRAM
};
static struct virtio_driver virtio_vsock_driver = {
--
2.35.1
This patch replaces virtio_vsock_pkt with sk_buff.
The benefits of this change include:
* The bug reported @ https://bugzilla.redhat.com/show_bug.cgi?id=2009935
does not present itself when reasonable sk_sndbuf thresholds are set.
* Using sock_alloc_send_skb() teaches VSOCK to respect
sk_sndbuf for tunability.
* Eliminates copying for vsock_deliver_tap().
* sk_buff is required for future improvements, such as using socket map.
Signed-off-by: Bobby Eshleman <[email protected]>
---
drivers/vhost/vsock.c | 214 +++++------
include/linux/virtio_vsock.h | 60 ++-
net/vmw_vsock/af_vsock.c | 1 +
net/vmw_vsock/virtio_transport.c | 212 +++++-----
net/vmw_vsock/virtio_transport_common.c | 491 ++++++++++++------------
net/vmw_vsock/vsock_loopback.c | 51 +--
6 files changed, 517 insertions(+), 512 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 368330417bde..f8601d93d94d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -51,8 +51,7 @@ struct vhost_vsock {
struct hlist_node hash;
struct vhost_work send_pkt_work;
- spinlock_t send_pkt_list_lock;
- struct list_head send_pkt_list; /* host->guest pending packets */
+ struct sk_buff_head send_pkt_queue; /* host->guest pending packets */
atomic_t queued_replies;
@@ -108,7 +107,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_disable_notify(&vsock->dev, vq);
do {
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
+ struct virtio_vsock_hdr *hdr;
struct iov_iter iov_iter;
unsigned out, in;
size_t nbytes;
@@ -116,31 +116,22 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
int head;
u32 flags_to_restore = 0;
- spin_lock_bh(&vsock->send_pkt_list_lock);
- if (list_empty(&vsock->send_pkt_list)) {
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ skb = skb_dequeue(&vsock->send_pkt_queue);
+
+ if (!skb) {
vhost_enable_notify(&vsock->dev, vq);
break;
}
- pkt = list_first_entry(&vsock->send_pkt_list,
- struct virtio_vsock_pkt, list);
- list_del_init(&pkt->list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
-
head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
&out, &in, NULL, NULL);
if (head < 0) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ skb_queue_head(&vsock->send_pkt_queue, skb);
break;
}
if (head == vq->num) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ skb_queue_head(&vsock->send_pkt_queue, skb);
/* We cannot finish yet if more buffers snuck in while
* re-enabling notify.
@@ -153,26 +144,27 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}
if (out) {
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
vq_err(vq, "Expected 0 output buffers, got %u\n", out);
break;
}
iov_len = iov_length(&vq->iov[out], in);
- if (iov_len < sizeof(pkt->hdr)) {
- virtio_transport_free_pkt(pkt);
+ if (iov_len < sizeof(*hdr)) {
+ kfree_skb(skb);
vq_err(vq, "Buffer len [%zu] too small\n", iov_len);
break;
}
iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
- payload_len = pkt->len - pkt->off;
+ payload_len = skb->len - vsock_metadata(skb)->off;
+ hdr = vsock_hdr(skb);
/* If the packet is greater than the space available in the
* buffer, we split it using multiple buffers.
*/
- if (payload_len > iov_len - sizeof(pkt->hdr)) {
- payload_len = iov_len - sizeof(pkt->hdr);
+ if (payload_len > iov_len - sizeof(*hdr)) {
+ payload_len = iov_len - sizeof(*hdr);
/* As we are copying pieces of large packet's buffer to
* small rx buffers, headers of packets in rx queue are
@@ -185,31 +177,31 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
* bits set. After initialized header will be copied to
* rx buffer, these required bits will be restored.
*/
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
- pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
+ hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
flags_to_restore |= VIRTIO_VSOCK_SEQ_EOM;
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR) {
- pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) {
+ hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
flags_to_restore |= VIRTIO_VSOCK_SEQ_EOR;
}
}
}
/* Set the correct length in the header */
- pkt->hdr.len = cpu_to_le32(payload_len);
+ hdr->len = cpu_to_le32(payload_len);
- nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
- if (nbytes != sizeof(pkt->hdr)) {
- virtio_transport_free_pkt(pkt);
+ nbytes = copy_to_iter(hdr, sizeof(*hdr), &iov_iter);
+ if (nbytes != sizeof(*hdr)) {
+ kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt hdr\n");
break;
}
- nbytes = copy_to_iter(pkt->buf + pkt->off, payload_len,
+ nbytes = copy_to_iter(skb->data + vsock_metadata(skb)->off, payload_len,
&iov_iter);
if (nbytes != payload_len) {
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
}
@@ -217,31 +209,28 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
/* Deliver to monitoring devices all packets that we
* will transmit.
*/
- virtio_transport_deliver_tap_pkt(pkt);
+ virtio_transport_deliver_tap_pkt(skb);
- vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
+ vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;
- pkt->off += payload_len;
+ vsock_metadata(skb)->off += payload_len;
total_len += payload_len;
/* If we didn't send all the payload we can requeue the packet
* to send it with the next available buffer.
*/
- if (pkt->off < pkt->len) {
- pkt->hdr.flags |= cpu_to_le32(flags_to_restore);
+ if (vsock_metadata(skb)->off < skb->len) {
+ hdr->flags |= cpu_to_le32(flags_to_restore);
- /* We are queueing the same virtio_vsock_pkt to handle
+ /* We are queueing the same skb to handle
* the remaining bytes, and we want to deliver it
* to monitoring devices in the next iteration.
*/
- pkt->tap_delivered = false;
-
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ vsock_metadata(skb)->flags &= ~VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
+ skb_queue_head(&vsock->send_pkt_queue, skb);
} else {
- if (pkt->reply) {
+ if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY) {
int val;
val = atomic_dec_return(&vsock->queued_replies);
@@ -253,7 +242,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
restart_tx = true;
}
- virtio_transport_free_pkt(pkt);
+ consume_skb(skb);
}
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
if (added)
@@ -278,28 +267,26 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
}
static int
-vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
+vhost_transport_send_pkt(struct sk_buff *skb)
{
struct vhost_vsock *vsock;
- int len = pkt->len;
+ int len = skb->len;
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
rcu_read_lock();
/* Find the vhost_vsock according to guest context id */
- vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
+ vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid));
if (!vsock) {
rcu_read_unlock();
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
return -ENODEV;
}
- if (pkt->reply)
+ if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
atomic_inc(&vsock->queued_replies);
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add_tail(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
-
+ skb_queue_tail(&vsock->send_pkt_queue, skb);
vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
rcu_read_unlock();
@@ -310,10 +297,8 @@ static int
vhost_transport_cancel_pkt(struct vsock_sock *vsk)
{
struct vhost_vsock *vsock;
- struct virtio_vsock_pkt *pkt, *n;
int cnt = 0;
int ret = -ENODEV;
- LIST_HEAD(freeme);
rcu_read_lock();
@@ -322,20 +307,7 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
if (!vsock)
goto out;
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
- if (pkt->vsk != vsk)
- continue;
- list_move(&pkt->list, &freeme);
- }
- spin_unlock_bh(&vsock->send_pkt_list_lock);
-
- list_for_each_entry_safe(pkt, n, &freeme, list) {
- if (pkt->reply)
- cnt++;
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
+ cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
if (cnt) {
struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
@@ -352,11 +324,12 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
return ret;
}
-static struct virtio_vsock_pkt *
-vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
+static struct sk_buff *
+vhost_vsock_alloc_skb(struct vhost_virtqueue *vq,
unsigned int out, unsigned int in)
{
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
+ struct virtio_vsock_hdr *hdr;
struct iov_iter iov_iter;
size_t nbytes;
size_t len;
@@ -366,50 +339,49 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
return NULL;
}
- pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
- if (!pkt)
+ len = iov_length(vq->iov, out);
+
+ /* len contains both payload and hdr, so only add additional space for metadata */
+ skb = alloc_skb(len + sizeof(struct virtio_vsock_metadata), GFP_KERNEL);
+ if (!skb)
return NULL;
- len = iov_length(vq->iov, out);
+ memset(skb->head, 0, sizeof(struct virtio_vsock_metadata));
+ virtio_vsock_skb_reserve(skb);
iov_iter_init(&iov_iter, WRITE, vq->iov, out, len);
- nbytes = copy_from_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
- if (nbytes != sizeof(pkt->hdr)) {
+ hdr = vsock_hdr(skb);
+ nbytes = copy_from_iter(hdr, sizeof(*hdr), &iov_iter);
+ if (nbytes != sizeof(*hdr)) {
vq_err(vq, "Expected %zu bytes for pkt->hdr, got %zu bytes\n",
- sizeof(pkt->hdr), nbytes);
- kfree(pkt);
+ sizeof(*hdr), nbytes);
+ kfree_skb(skb);
return NULL;
}
- pkt->len = le32_to_cpu(pkt->hdr.len);
+ len = le32_to_cpu(hdr->len);
/* No payload */
- if (!pkt->len)
- return pkt;
+ if (!len)
+ return skb;
/* The pkt is too big */
- if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
- kfree(pkt);
+ if (len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+ kfree_skb(skb);
return NULL;
}
- pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
- if (!pkt->buf) {
- kfree(pkt);
- return NULL;
- }
+ virtio_vsock_skb_rx_put(skb);
- pkt->buf_len = pkt->len;
-
- nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
- if (nbytes != pkt->len) {
- vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
- pkt->len, nbytes);
- virtio_transport_free_pkt(pkt);
+ nbytes = copy_from_iter(skb->data, len, &iov_iter);
+ if (nbytes != len) {
+ vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
+ len, nbytes);
+ kfree_skb(skb);
return NULL;
}
- return pkt;
+ return skb;
}
/* Is there space left for replies to rx packets? */
@@ -496,7 +468,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
poll.work);
struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
dev);
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
int head, pkts = 0, total_len = 0;
unsigned int out, in;
bool added = false;
@@ -511,6 +483,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
vhost_disable_notify(&vsock->dev, vq);
do {
+ struct virtio_vsock_hdr *hdr;
+ u32 len;
+
if (!vhost_vsock_more_replies(vsock)) {
/* Stop tx until the device processes already
* pending replies. Leave tx virtqueue
@@ -532,26 +507,29 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
break;
}
- pkt = vhost_vsock_alloc_pkt(vq, out, in);
- if (!pkt) {
- vq_err(vq, "Faulted on pkt\n");
+ skb = vhost_vsock_alloc_skb(vq, out, in);
+ if (!skb)
continue;
- }
- total_len += sizeof(pkt->hdr) + pkt->len;
+ len = skb->len;
/* Deliver to monitoring devices all received packets */
- virtio_transport_deliver_tap_pkt(pkt);
+ virtio_transport_deliver_tap_pkt(skb);
+
+ hdr = vsock_hdr(skb);
/* Only accept correctly addressed packets */
- if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
- le64_to_cpu(pkt->hdr.dst_cid) ==
+ if (le64_to_cpu(hdr->src_cid) == vsock->guest_cid &&
+ le64_to_cpu(hdr->dst_cid) ==
vhost_transport_get_local_cid())
- virtio_transport_recv_pkt(&vhost_transport, pkt);
+ virtio_transport_recv_pkt(&vhost_transport, skb);
else
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
+
- vhost_add_used(vq, head, 0);
+ len += sizeof(*hdr);
+ vhost_add_used(vq, head, len);
+ total_len += len;
added = true;
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
@@ -693,8 +671,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
VHOST_VSOCK_WEIGHT, true, NULL);
file->private_data = vsock;
- spin_lock_init(&vsock->send_pkt_list_lock);
- INIT_LIST_HEAD(&vsock->send_pkt_list);
+ skb_queue_head_init(&vsock->send_pkt_queue);
vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
return 0;
@@ -760,16 +737,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
vhost_vsock_flush(vsock);
vhost_dev_stop(&vsock->dev);
- spin_lock_bh(&vsock->send_pkt_list_lock);
- while (!list_empty(&vsock->send_pkt_list)) {
- struct virtio_vsock_pkt *pkt;
-
- pkt = list_first_entry(&vsock->send_pkt_list,
- struct virtio_vsock_pkt, list);
- list_del_init(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ skb_queue_purge(&vsock->send_pkt_queue);
vhost_dev_cleanup(&vsock->dev);
kfree(vsock->dev.vqs);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 35d7eedb5e8e..17ed01466875 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -4,9 +4,43 @@
#include <uapi/linux/virtio_vsock.h>
#include <linux/socket.h>
+#include <vdso/bits.h>
#include <net/sock.h>
#include <net/af_vsock.h>
+enum virtio_vsock_metadata_flags {
+ VIRTIO_VSOCK_METADATA_FLAGS_REPLY = BIT(0),
+ VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED = BIT(1),
+};
+
+/* Used only by the virtio/vhost vsock drivers, not related to protocol */
+struct virtio_vsock_metadata {
+ size_t off;
+ enum virtio_vsock_metadata_flags flags;
+};
+
+#define vsock_hdr(skb) \
+ ((struct virtio_vsock_hdr *) \
+ ((void *)skb->head + sizeof(struct virtio_vsock_metadata)))
+
+#define vsock_metadata(skb) \
+ ((struct virtio_vsock_metadata *)skb->head)
+
+#define virtio_vsock_skb_reserve(skb) \
+ skb_reserve(skb, \
+ sizeof(struct virtio_vsock_metadata) + \
+ sizeof(struct virtio_vsock_hdr))
+
+static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
+{
+ u32 len;
+
+ len = le32_to_cpu(vsock_hdr(skb)->len);
+
+ if (len > 0)
+ skb_put(skb, len);
+}
+
#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
#define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
@@ -35,23 +69,10 @@ struct virtio_vsock_sock {
u32 last_fwd_cnt;
u32 rx_bytes;
u32 buf_alloc;
- struct list_head rx_queue;
+ struct sk_buff_head rx_queue;
u32 msg_count;
};
-struct virtio_vsock_pkt {
- struct virtio_vsock_hdr hdr;
- struct list_head list;
- /* socket refcnt not held, only use for cancellation */
- struct vsock_sock *vsk;
- void *buf;
- u32 buf_len;
- u32 len;
- u32 off;
- bool reply;
- bool tap_delivered;
-};
-
struct virtio_vsock_pkt_info {
u32 remote_cid, remote_port;
struct vsock_sock *vsk;
@@ -68,7 +89,7 @@ struct virtio_transport {
struct vsock_transport transport;
/* Takes ownership of the packet */
- int (*send_pkt)(struct virtio_vsock_pkt *pkt);
+ int (*send_pkt)(struct sk_buff *skb);
};
ssize_t
@@ -149,11 +170,10 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
void virtio_transport_destruct(struct vsock_sock *vsk);
void virtio_transport_recv_pkt(struct virtio_transport *t,
- struct virtio_vsock_pkt *pkt);
-void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
-void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
+ struct sk_buff *skb);
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb);
u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
-void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt);
-
+void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
+int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue);
#endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f04abf662ec6..e348b2d09eac 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -748,6 +748,7 @@ static struct sock *__vsock_create(struct net *net,
vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+ sk->sk_allocation = GFP_KERNEL;
sk->sk_destruct = vsock_sk_destruct;
sk->sk_backlog_rcv = vsock_queue_rcv_skb;
sock_reset_flag(sk, SOCK_DONE);
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index ad64f403536a..3bb293fd8607 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -21,6 +21,12 @@
#include <linux/mutex.h>
#include <net/af_vsock.h>
+#define VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE \
+ (VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE \
+ - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) \
+ - sizeof(struct virtio_vsock_hdr) \
+ - sizeof(struct virtio_vsock_metadata))
+
static struct workqueue_struct *virtio_vsock_workqueue;
static struct virtio_vsock __rcu *the_virtio_vsock;
static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
@@ -42,8 +48,7 @@ struct virtio_vsock {
bool tx_run;
struct work_struct send_pkt_work;
- spinlock_t send_pkt_list_lock;
- struct list_head send_pkt_list;
+ struct sk_buff_head send_pkt_queue;
atomic_t queued_replies;
@@ -101,41 +106,32 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];
for (;;) {
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
struct scatterlist hdr, buf, *sgs[2];
int ret, in_sg = 0, out_sg = 0;
bool reply;
- spin_lock_bh(&vsock->send_pkt_list_lock);
- if (list_empty(&vsock->send_pkt_list)) {
- spin_unlock_bh(&vsock->send_pkt_list_lock);
- break;
- }
+ skb = skb_dequeue(&vsock->send_pkt_queue);
- pkt = list_first_entry(&vsock->send_pkt_list,
- struct virtio_vsock_pkt, list);
- list_del_init(&pkt->list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
-
- virtio_transport_deliver_tap_pkt(pkt);
+ if (!skb)
+ break;
- reply = pkt->reply;
+ virtio_transport_deliver_tap_pkt(skb);
+ reply = vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
- sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+ sg_init_one(&hdr, vsock_hdr(skb), sizeof(*vsock_hdr(skb)));
sgs[out_sg++] = &hdr;
- if (pkt->buf) {
- sg_init_one(&buf, pkt->buf, pkt->len);
+ if (skb->len > 0) {
+ sg_init_one(&buf, skb->data, skb->len);
sgs[out_sg++] = &buf;
}
- ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
+ ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
/* Usually this means that there is no more space available in
* the vq
*/
if (ret < 0) {
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ skb_queue_head(&vsock->send_pkt_queue, skb);
break;
}
@@ -163,33 +159,84 @@ virtio_transport_send_pkt_work(struct work_struct *work)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}
+static inline bool
+virtio_transport_skbs_can_merge(struct sk_buff *old, struct sk_buff *new)
+{
+ return (new->len < GOOD_COPY_LEN &&
+ skb_tailroom(old) >= new->len &&
+ vsock_hdr(new)->src_cid == vsock_hdr(old)->src_cid &&
+ vsock_hdr(new)->dst_cid == vsock_hdr(old)->dst_cid &&
+ vsock_hdr(new)->src_port == vsock_hdr(old)->src_port &&
+ vsock_hdr(new)->dst_port == vsock_hdr(old)->dst_port &&
+ vsock_hdr(new)->type == vsock_hdr(old)->type &&
+ vsock_hdr(new)->flags == vsock_hdr(old)->flags &&
+ vsock_hdr(old)->op == VIRTIO_VSOCK_OP_RW &&
+ vsock_hdr(new)->op == VIRTIO_VSOCK_OP_RW);
+}
+
+/* Merge the two most recent skbs together if possible.
+ *
+ * The queue lock is acquired here; callers must not hold it.
+ */
+static void
+virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
+{
+ struct sk_buff *old;
+
+ spin_lock_bh(&queue->lock);
+ /* In order to reduce skb memory overhead, we merge new packets with
+ * older packets if they pass virtio_transport_skbs_can_merge().
+ */
+ if (skb_queue_empty_lockless(queue)) {
+ __skb_queue_tail(queue, new);
+ goto out;
+ }
+
+ old = skb_peek_tail(queue);
+
+ if (!virtio_transport_skbs_can_merge(old, new)) {
+ __skb_queue_tail(queue, new);
+ goto out;
+ }
+
+ memcpy(skb_put(old, new->len), new->data, new->len);
+ vsock_hdr(old)->len = cpu_to_le32(old->len);
+ vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
+ vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
+ dev_kfree_skb_any(new);
+
+out:
+ spin_unlock_bh(&queue->lock);
+}
+
static int
-virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
+virtio_transport_send_pkt(struct sk_buff *skb)
{
+ struct virtio_vsock_hdr *hdr;
struct virtio_vsock *vsock;
- int len = pkt->len;
+ int len = skb->len;
+
+ hdr = vsock_hdr(skb);
rcu_read_lock();
vsock = rcu_dereference(the_virtio_vsock);
if (!vsock) {
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
len = -ENODEV;
goto out_rcu;
}
- if (le64_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
- virtio_transport_free_pkt(pkt);
+ if (le64_to_cpu(hdr->dst_cid) == vsock->guest_cid) {
+ kfree_skb(skb);
len = -ENODEV;
goto out_rcu;
}
- if (pkt->reply)
+ if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
atomic_inc(&vsock->queued_replies);
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_add_tail(&pkt->list, &vsock->send_pkt_list);
- spin_unlock_bh(&vsock->send_pkt_list_lock);
-
+ virtio_transport_add_to_queue(&vsock->send_pkt_queue, skb);
queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
out_rcu:
@@ -201,9 +248,7 @@ static int
virtio_transport_cancel_pkt(struct vsock_sock *vsk)
{
struct virtio_vsock *vsock;
- struct virtio_vsock_pkt *pkt, *n;
int cnt = 0, ret;
- LIST_HEAD(freeme);
rcu_read_lock();
vsock = rcu_dereference(the_virtio_vsock);
@@ -212,20 +257,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
goto out_rcu;
}
- spin_lock_bh(&vsock->send_pkt_list_lock);
- list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
- if (pkt->vsk != vsk)
- continue;
- list_move(&pkt->list, &freeme);
- }
- spin_unlock_bh(&vsock->send_pkt_list_lock);
-
- list_for_each_entry_safe(pkt, n, &freeme, list) {
- if (pkt->reply)
- cnt++;
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
+ cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
if (cnt) {
struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
@@ -246,38 +278,34 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
{
- int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
- struct virtio_vsock_pkt *pkt;
- struct scatterlist hdr, buf, *sgs[2];
+ struct scatterlist pkt, *sgs[1];
struct virtqueue *vq;
int ret;
vq = vsock->vqs[VSOCK_VQ_RX];
do {
- pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
- if (!pkt)
- break;
+ struct sk_buff *skb;
+ const size_t len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- pkt->buf = kmalloc(buf_len, GFP_KERNEL);
- if (!pkt->buf) {
- virtio_transport_free_pkt(pkt);
+ skb = alloc_skb(len, GFP_KERNEL);
+ if (!skb)
break;
- }
- pkt->buf_len = buf_len;
- pkt->len = buf_len;
+ memset(skb->head, 0,
+ sizeof(struct virtio_vsock_metadata) + sizeof(struct virtio_vsock_hdr));
- sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
- sgs[0] = &hdr;
+ sg_init_one(&pkt, skb->head + sizeof(struct virtio_vsock_metadata),
+ VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE);
+ sgs[0] = &pkt;
- sg_init_one(&buf, pkt->buf, buf_len);
- sgs[1] = &buf;
- ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
- if (ret) {
- virtio_transport_free_pkt(pkt);
+ ret = virtqueue_add_sgs(vq, sgs, 0, 1, skb, GFP_KERNEL);
+ if (ret < 0) {
+ kfree_skb(skb);
break;
}
+
vsock->rx_buf_nr++;
} while (vq->num_free);
if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
@@ -299,12 +327,12 @@ static void virtio_transport_tx_work(struct work_struct *work)
goto out;
do {
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
unsigned int len;
virtqueue_disable_cb(vq);
- while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
- virtio_transport_free_pkt(pkt);
+ while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
+ consume_skb(skb);
added = true;
}
} while (!virtqueue_enable_cb(vq));
@@ -529,7 +557,8 @@ static void virtio_transport_rx_work(struct work_struct *work)
do {
virtqueue_disable_cb(vq);
for (;;) {
- struct virtio_vsock_pkt *pkt;
+ struct virtio_vsock_hdr *hdr;
+ struct sk_buff *skb;
unsigned int len;
if (!virtio_transport_more_replies(vsock)) {
@@ -540,23 +569,24 @@ static void virtio_transport_rx_work(struct work_struct *work)
goto out;
}
- pkt = virtqueue_get_buf(vq, &len);
- if (!pkt) {
+ skb = virtqueue_get_buf(vq, &len);
+ if (!skb)
break;
- }
vsock->rx_buf_nr--;
/* Drop short/long packets */
- if (unlikely(len < sizeof(pkt->hdr) ||
- len > sizeof(pkt->hdr) + pkt->len)) {
- virtio_transport_free_pkt(pkt);
+ if (unlikely(len < sizeof(*hdr) ||
+ len > VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE)) {
+ kfree_skb(skb);
continue;
}
- pkt->len = len - sizeof(pkt->hdr);
- virtio_transport_deliver_tap_pkt(pkt);
- virtio_transport_recv_pkt(&virtio_transport, pkt);
+ hdr = vsock_hdr(skb);
+ virtio_vsock_skb_reserve(skb);
+ virtio_vsock_skb_rx_put(skb);
+ virtio_transport_deliver_tap_pkt(skb);
+ virtio_transport_recv_pkt(&virtio_transport, skb);
}
} while (!virtqueue_enable_cb(vq));
@@ -610,7 +640,7 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
{
struct virtio_device *vdev = vsock->vdev;
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
/* Reset all connected sockets when the VQs disappear */
vsock_for_each_connected_socket(&virtio_transport.transport,
@@ -637,23 +667,16 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
virtio_reset_device(vdev);
mutex_lock(&vsock->rx_lock);
- while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
- virtio_transport_free_pkt(pkt);
+ while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
+ kfree_skb(skb);
mutex_unlock(&vsock->rx_lock);
mutex_lock(&vsock->tx_lock);
- while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
- virtio_transport_free_pkt(pkt);
+ while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
+ kfree_skb(skb);
mutex_unlock(&vsock->tx_lock);
- spin_lock_bh(&vsock->send_pkt_list_lock);
- while (!list_empty(&vsock->send_pkt_list)) {
- pkt = list_first_entry(&vsock->send_pkt_list,
- struct virtio_vsock_pkt, list);
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
- spin_unlock_bh(&vsock->send_pkt_list_lock);
+ skb_queue_purge(&vsock->send_pkt_queue);
/* Delete virtqueues and flush outstanding callbacks if any */
vdev->config->del_vqs(vdev);
@@ -690,8 +713,7 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
mutex_init(&vsock->tx_lock);
mutex_init(&vsock->rx_lock);
mutex_init(&vsock->event_lock);
- spin_lock_init(&vsock->send_pkt_list_lock);
- INIT_LIST_HEAD(&vsock->send_pkt_list);
+ skb_queue_head_init(&vsock->send_pkt_queue);
INIT_WORK(&vsock->rx_work, virtio_transport_rx_work);
INIT_WORK(&vsock->tx_work, virtio_transport_tx_work);
INIT_WORK(&vsock->event_work, virtio_transport_event_work);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index ec2c2afbf0d0..920578597bb9 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,53 +37,81 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}
-static struct virtio_vsock_pkt *
-virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
+/* Returns a new sk_buff on success, otherwise returns NULL.
+ *
+ * If NULL is returned, errp is set to a negative errno.
+ */
+static struct sk_buff *
+virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
size_t len,
u32 src_cid,
u32 src_port,
u32 dst_cid,
- u32 dst_port)
+ u32 dst_port,
+ int *errp)
{
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
+ struct virtio_vsock_hdr *hdr;
+ void *payload;
+ const size_t skb_len = sizeof(*hdr) + sizeof(struct virtio_vsock_metadata) + len;
int err;
- pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
- if (!pkt)
- return NULL;
+ if (info->vsk) {
+ unsigned int msg_flags = info->msg ? info->msg->msg_flags : 0;
+ struct sock *sk;
- pkt->hdr.type = cpu_to_le16(info->type);
- pkt->hdr.op = cpu_to_le16(info->op);
- pkt->hdr.src_cid = cpu_to_le64(src_cid);
- pkt->hdr.dst_cid = cpu_to_le64(dst_cid);
- pkt->hdr.src_port = cpu_to_le32(src_port);
- pkt->hdr.dst_port = cpu_to_le32(dst_port);
- pkt->hdr.flags = cpu_to_le32(info->flags);
- pkt->len = len;
- pkt->hdr.len = cpu_to_le32(len);
- pkt->reply = info->reply;
- pkt->vsk = info->vsk;
+ sk = sk_vsock(info->vsk);
+ skb = sock_alloc_send_skb(sk, skb_len,
+ msg_flags & MSG_DONTWAIT, errp);
- if (info->msg && len > 0) {
- pkt->buf = kmalloc(len, GFP_KERNEL);
- if (!pkt->buf)
- goto out_pkt;
+ if (skb)
+ skb->priority = sk->sk_priority;
+ } else {
+ skb = alloc_skb(skb_len, GFP_KERNEL);
+ }
+
+ if (!skb) {
+ /* alloc_skb() does not report an error code, so default to
+ * -ENOMEM here; otherwise errp was already set by
+ * sock_alloc_send_skb().
+ */
+ if (!info->vsk)
+ *errp = -ENOMEM;
+ return NULL;
+ }
- pkt->buf_len = len;
+ memset(skb->head, 0, sizeof(*hdr) + sizeof(struct virtio_vsock_metadata));
+ virtio_vsock_skb_reserve(skb);
+ payload = skb_put(skb, len);
- err = memcpy_from_msg(pkt->buf, info->msg, len);
- if (err)
+ hdr = vsock_hdr(skb);
+ hdr->type = cpu_to_le16(info->type);
+ hdr->op = cpu_to_le16(info->op);
+ hdr->src_cid = cpu_to_le64(src_cid);
+ hdr->dst_cid = cpu_to_le64(dst_cid);
+ hdr->src_port = cpu_to_le32(src_port);
+ hdr->dst_port = cpu_to_le32(dst_port);
+ hdr->flags = cpu_to_le32(info->flags);
+ hdr->len = cpu_to_le32(len);
+
+ if (info->msg && len > 0) {
+ err = memcpy_from_msg(payload, info->msg, len);
+ if (err) {
+ *errp = -ENOMEM;
goto out;
+ }
if (msg_data_left(info->msg) == 0 &&
info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
- pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
if (info->msg->msg_flags & MSG_EOR)
- pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
}
}
+ if (info->reply)
+ vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
+
trace_virtio_transport_alloc_pkt(src_cid, src_port,
dst_cid, dst_port,
len,
@@ -91,85 +119,26 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
info->op,
info->flags);
- return pkt;
+ return skb;
out:
- kfree(pkt->buf);
-out_pkt:
- kfree(pkt);
+ kfree_skb(skb);
return NULL;
}
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
- struct virtio_vsock_pkt *pkt = opaque;
- struct af_vsockmon_hdr *hdr;
- struct sk_buff *skb;
- size_t payload_len;
- void *payload_buf;
-
- /* A packet could be split to fit the RX buffer, so we can retrieve
- * the payload length from the header and the buffer pointer taking
- * care of the offset in the original packet.
- */
- payload_len = le32_to_cpu(pkt->hdr.len);
- payload_buf = pkt->buf + pkt->off;
-
- skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + payload_len,
- GFP_ATOMIC);
- if (!skb)
- return NULL;
-
- hdr = skb_put(skb, sizeof(*hdr));
-
- /* pkt->hdr is little-endian so no need to byteswap here */
- hdr->src_cid = pkt->hdr.src_cid;
- hdr->src_port = pkt->hdr.src_port;
- hdr->dst_cid = pkt->hdr.dst_cid;
- hdr->dst_port = pkt->hdr.dst_port;
-
- hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
- hdr->len = cpu_to_le16(sizeof(pkt->hdr));
- memset(hdr->reserved, 0, sizeof(hdr->reserved));
-
- switch (le16_to_cpu(pkt->hdr.op)) {
- case VIRTIO_VSOCK_OP_REQUEST:
- case VIRTIO_VSOCK_OP_RESPONSE:
- hdr->op = cpu_to_le16(AF_VSOCK_OP_CONNECT);
- break;
- case VIRTIO_VSOCK_OP_RST:
- case VIRTIO_VSOCK_OP_SHUTDOWN:
- hdr->op = cpu_to_le16(AF_VSOCK_OP_DISCONNECT);
- break;
- case VIRTIO_VSOCK_OP_RW:
- hdr->op = cpu_to_le16(AF_VSOCK_OP_PAYLOAD);
- break;
- case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
- case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
- hdr->op = cpu_to_le16(AF_VSOCK_OP_CONTROL);
- break;
- default:
- hdr->op = cpu_to_le16(AF_VSOCK_OP_UNKNOWN);
- break;
- }
-
- skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
-
- if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
- }
-
- return skb;
+ return (struct sk_buff *)opaque;
}
-void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
+void virtio_transport_deliver_tap_pkt(struct sk_buff *skb)
{
- if (pkt->tap_delivered)
+ if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED)
return;
- vsock_deliver_tap(virtio_transport_build_skb, pkt);
- pkt->tap_delivered = true;
+ vsock_deliver_tap(virtio_transport_build_skb, skb);
+ vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
}
EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
@@ -192,8 +161,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
u32 src_cid, src_port, dst_cid, dst_port;
const struct virtio_transport *t_ops;
struct virtio_vsock_sock *vvs;
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
u32 pkt_len = info->pkt_len;
+ int err;
info->type = virtio_transport_get_type(sk_vsock(vsk));
@@ -224,42 +194,47 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
return pkt_len;
- pkt = virtio_transport_alloc_pkt(info, pkt_len,
+ skb = virtio_transport_alloc_skb(info, pkt_len,
src_cid, src_port,
- dst_cid, dst_port);
- if (!pkt) {
+ dst_cid, dst_port,
+ &err);
+ if (!skb) {
virtio_transport_put_credit(vvs, pkt_len);
- return -ENOMEM;
+ return err;
}
- virtio_transport_inc_tx_pkt(vvs, pkt);
+ virtio_transport_inc_tx_pkt(vvs, skb);
+
+ err = t_ops->send_pkt(skb);
- return t_ops->send_pkt(pkt);
+ return err < 0 ? -ENOMEM : err;
}
static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
- if (vvs->rx_bytes + pkt->len > vvs->buf_alloc)
+ if (vvs->rx_bytes + skb->len > vvs->buf_alloc)
return false;
- vvs->rx_bytes += pkt->len;
+ vvs->rx_bytes += skb->len;
return true;
}
static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
- vvs->rx_bytes -= pkt->len;
- vvs->fwd_cnt += pkt->len;
+ vvs->rx_bytes -= skb->len;
+ vvs->fwd_cnt += skb->len;
}
-void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb)
{
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
+
spin_lock_bh(&vvs->rx_lock);
vvs->last_fwd_cnt = vvs->fwd_cnt;
- pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
- pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
+ hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
+ hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
spin_unlock_bh(&vvs->rx_lock);
}
EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
@@ -303,29 +278,29 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
size_t len)
{
struct virtio_vsock_sock *vvs = vsk->trans;
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb, *tmp;
size_t bytes, total = 0, off;
int err = -EFAULT;
spin_lock_bh(&vvs->rx_lock);
- list_for_each_entry(pkt, &vvs->rx_queue, list) {
- off = pkt->off;
+ skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
+ off = vsock_metadata(skb)->off;
if (total == len)
break;
- while (total < len && off < pkt->len) {
+ while (total < len && off < skb->len) {
bytes = len - total;
- if (bytes > pkt->len - off)
- bytes = pkt->len - off;
+ if (bytes > skb->len - off)
+ bytes = skb->len - off;
/* sk_lock is held by caller so no one else can dequeue.
* Unlock rx_lock since memcpy_to_msg() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, pkt->buf + off, bytes);
+ err = memcpy_to_msg(msg, skb->data + off, bytes);
if (err)
goto out;
@@ -352,37 +327,40 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
size_t len)
{
struct virtio_vsock_sock *vvs = vsk->trans;
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
size_t bytes, total = 0;
u32 free_space;
int err = -EFAULT;
spin_lock_bh(&vvs->rx_lock);
- while (total < len && !list_empty(&vvs->rx_queue)) {
- pkt = list_first_entry(&vvs->rx_queue,
- struct virtio_vsock_pkt, list);
+ while (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
+ skb = __skb_dequeue(&vvs->rx_queue);
bytes = len - total;
- if (bytes > pkt->len - pkt->off)
- bytes = pkt->len - pkt->off;
+ if (bytes > skb->len - vsock_metadata(skb)->off)
+ bytes = skb->len - vsock_metadata(skb)->off;
/* sk_lock is held by caller so no one else can dequeue.
* Unlock rx_lock since memcpy_to_msg() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
+ err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, bytes);
if (err)
goto out;
spin_lock_bh(&vvs->rx_lock);
total += bytes;
- pkt->off += bytes;
- if (pkt->off == pkt->len) {
- virtio_transport_dec_rx_pkt(vvs, pkt);
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
+ vsock_metadata(skb)->off += bytes;
+
+ WARN_ON(vsock_metadata(skb)->off > skb->len);
+
+ if (vsock_metadata(skb)->off == skb->len) {
+ virtio_transport_dec_rx_pkt(vvs, skb);
+ consume_skb(skb);
+ } else {
+ __skb_queue_head(&vvs->rx_queue, skb);
}
}
@@ -414,7 +392,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
int flags)
{
struct virtio_vsock_sock *vvs = vsk->trans;
- struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
int dequeued_len = 0;
size_t user_buf_len = msg_data_left(msg);
bool msg_ready = false;
@@ -427,13 +405,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
}
while (!msg_ready) {
- pkt = list_first_entry(&vvs->rx_queue, struct virtio_vsock_pkt, list);
+ struct virtio_vsock_hdr *hdr;
+
+ skb = __skb_dequeue(&vvs->rx_queue);
+ hdr = vsock_hdr(skb);
if (dequeued_len >= 0) {
size_t pkt_len;
size_t bytes_to_copy;
- pkt_len = (size_t)le32_to_cpu(pkt->hdr.len);
+ pkt_len = (size_t)le32_to_cpu(hdr->len);
bytes_to_copy = min(user_buf_len, pkt_len);
if (bytes_to_copy) {
@@ -444,7 +425,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, pkt->buf, bytes_to_copy);
+ err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
if (err) {
/* Copy of message failed. Rest of
* fragments will be freed without copy.
@@ -461,17 +442,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
dequeued_len += pkt_len;
}
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
msg_ready = true;
vvs->msg_count--;
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR)
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
msg->msg_flags |= MSG_EOR;
}
- virtio_transport_dec_rx_pkt(vvs, pkt);
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
+ virtio_transport_dec_rx_pkt(vvs, skb);
+ kfree_skb(skb);
}
spin_unlock_bh(&vvs->rx_lock);
@@ -609,7 +589,7 @@ int virtio_transport_do_socket_init(struct vsock_sock *vsk,
spin_lock_init(&vvs->rx_lock);
spin_lock_init(&vvs->tx_lock);
- INIT_LIST_HEAD(&vvs->rx_queue);
+ skb_queue_head_init(&vvs->rx_queue);
return 0;
}
@@ -809,16 +789,16 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
EXPORT_SYMBOL_GPL(virtio_transport_destruct);
static int virtio_transport_reset(struct vsock_sock *vsk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RST,
- .reply = !!pkt,
+ .reply = !!skb,
.vsk = vsk,
};
/* Send RST only if the original pkt is not a RST pkt */
- if (pkt && le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+ if (skb && le16_to_cpu(vsock_hdr(skb)->op) == VIRTIO_VSOCK_OP_RST)
return 0;
return virtio_transport_send_pkt_info(vsk, &info);
@@ -828,29 +808,32 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
* attempt was made to connect to a socket that does not exist.
*/
static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
- struct virtio_vsock_pkt *reply;
+ struct sk_buff *reply;
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RST,
- .type = le16_to_cpu(pkt->hdr.type),
+ .type = le16_to_cpu(hdr->type),
.reply = true,
};
+ int err;
/* Send RST only if the original pkt is not a RST pkt */
- if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+ if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
return 0;
- reply = virtio_transport_alloc_pkt(&info, 0,
- le64_to_cpu(pkt->hdr.dst_cid),
- le32_to_cpu(pkt->hdr.dst_port),
- le64_to_cpu(pkt->hdr.src_cid),
- le32_to_cpu(pkt->hdr.src_port));
+ reply = virtio_transport_alloc_skb(&info, 0,
+ le64_to_cpu(hdr->dst_cid),
+ le32_to_cpu(hdr->dst_port),
+ le64_to_cpu(hdr->src_cid),
+ le32_to_cpu(hdr->src_port),
+ &err);
if (!reply)
- return -ENOMEM;
+ return err;
if (!t) {
- virtio_transport_free_pkt(reply);
+ kfree_skb(reply);
return -ENOTCONN;
}
@@ -861,16 +844,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
static void virtio_transport_remove_sock(struct vsock_sock *vsk)
{
struct virtio_vsock_sock *vvs = vsk->trans;
- struct virtio_vsock_pkt *pkt, *tmp;
/* We don't need to take rx_lock, as the socket is closing and we are
* removing it.
*/
- list_for_each_entry_safe(pkt, tmp, &vvs->rx_queue, list) {
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
-
+ __skb_queue_purge(&vvs->rx_queue);
vsock_remove_sock(vsk);
}
@@ -984,13 +962,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_release);
static int
virtio_transport_recv_connecting(struct sock *sk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
int err;
int skerr;
- switch (le16_to_cpu(pkt->hdr.op)) {
+ switch (le16_to_cpu(hdr->op)) {
case VIRTIO_VSOCK_OP_RESPONSE:
sk->sk_state = TCP_ESTABLISHED;
sk->sk_socket->state = SS_CONNECTED;
@@ -1011,7 +990,7 @@ virtio_transport_recv_connecting(struct sock *sk,
return 0;
destroy:
- virtio_transport_reset(vsk, pkt);
+ virtio_transport_reset(vsk, skb);
sk->sk_state = TCP_CLOSE;
sk->sk_err = skerr;
sk_error_report(sk);
@@ -1020,34 +999,38 @@ virtio_transport_recv_connecting(struct sock *sk,
static void
virtio_transport_recv_enqueue(struct vsock_sock *vsk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct virtio_vsock_sock *vvs = vsk->trans;
+ struct virtio_vsock_hdr *hdr;
bool can_enqueue, free_pkt = false;
+ u32 len;
- pkt->len = le32_to_cpu(pkt->hdr.len);
- pkt->off = 0;
+ hdr = vsock_hdr(skb);
+ len = le32_to_cpu(hdr->len);
+ vsock_metadata(skb)->off = 0;
spin_lock_bh(&vvs->rx_lock);
- can_enqueue = virtio_transport_inc_rx_pkt(vvs, pkt);
+ can_enqueue = virtio_transport_inc_rx_pkt(vvs, skb);
if (!can_enqueue) {
free_pkt = true;
goto out;
}
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)
vvs->msg_count++;
/* Try to copy small packets into the buffer of last packet queued,
* to avoid wasting memory queueing the entire buffer with a small
* payload.
*/
- if (pkt->len <= GOOD_COPY_LEN && !list_empty(&vvs->rx_queue)) {
- struct virtio_vsock_pkt *last_pkt;
+ if (len <= GOOD_COPY_LEN && !skb_queue_empty_lockless(&vvs->rx_queue)) {
+ struct virtio_vsock_hdr *last_hdr;
+ struct sk_buff *last_skb;
- last_pkt = list_last_entry(&vvs->rx_queue,
- struct virtio_vsock_pkt, list);
+ last_skb = skb_peek_tail(&vvs->rx_queue);
+ last_hdr = vsock_hdr(last_skb);
/* If there is space in the last packet queued, we copy the
* new packet in its buffer. We avoid this if the last packet
@@ -1055,35 +1038,35 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
* delimiter of SEQPACKET message, so 'pkt' is the first packet
* of a new message.
*/
- if ((pkt->len <= last_pkt->buf_len - last_pkt->len) &&
- !(le32_to_cpu(last_pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)) {
- memcpy(last_pkt->buf + last_pkt->len, pkt->buf,
- pkt->len);
- last_pkt->len += pkt->len;
+ if (skb->len < skb_tailroom(last_skb) &&
+ !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) &&
+ (vsock_hdr(skb)->type != VIRTIO_VSOCK_TYPE_DGRAM)) {
+ memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
free_pkt = true;
- last_pkt->hdr.flags |= pkt->hdr.flags;
+ last_hdr->flags |= hdr->flags;
goto out;
}
}
- list_add_tail(&pkt->list, &vvs->rx_queue);
+ __skb_queue_tail(&vvs->rx_queue, skb);
out:
spin_unlock_bh(&vvs->rx_lock);
if (free_pkt)
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
}
static int
virtio_transport_recv_connected(struct sock *sk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
int err = 0;
- switch (le16_to_cpu(pkt->hdr.op)) {
+ switch (le16_to_cpu(hdr->op)) {
case VIRTIO_VSOCK_OP_RW:
- virtio_transport_recv_enqueue(vsk, pkt);
+ virtio_transport_recv_enqueue(vsk, skb);
sk->sk_data_ready(sk);
return err;
case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
@@ -1093,18 +1076,17 @@ virtio_transport_recv_connected(struct sock *sk,
sk->sk_write_space(sk);
break;
case VIRTIO_VSOCK_OP_SHUTDOWN:
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
vsk->peer_shutdown |= RCV_SHUTDOWN;
- if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
+ if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
vsk->peer_shutdown |= SEND_SHUTDOWN;
if (vsk->peer_shutdown == SHUTDOWN_MASK &&
vsock_stream_has_data(vsk) <= 0 &&
!sock_flag(sk, SOCK_DONE)) {
(void)virtio_transport_reset(vsk, NULL);
-
virtio_transport_do_close(vsk, true);
}
- if (le32_to_cpu(pkt->hdr.flags))
+ if (le32_to_cpu(hdr->flags))
sk->sk_state_change(sk);
break;
case VIRTIO_VSOCK_OP_RST:
@@ -1115,28 +1097,30 @@ virtio_transport_recv_connected(struct sock *sk,
break;
}
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
return err;
}
static void
virtio_transport_recv_disconnecting(struct sock *sk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
- if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+ if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
virtio_transport_do_close(vsk, true);
}
static int
virtio_transport_send_response(struct vsock_sock *vsk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RESPONSE,
- .remote_cid = le64_to_cpu(pkt->hdr.src_cid),
- .remote_port = le32_to_cpu(pkt->hdr.src_port),
+ .remote_cid = le64_to_cpu(hdr->src_cid),
+ .remote_port = le32_to_cpu(hdr->src_port),
.reply = true,
.vsk = vsk,
};
@@ -1145,10 +1129,11 @@ virtio_transport_send_response(struct vsock_sock *vsk,
}
static bool virtio_transport_space_update(struct sock *sk,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct vsock_sock *vsk = vsock_sk(sk);
struct virtio_vsock_sock *vvs = vsk->trans;
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
bool space_available;
/* Listener sockets are not associated with any transport, so we are
@@ -1161,8 +1146,8 @@ static bool virtio_transport_space_update(struct sock *sk,
/* buf_alloc and fwd_cnt is always included in the hdr */
spin_lock_bh(&vvs->tx_lock);
- vvs->peer_buf_alloc = le32_to_cpu(pkt->hdr.buf_alloc);
- vvs->peer_fwd_cnt = le32_to_cpu(pkt->hdr.fwd_cnt);
+ vvs->peer_buf_alloc = le32_to_cpu(hdr->buf_alloc);
+ vvs->peer_fwd_cnt = le32_to_cpu(hdr->fwd_cnt);
space_available = virtio_transport_has_space(vsk);
spin_unlock_bh(&vvs->tx_lock);
return space_available;
@@ -1170,27 +1155,28 @@ static bool virtio_transport_space_update(struct sock *sk,
/* Handle server socket */
static int
-virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
+virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
struct virtio_transport *t)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
struct vsock_sock *vchild;
struct sock *child;
int ret;
- if (le16_to_cpu(pkt->hdr.op) != VIRTIO_VSOCK_OP_REQUEST) {
- virtio_transport_reset_no_sock(t, pkt);
+ if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
+ virtio_transport_reset_no_sock(t, skb);
return -EINVAL;
}
if (sk_acceptq_is_full(sk)) {
- virtio_transport_reset_no_sock(t, pkt);
+ virtio_transport_reset_no_sock(t, skb);
return -ENOMEM;
}
child = vsock_create_connected(sk);
if (!child) {
- virtio_transport_reset_no_sock(t, pkt);
+ virtio_transport_reset_no_sock(t, skb);
return -ENOMEM;
}
@@ -1201,10 +1187,10 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
child->sk_state = TCP_ESTABLISHED;
vchild = vsock_sk(child);
- vsock_addr_init(&vchild->local_addr, le64_to_cpu(pkt->hdr.dst_cid),
- le32_to_cpu(pkt->hdr.dst_port));
- vsock_addr_init(&vchild->remote_addr, le64_to_cpu(pkt->hdr.src_cid),
- le32_to_cpu(pkt->hdr.src_port));
+ vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
+ le32_to_cpu(hdr->dst_port));
+ vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
+ le32_to_cpu(hdr->src_port));
ret = vsock_assign_transport(vchild, vsk);
/* Transport assigned (looking at remote_addr) must be the same
@@ -1212,17 +1198,17 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
*/
if (ret || vchild->transport != &t->transport) {
release_sock(child);
- virtio_transport_reset_no_sock(t, pkt);
+ virtio_transport_reset_no_sock(t, skb);
sock_put(child);
return ret;
}
- if (virtio_transport_space_update(child, pkt))
+ if (virtio_transport_space_update(child, skb))
child->sk_write_space(child);
vsock_insert_connected(vchild);
vsock_enqueue_accept(sk, child);
- virtio_transport_send_response(vchild, pkt);
+ virtio_transport_send_response(vchild, skb);
release_sock(child);
@@ -1240,29 +1226,30 @@ static bool virtio_transport_valid_type(u16 type)
* lock.
*/
void virtio_transport_recv_pkt(struct virtio_transport *t,
- struct virtio_vsock_pkt *pkt)
+ struct sk_buff *skb)
{
struct sockaddr_vm src, dst;
struct vsock_sock *vsk;
struct sock *sk;
bool space_available;
+ struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
- vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
- le32_to_cpu(pkt->hdr.src_port));
- vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
- le32_to_cpu(pkt->hdr.dst_port));
+ vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
+ le32_to_cpu(hdr->src_port));
+ vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
+ le32_to_cpu(hdr->dst_port));
trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
dst.svm_cid, dst.svm_port,
- le32_to_cpu(pkt->hdr.len),
- le16_to_cpu(pkt->hdr.type),
- le16_to_cpu(pkt->hdr.op),
- le32_to_cpu(pkt->hdr.flags),
- le32_to_cpu(pkt->hdr.buf_alloc),
- le32_to_cpu(pkt->hdr.fwd_cnt));
-
- if (!virtio_transport_valid_type(le16_to_cpu(pkt->hdr.type))) {
- (void)virtio_transport_reset_no_sock(t, pkt);
+ le32_to_cpu(hdr->len),
+ le16_to_cpu(hdr->type),
+ le16_to_cpu(hdr->op),
+ le32_to_cpu(hdr->flags),
+ le32_to_cpu(hdr->buf_alloc),
+ le32_to_cpu(hdr->fwd_cnt));
+
+ if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
+ (void)virtio_transport_reset_no_sock(t, skb);
goto free_pkt;
}
@@ -1273,13 +1260,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
if (!sk) {
sk = vsock_find_bound_socket(&dst);
if (!sk) {
- (void)virtio_transport_reset_no_sock(t, pkt);
+ (void)virtio_transport_reset_no_sock(t, skb);
goto free_pkt;
}
}
- if (virtio_transport_get_type(sk) != le16_to_cpu(pkt->hdr.type)) {
- (void)virtio_transport_reset_no_sock(t, pkt);
+ if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
+ (void)virtio_transport_reset_no_sock(t, skb);
sock_put(sk);
goto free_pkt;
}
@@ -1290,13 +1277,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
/* Check if sk has been closed before lock_sock */
if (sock_flag(sk, SOCK_DONE)) {
- (void)virtio_transport_reset_no_sock(t, pkt);
+ (void)virtio_transport_reset_no_sock(t, skb);
release_sock(sk);
sock_put(sk);
goto free_pkt;
}
- space_available = virtio_transport_space_update(sk, pkt);
+ space_available = virtio_transport_space_update(sk, skb);
/* Update CID in case it has changed after a transport reset event */
if (vsk->local_addr.svm_cid != VMADDR_CID_ANY)
@@ -1307,23 +1294,23 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
switch (sk->sk_state) {
case TCP_LISTEN:
- virtio_transport_recv_listen(sk, pkt, t);
- virtio_transport_free_pkt(pkt);
+ virtio_transport_recv_listen(sk, skb, t);
+ kfree_skb(skb);
break;
case TCP_SYN_SENT:
- virtio_transport_recv_connecting(sk, pkt);
- virtio_transport_free_pkt(pkt);
+ virtio_transport_recv_connecting(sk, skb);
+ kfree_skb(skb);
break;
case TCP_ESTABLISHED:
- virtio_transport_recv_connected(sk, pkt);
+ virtio_transport_recv_connected(sk, skb);
break;
case TCP_CLOSING:
- virtio_transport_recv_disconnecting(sk, pkt);
- virtio_transport_free_pkt(pkt);
+ virtio_transport_recv_disconnecting(sk, skb);
+ kfree_skb(skb);
break;
default:
- (void)virtio_transport_reset_no_sock(t, pkt);
- virtio_transport_free_pkt(pkt);
+ (void)virtio_transport_reset_no_sock(t, skb);
+ kfree_skb(skb);
break;
}
@@ -1336,16 +1323,42 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
return;
free_pkt:
- virtio_transport_free_pkt(pkt);
+ kfree_skb(skb);
}
EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
-void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
+/* Remove and free any skbs in @queue whose socket matches @vsk.
+ *
+ * Returns the number of freed skbs that were reply packets.
+ */
+int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue)
{
- kfree(pkt->buf);
- kfree(pkt);
+ int cnt = 0;
+ struct sk_buff *skb, *tmp;
+ struct sk_buff_head freeme;
+
+ skb_queue_head_init(&freeme);
+
+ spin_lock_bh(&queue->lock);
+ skb_queue_walk_safe(queue, skb, tmp) {
+ if (vsock_sk(skb->sk) != vsk)
+ continue;
+
+ __skb_unlink(skb, queue);
+ skb_queue_tail(&freeme, skb);
+
+ if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
+ cnt++;
+ }
+ spin_unlock_bh(&queue->lock);
+
+ skb_queue_purge(&freeme);
+
+ return cnt;
}
-EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
+EXPORT_SYMBOL_GPL(virtio_transport_purge_skbs);
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Asias He");
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 169a8cf65b39..906f7cdff65e 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -16,7 +16,7 @@ struct vsock_loopback {
struct workqueue_struct *workqueue;
spinlock_t pkt_list_lock; /* protects pkt_list */
- struct list_head pkt_list;
+ struct sk_buff_head pkt_queue;
struct work_struct pkt_work;
};
@@ -27,13 +27,13 @@ static u32 vsock_loopback_get_local_cid(void)
return VMADDR_CID_LOCAL;
}
-static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
+static int vsock_loopback_send_pkt(struct sk_buff *skb)
{
struct vsock_loopback *vsock = &the_vsock_loopback;
- int len = pkt->len;
+ int len = skb->len;
spin_lock_bh(&vsock->pkt_list_lock);
- list_add_tail(&pkt->list, &vsock->pkt_list);
+ skb_queue_tail(&vsock->pkt_queue, skb);
spin_unlock_bh(&vsock->pkt_list_lock);
queue_work(vsock->workqueue, &vsock->pkt_work);
@@ -44,21 +44,8 @@ static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
{
struct vsock_loopback *vsock = &the_vsock_loopback;
- struct virtio_vsock_pkt *pkt, *n;
- LIST_HEAD(freeme);
- spin_lock_bh(&vsock->pkt_list_lock);
- list_for_each_entry_safe(pkt, n, &vsock->pkt_list, list) {
- if (pkt->vsk != vsk)
- continue;
- list_move(&pkt->list, &freeme);
- }
- spin_unlock_bh(&vsock->pkt_list_lock);
-
- list_for_each_entry_safe(pkt, n, &freeme, list) {
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
+ virtio_transport_purge_skbs(vsk, &vsock->pkt_queue);
return 0;
}
@@ -121,20 +108,20 @@ static void vsock_loopback_work(struct work_struct *work)
{
struct vsock_loopback *vsock =
container_of(work, struct vsock_loopback, pkt_work);
- LIST_HEAD(pkts);
+ struct sk_buff_head pkts;
+
+ skb_queue_head_init(&pkts);
spin_lock_bh(&vsock->pkt_list_lock);
- list_splice_init(&vsock->pkt_list, &pkts);
+ skb_queue_splice_init(&vsock->pkt_queue, &pkts);
spin_unlock_bh(&vsock->pkt_list_lock);
- while (!list_empty(&pkts)) {
- struct virtio_vsock_pkt *pkt;
+ while (!skb_queue_empty(&pkts)) {
+ struct sk_buff *skb;
- pkt = list_first_entry(&pkts, struct virtio_vsock_pkt, list);
- list_del_init(&pkt->list);
-
- virtio_transport_deliver_tap_pkt(pkt);
- virtio_transport_recv_pkt(&loopback_transport, pkt);
+ skb = skb_dequeue(&pkts);
+ virtio_transport_deliver_tap_pkt(skb);
+ virtio_transport_recv_pkt(&loopback_transport, skb);
}
}
@@ -148,7 +135,7 @@ static int __init vsock_loopback_init(void)
return -ENOMEM;
spin_lock_init(&vsock->pkt_list_lock);
- INIT_LIST_HEAD(&vsock->pkt_list);
+ skb_queue_head_init(&vsock->pkt_queue);
INIT_WORK(&vsock->pkt_work, vsock_loopback_work);
ret = vsock_core_register(&loopback_transport.transport,
@@ -166,19 +153,13 @@ static int __init vsock_loopback_init(void)
static void __exit vsock_loopback_exit(void)
{
struct vsock_loopback *vsock = &the_vsock_loopback;
- struct virtio_vsock_pkt *pkt;
vsock_core_unregister(&loopback_transport.transport);
flush_work(&vsock->pkt_work);
spin_lock_bh(&vsock->pkt_list_lock);
- while (!list_empty(&vsock->pkt_list)) {
- pkt = list_first_entry(&vsock->pkt_list,
- struct virtio_vsock_pkt, list);
- list_del(&pkt->list);
- virtio_transport_free_pkt(pkt);
- }
+ skb_queue_purge(&vsock->pkt_queue);
spin_unlock_bh(&vsock->pkt_list_lock);
destroy_workqueue(vsock->workqueue);
--
2.35.1
To support the use of qdisc on vsock traffic, this commit introduces a
struct net_device to vhost and virtio vsock.
Two new devices are created, vhost-vsock for vhost and virtio-vsock
for virtio. The devices are attached to the respective transports.
To bypass the usage of the device, the user may "down" the associated
network interface using common tools. For example, "ip link set dev
virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
simply using the FIFO logic of the prior implementation.
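As a sketch of how this might look in practice (assuming the device name
created by this series, and root privileges; the exact qdisc choice is
illustrative, not part of the patch):

```shell
# Attach fq_codel so datagram and stream flows share the queue fairly
# (hypothetical policy; any classful or classless qdisc could be used).
tc qdisc replace dev virtio-vsock root fq_codel

# Bypass the net_device and qdisc entirely, restoring the legacy
# FIFO queuing behavior.
ip link set dev virtio-vsock down
```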
For both hosts and guests, there is one device for all G2H vsock sockets
and one device for all H2G vsock sockets. This makes sense for guests
because the driver only supports a single vsock channel (one pair of
TX/RX virtqueues), so a single device and qdisc suffice. For hosts, this
may not seem ideal for some workloads. However, it is possible to use a
multi-queue qdisc, where a given queue is responsible for a range of
sockets. This seems preferable to having one device per socket, which
would yield a very large number of devices and qdiscs, constantly
created and destroyed as sockets come and go. That churn would also
require a complex policy management daemon to keep qdisc configuration
in sync with the device lifecycle. To avoid this, a single device and
qdisc likewise covers all H2G sockets.
Signed-off-by: Bobby Eshleman <[email protected]>
---
drivers/vhost/vsock.c | 19 +++-
include/linux/virtio_vsock.h | 10 +++
net/vmw_vsock/virtio_transport.c | 19 +++-
net/vmw_vsock/virtio_transport_common.c | 112 +++++++++++++++++++++++-
4 files changed, 152 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index f8601d93d94d..b20ddec2664b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -927,13 +927,30 @@ static int __init vhost_vsock_init(void)
VSOCK_TRANSPORT_F_H2G);
if (ret < 0)
return ret;
- return misc_register(&vhost_vsock_misc);
+
+ ret = virtio_transport_init(&vhost_transport, "vhost-vsock");
+ if (ret < 0)
+ goto out_unregister;
+
+ ret = misc_register(&vhost_vsock_misc);
+ if (ret < 0)
+ goto out_transport_exit;
+ return ret;
+
+out_transport_exit:
+ virtio_transport_exit(&vhost_transport);
+
+out_unregister:
+ vsock_core_unregister(&vhost_transport.transport);
+ return ret;
+
};
static void __exit vhost_vsock_exit(void)
{
misc_deregister(&vhost_vsock_misc);
vsock_core_unregister(&vhost_transport.transport);
+ virtio_transport_exit(&vhost_transport);
};
module_init(vhost_vsock_init);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 9a37eddbb87a..5d7e7fbd75f8 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -91,10 +91,20 @@ struct virtio_transport {
/* This must be the first field */
struct vsock_transport transport;
+ /* Used almost exclusively for qdisc */
+ struct net_device *dev;
+
/* Takes ownership of the packet */
int (*send_pkt)(struct sk_buff *skb);
};
+int
+virtio_transport_init(struct virtio_transport *t,
+ const char *name);
+
+void
+virtio_transport_exit(struct virtio_transport *t);
+
ssize_t
virtio_transport_stream_dequeue(struct vsock_sock *vsk,
struct msghdr *msg,
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 3bb293fd8607..c6212eb38d3c 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -131,7 +131,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
* the vq
*/
if (ret < 0) {
- skb_queue_head(&vsock->send_pkt_queue, skb);
+ spin_lock_bh(&vsock->send_pkt_queue.lock);
+ __skb_queue_head(&vsock->send_pkt_queue, skb);
+ spin_unlock_bh(&vsock->send_pkt_queue.lock);
break;
}
@@ -676,7 +678,9 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
kfree_skb(skb);
mutex_unlock(&vsock->tx_lock);
- skb_queue_purge(&vsock->send_pkt_queue);
+ spin_lock_bh(&vsock->send_pkt_queue.lock);
+ __skb_queue_purge(&vsock->send_pkt_queue);
+ spin_unlock_bh(&vsock->send_pkt_queue.lock);
/* Delete virtqueues and flush outstanding callbacks if any */
vdev->config->del_vqs(vdev);
@@ -760,6 +764,8 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
flush_work(&vsock->event_work);
flush_work(&vsock->send_pkt_work);
+ virtio_transport_exit(&virtio_transport);
+
mutex_unlock(&the_virtio_vsock_mutex);
kfree(vsock);
@@ -844,12 +850,18 @@ static int __init virtio_vsock_init(void)
if (ret)
goto out_wq;
- ret = register_virtio_driver(&virtio_vsock_driver);
+ ret = virtio_transport_init(&virtio_transport, "virtio-vsock");
if (ret)
goto out_vci;
+ ret = register_virtio_driver(&virtio_vsock_driver);
+ if (ret)
+ goto out_transport;
+
return 0;
+out_transport:
+ virtio_transport_exit(&virtio_transport);
out_vci:
vsock_core_unregister(&virtio_transport.transport);
out_wq:
@@ -861,6 +873,7 @@ static void __exit virtio_vsock_exit(void)
{
unregister_virtio_driver(&virtio_vsock_driver);
vsock_core_unregister(&virtio_transport.transport);
+ virtio_transport_exit(&virtio_transport);
destroy_workqueue(virtio_vsock_workqueue);
}
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index d5780599fe93..bdf16fff054f 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -16,6 +16,7 @@
#include <net/sock.h>
#include <net/af_vsock.h>
+#include <net/pkt_sched.h>
#define CREATE_TRACE_POINTS
#include <trace/events/vsock_virtio_transport_common.h>
@@ -23,6 +24,93 @@
/* How long to wait for graceful shutdown of a connection */
#define VSOCK_CLOSE_TIMEOUT (8 * HZ)
+struct virtio_transport_priv {
+ struct virtio_transport *trans;
+};
+
+static netdev_tx_t virtio_transport_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct virtio_transport *t =
+ ((struct virtio_transport_priv *)netdev_priv(dev))->trans;
+ int ret;
+
+ ret = t->send_pkt(skb);
+ if (unlikely(ret == -ENODEV))
+ return NETDEV_TX_BUSY;
+
+ return NETDEV_TX_OK;
+}
+
+static const struct net_device_ops virtio_transport_netdev_ops = {
+ .ndo_start_xmit = virtio_transport_start_xmit,
+};
+
+static void virtio_transport_setup(struct net_device *dev)
+{
+ dev->netdev_ops = &virtio_transport_netdev_ops;
+ dev->needs_free_netdev = true;
+ dev->flags = IFF_NOARP;
+ dev->mtu = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+ dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
+}
+
+static int ifup(struct net_device *dev)
+{
+ int ret;
+
+ rtnl_lock();
+ ret = dev_open(dev, NULL);
+ rtnl_unlock();
+
+ return ret;
+}
+
+/* virtio_transport_init - initialize a virtio vsock transport layer
+ *
+ * @t: ptr to the virtio transport struct to initialize
+ * @name: the name of the net_device to be created.
+ *
+ * Return 0 on success, otherwise negative errno.
+ */
+int virtio_transport_init(struct virtio_transport *t, const char *name)
+{
+ struct virtio_transport_priv *priv;
+ int ret;
+
+ t->dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN, virtio_transport_setup);
+ if (!t->dev)
+ return -ENOMEM;
+
+ priv = netdev_priv(t->dev);
+ priv->trans = t;
+
+ ret = register_netdev(t->dev);
+ if (ret < 0)
+ goto out_free_netdev;
+
+ ret = ifup(t->dev);
+ if (ret < 0)
+ goto out_unregister_netdev;
+
+ return 0;
+
+out_unregister_netdev:
+ unregister_netdev(t->dev);
+
+out_free_netdev:
+ free_netdev(t->dev);
+
+ return ret;
+}
+
+void virtio_transport_exit(struct virtio_transport *t)
+{
+ if (t->dev) {
+ unregister_netdev(t->dev);
+ free_netdev(t->dev);
+ }
+}
+
static const struct virtio_transport *
virtio_transport_get_ops(struct vsock_sock *vsk)
{
@@ -147,6 +235,24 @@ static u16 virtio_transport_get_type(struct sock *sk)
return VIRTIO_VSOCK_TYPE_SEQPACKET;
}
+/* Return skb->len on success, otherwise negative errno */
+static int virtio_transport_send_pkt(const struct virtio_transport *t, struct sk_buff *skb)
+{
+ int ret;
+ int len = skb->len;
+
+ if (unlikely(!t->dev || !(t->dev->flags & IFF_UP)))
+ return t->send_pkt(skb);
+
+ skb->dev = t->dev;
+ ret = dev_queue_xmit(skb);
+
+ if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN))
+ return len;
+
+ return -ENOMEM;
+}
+
/* This function can only be used on connecting/connected sockets,
* since a socket assigned to a transport is required.
*
@@ -202,9 +308,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
virtio_transport_inc_tx_pkt(vvs, skb);
- err = t_ops->send_pkt(skb);
-
- return err < 0 ? -ENOMEM : err;
+ return virtio_transport_send_pkt(t_ops, skb);
}
static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
@@ -834,7 +938,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
return -ENOTCONN;
}
- return t->send_pkt(reply);
+ return virtio_transport_send_pkt(t, reply);
}
/* This function should be called with sk_lock held and SOCK_DONE set */
--
2.35.1
On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> Hey everybody,
>
> This series introduces datagrams, packet scheduling, and sk_buff usage
> to virtio vsock.
>
> The usage of struct sk_buff benefits users by a) preparing vsock to use
> other related systems that require sk_buff, such as sockmap and qdisc,
> b) supporting basic congestion control via sock_alloc_send_skb, and c)
> reducing copying when delivering packets to TAP.
>
> The socket layer no longer forces errors to be -ENOMEM, as typically
> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> messages are being sent with option MSG_DONTWAIT.
>
> The datagram work is based off previous patches by Jiang Wang[1].
>
> The introduction of datagrams creates a transport layer fairness issue
> where datagrams may freely starve streams of queue access. This happens
> because, unlike streams, datagrams lack the transactions necessary for
> calculating credits and throttling.
>
> Previous proposals introduce changes to the spec to add an additional
> virtqueue pair for datagrams[1]. Although this solution works, using
> Linux's qdisc for packet scheduling leverages already existing systems,
> avoids the need to change the virtio specification, and gives additional
> capabilities. The usage of SFQ or fq_codel, for example, may solve the
> transport layer starvation problem. It is easy to imagine other use
> cases as well. For example, services of varying importance may be
> assigned different priorities, and qdisc will apply appropriate
> priority-based scheduling. By default, the system default pfifo qdisc is
> used. The qdisc may be bypassed and legacy queuing is resumed by simply
> setting the virtio-vsock%d network device to state DOWN. This technique
> still allows vsock to work with zero-configuration.
>
> In summary, this series introduces these major changes to vsock:
>
> - virtio vsock supports datagrams
> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
> - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
> which applies the throttling threshold sk_sndbuf.
> - The vsock socket layer supports returning errors other than -ENOMEM.
> - This is used to return -EAGAIN when the sk_sndbuf threshold is
> reached.
> - virtio vsock uses a net_device, through which qdisc may be used.
> - qdisc allows scheduling policies to be applied to vsock flows.
> - Some qdiscs, like SFQ, may allow vsock to avoid transport layer
> congestion. That is, they may prevent datagrams from starving
> stream flows. The benefit of this is that additional virtqueues
> are not needed for datagrams.
> - The net_device and qdisc are bypassed by simply setting the
> net_device state to DOWN.
>
> [1]: https://lore.kernel.org/all/[email protected]/
Given this affects the driver/device interface, I'd like to
ask you to please copy the virtio-dev mailing list on these patches.
It's subscriber-only, I'm afraid, so you will need to subscribe :(
> Bobby Eshleman (5):
> vsock: replace virtio_vsock_pkt with sk_buff
> vsock: return errors other than -ENOMEM to socket
> vsock: add netdev to vhost/virtio vsock
> virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> virtio/vsock: add support for dgram
>
> Jiang Wang (1):
> vsock_test: add tests for vsock dgram
>
> drivers/vhost/vsock.c | 238 ++++----
> include/linux/virtio_vsock.h | 73 ++-
> include/net/af_vsock.h | 2 +
> include/uapi/linux/virtio_vsock.h | 2 +
> net/vmw_vsock/af_vsock.c | 30 +-
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 237 +++++---
> net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> net/vmw_vsock/vmci_transport.c | 9 +-
> net/vmw_vsock/vsock_loopback.c | 51 +-
> tools/testing/vsock/util.c | 105 ++++
> tools/testing/vsock/util.h | 4 +
> tools/testing/vsock/vsock_test.c | 195 ++++++
> 13 files changed, 1176 insertions(+), 543 deletions(-)
>
> --
> 2.35.1
Hi Bobby,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.0-rc1 next-20220815]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
base: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20220816/[email protected]/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/cbb332da78c86ac574688831ed6f404d04d506db
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
git checkout cbb332da78c86ac574688831ed6f404d04d506db
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash net/vmw_vsock/
If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
net/vmw_vsock/virtio_transport_common.c: In function 'virtio_transport_dgram_do_dequeue':
>> net/vmw_vsock/virtio_transport_common.c:605:13: warning: variable 'free_space' set but not used [-Wunused-but-set-variable]
605 | u32 free_space;
| ^~~~~~~~~~
vim +/free_space +605 net/vmw_vsock/virtio_transport_common.c
597
598 static ssize_t
599 virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
600 struct msghdr *msg, size_t len)
601 {
602 struct virtio_vsock_sock *vvs = vsk->trans;
603 struct sk_buff *skb;
604 size_t total = 0;
> 605 u32 free_space;
606 int err = -EFAULT;
607
608 spin_lock_bh(&vvs->rx_lock);
609 if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
610 skb = __skb_dequeue(&vvs->rx_queue);
611
612 total = len;
613 if (total > skb->len - vsock_metadata(skb)->off)
614 total = skb->len - vsock_metadata(skb)->off;
615 else if (total < skb->len - vsock_metadata(skb)->off)
616 msg->msg_flags |= MSG_TRUNC;
617
618 /* sk_lock is held by caller so no one else can dequeue.
619 * Unlock rx_lock since memcpy_to_msg() may sleep.
620 */
621 spin_unlock_bh(&vvs->rx_lock);
622
623 err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
624 if (err)
625 return err;
626
627 spin_lock_bh(&vvs->rx_lock);
628
629 virtio_transport_dec_rx_pkt(vvs, skb);
630 consume_skb(skb);
631 }
632
633 free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
634
635 spin_unlock_bh(&vvs->rx_lock);
636
637 if (total > 0 && msg->msg_name) {
638 /* Provide the address of the sender. */
639 DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
640
641 vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
642 le32_to_cpu(vsock_hdr(skb)->src_port));
643 msg->msg_namelen = sizeof(*vm_addr);
644 }
645 return total;
646 }
647
--
0-DAY CI Kernel Test Service
https://01.org/lkp
On Mon, Aug 15, 2022 at 04:39:08PM -0400, Michael S. Tsirkin wrote:
>
> Given this affects the driver/device interface I'd like to
> ask you to please copy virtio-dev mailing list on these patches.
> Subscriber only I'm afraid you will need to subscribe :(
>
Ah makes sense, will do!
Best,
Bobby
CC'ing [email protected]
On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> Hey everybody,
>
> This series introduces datagrams, packet scheduling, and sk_buff usage
> to virtio vsock.
>
> The usage of struct sk_buff benefits users by a) preparing vsock to use
> other related systems that require sk_buff, such as sockmap and qdisc,
> b) supporting basic congestion control via sock_alloc_send_skb, and c)
> reducing copying when delivering packets to TAP.
>
> The socket layer no longer forces errors to be -ENOMEM, as typically
> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> messages are being sent with option MSG_DONTWAIT.
>
> The datagram work is based off previous patches by Jiang Wang[1].
>
> The introduction of datagrams creates a transport layer fairness issue
> where datagrams may freely starve streams of queue access. This happens
> because, unlike streams, datagrams lack the transactions necessary for
> calculating credits and throttling.
>
> Previous proposals introduce changes to the spec to add an additional
> virtqueue pair for datagrams[1]. Although this solution works, using
> Linux's qdisc for packet scheduling leverages already existing systems,
> avoids the need to change the virtio specification, and gives additional
> capabilities. The usage of SFQ or fq_codel, for example, may solve the
> transport layer starvation problem. It is easy to imagine other use
> cases as well. For example, services of varying importance may be
> assigned different priorities, and qdisc will apply appropriate
> priority-based scheduling. By default, the system default pfifo qdisc is
> used. The qdisc may be bypassed and legacy queuing is resumed by simply
> setting the virtio-vsock%d network device to state DOWN. This technique
> still allows vsock to work with zero-configuration.
>
> In summary, this series introduces these major changes to vsock:
>
> - virtio vsock supports datagrams
> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
> - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
> which applies the throttling threshold sk_sndbuf.
> - The vsock socket layer supports returning errors other than -ENOMEM.
> - This is used to return -EAGAIN when the sk_sndbuf threshold is
> reached.
> - virtio vsock uses a net_device, through which qdisc may be used.
> - qdisc allows scheduling policies to be applied to vsock flows.
> - Some qdiscs, like SFQ, may allow vsock to avoid transport layer
> congestion. That is, they may prevent datagrams from starving
> stream flows. The benefit of this is that additional virtqueues
> are not needed for datagrams.
> - The net_device and qdisc are bypassed by simply setting the
> net_device state to DOWN.
>
> [1]: https://lore.kernel.org/all/[email protected]/
>
> Bobby Eshleman (5):
> vsock: replace virtio_vsock_pkt with sk_buff
> vsock: return errors other than -ENOMEM to socket
> vsock: add netdev to vhost/virtio vsock
> virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> virtio/vsock: add support for dgram
>
> Jiang Wang (1):
> vsock_test: add tests for vsock dgram
>
> drivers/vhost/vsock.c | 238 ++++----
> include/linux/virtio_vsock.h | 73 ++-
> include/net/af_vsock.h | 2 +
> include/uapi/linux/virtio_vsock.h | 2 +
> net/vmw_vsock/af_vsock.c | 30 +-
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 237 +++++---
> net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> net/vmw_vsock/vmci_transport.c | 9 +-
> net/vmw_vsock/vsock_loopback.c | 51 +-
> tools/testing/vsock/util.c | 105 ++++
> tools/testing/vsock/util.h | 4 +
> tools/testing/vsock/vsock_test.c | 195 ++++++
> 13 files changed, 1176 insertions(+), 543 deletions(-)
>
> --
> 2.35.1
>
CC'ing [email protected]
On Mon, Aug 15, 2022 at 10:56:04AM -0700, Bobby Eshleman wrote:
> This patch replaces virtio_vsock_pkt with sk_buff.
>
> The benefit of this series includes:
>
> * The bug reported @ https://bugzilla.redhat.com/show_bug.cgi?id=2009935
> does not present itself when reasonable sk_sndbuf thresholds are set.
> * Using sock_alloc_send_skb() teaches VSOCK to respect
> sk_sndbuf for tunability.
> * Eliminates copying for vsock_deliver_tap().
> * sk_buff is required for future improvements, such as using socket map.
>
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> drivers/vhost/vsock.c | 214 +++++------
> include/linux/virtio_vsock.h | 60 ++-
> net/vmw_vsock/af_vsock.c | 1 +
> net/vmw_vsock/virtio_transport.c | 212 +++++-----
> net/vmw_vsock/virtio_transport_common.c | 491 ++++++++++++------------
> net/vmw_vsock/vsock_loopback.c | 51 +--
> 6 files changed, 517 insertions(+), 512 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 368330417bde..f8601d93d94d 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -51,8 +51,7 @@ struct vhost_vsock {
> struct hlist_node hash;
>
> struct vhost_work send_pkt_work;
> - spinlock_t send_pkt_list_lock;
> - struct list_head send_pkt_list; /* host->guest pending packets */
> + struct sk_buff_head send_pkt_queue; /* host->guest pending packets */
>
> atomic_t queued_replies;
>
> @@ -108,7 +107,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> vhost_disable_notify(&vsock->dev, vq);
>
> do {
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> + struct virtio_vsock_hdr *hdr;
> struct iov_iter iov_iter;
> unsigned out, in;
> size_t nbytes;
> @@ -116,31 +116,22 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> int head;
> u32 flags_to_restore = 0;
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - if (list_empty(&vsock->send_pkt_list)) {
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb = skb_dequeue(&vsock->send_pkt_queue);
> +
> + if (!skb) {
> vhost_enable_notify(&vsock->dev, vq);
> break;
> }
>
> - pkt = list_first_entry(&vsock->send_pkt_list,
> - struct virtio_vsock_pkt, list);
> - list_del_init(&pkt->list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> &out, &in, NULL, NULL);
> if (head < 0) {
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_add(&pkt->list, &vsock->send_pkt_list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb_queue_head(&vsock->send_pkt_queue, skb);
> break;
> }
>
> if (head == vq->num) {
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_add(&pkt->list, &vsock->send_pkt_list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb_queue_head(&vsock->send_pkt_queue, skb);
>
> /* We cannot finish yet if more buffers snuck in while
> * re-enabling notify.
> @@ -153,26 +144,27 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> }
>
> if (out) {
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> vq_err(vq, "Expected 0 output buffers, got %u\n", out);
> break;
> }
>
> iov_len = iov_length(&vq->iov[out], in);
> - if (iov_len < sizeof(pkt->hdr)) {
> - virtio_transport_free_pkt(pkt);
> + if (iov_len < sizeof(*hdr)) {
> + kfree_skb(skb);
> vq_err(vq, "Buffer len [%zu] too small\n", iov_len);
> break;
> }
>
> iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
> - payload_len = pkt->len - pkt->off;
> + payload_len = skb->len - vsock_metadata(skb)->off;
> + hdr = vsock_hdr(skb);
>
> /* If the packet is greater than the space available in the
> * buffer, we split it using multiple buffers.
> */
> - if (payload_len > iov_len - sizeof(pkt->hdr)) {
> - payload_len = iov_len - sizeof(pkt->hdr);
> + if (payload_len > iov_len - sizeof(*hdr)) {
> + payload_len = iov_len - sizeof(*hdr);
>
> /* As we are copying pieces of large packet's buffer to
> * small rx buffers, headers of packets in rx queue are
> @@ -185,31 +177,31 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> * bits set. After initialized header will be copied to
> * rx buffer, these required bits will be restored.
> */
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
> - pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> + hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> flags_to_restore |= VIRTIO_VSOCK_SEQ_EOM;
>
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR) {
> - pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) {
> + hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> flags_to_restore |= VIRTIO_VSOCK_SEQ_EOR;
> }
> }
> }
>
> /* Set the correct length in the header */
> - pkt->hdr.len = cpu_to_le32(payload_len);
> + hdr->len = cpu_to_le32(payload_len);
>
> - nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
> - if (nbytes != sizeof(pkt->hdr)) {
> - virtio_transport_free_pkt(pkt);
> + nbytes = copy_to_iter(hdr, sizeof(*hdr), &iov_iter);
> + if (nbytes != sizeof(*hdr)) {
> + kfree_skb(skb);
> vq_err(vq, "Faulted on copying pkt hdr\n");
> break;
> }
>
> - nbytes = copy_to_iter(pkt->buf + pkt->off, payload_len,
> + nbytes = copy_to_iter(skb->data + vsock_metadata(skb)->off, payload_len,
> &iov_iter);
> if (nbytes != payload_len) {
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> vq_err(vq, "Faulted on copying pkt buf\n");
> break;
> }
> @@ -217,31 +209,28 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> /* Deliver to monitoring devices all packets that we
> * will transmit.
> */
> - virtio_transport_deliver_tap_pkt(pkt);
> + virtio_transport_deliver_tap_pkt(skb);
>
> - vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
> + vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
> added = true;
>
> - pkt->off += payload_len;
> + vsock_metadata(skb)->off += payload_len;
> total_len += payload_len;
>
> /* If we didn't send all the payload we can requeue the packet
> * to send it with the next available buffer.
> */
> - if (pkt->off < pkt->len) {
> - pkt->hdr.flags |= cpu_to_le32(flags_to_restore);
> + if (vsock_metadata(skb)->off < skb->len) {
> + hdr->flags |= cpu_to_le32(flags_to_restore);
>
> - /* We are queueing the same virtio_vsock_pkt to handle
> + /* We are queueing the same skb to handle
> * the remaining bytes, and we want to deliver it
> * to monitoring devices in the next iteration.
> */
> - pkt->tap_delivered = false;
> -
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_add(&pkt->list, &vsock->send_pkt_list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + vsock_metadata(skb)->flags &= ~VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
> + skb_queue_head(&vsock->send_pkt_queue, skb);
> } else {
> - if (pkt->reply) {
> + if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY) {
> int val;
>
> val = atomic_dec_return(&vsock->queued_replies);
> @@ -253,7 +242,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> restart_tx = true;
> }
>
> - virtio_transport_free_pkt(pkt);
> + consume_skb(skb);
> }
> } while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
> if (added)
> @@ -278,28 +267,26 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
> }
>
> static int
> -vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
> +vhost_transport_send_pkt(struct sk_buff *skb)
> {
> struct vhost_vsock *vsock;
> - int len = pkt->len;
> + int len = skb->len;
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>
> rcu_read_lock();
>
> /* Find the vhost_vsock according to guest context id */
> - vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
> + vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid));
> if (!vsock) {
> rcu_read_unlock();
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> return -ENODEV;
> }
>
> - if (pkt->reply)
> + if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
> atomic_inc(&vsock->queued_replies);
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_add_tail(&pkt->list, &vsock->send_pkt_list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> + skb_queue_tail(&vsock->send_pkt_queue, skb);
> vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
>
> rcu_read_unlock();
> @@ -310,10 +297,8 @@ static int
> vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> {
> struct vhost_vsock *vsock;
> - struct virtio_vsock_pkt *pkt, *n;
> int cnt = 0;
> int ret = -ENODEV;
> - LIST_HEAD(freeme);
>
> rcu_read_lock();
>
> @@ -322,20 +307,7 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> if (!vsock)
> goto out;
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
> - if (pkt->vsk != vsk)
> - continue;
> - list_move(&pkt->list, &freeme);
> - }
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> - list_for_each_entry_safe(pkt, n, &freeme, list) {
> - if (pkt->reply)
> - cnt++;
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> + cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>
> if (cnt) {
> struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
> @@ -352,11 +324,12 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> return ret;
> }
>
> -static struct virtio_vsock_pkt *
> -vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> +static struct sk_buff *
> +vhost_vsock_alloc_skb(struct vhost_virtqueue *vq,
> unsigned int out, unsigned int in)
> {
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> + struct virtio_vsock_hdr *hdr;
> struct iov_iter iov_iter;
> size_t nbytes;
> size_t len;
> @@ -366,50 +339,49 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> return NULL;
> }
>
> - pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> - if (!pkt)
> + len = iov_length(vq->iov, out);
> +
> + /* len contains both payload and hdr, so only add additional space for metadata */
> + skb = alloc_skb(len + sizeof(struct virtio_vsock_metadata), GFP_KERNEL);
> + if (!skb)
> return NULL;
>
> - len = iov_length(vq->iov, out);
> + memset(skb->head, 0, sizeof(struct virtio_vsock_metadata));
> + virtio_vsock_skb_reserve(skb);
> iov_iter_init(&iov_iter, WRITE, vq->iov, out, len);
>
> - nbytes = copy_from_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
> - if (nbytes != sizeof(pkt->hdr)) {
> + hdr = vsock_hdr(skb);
> + nbytes = copy_from_iter(hdr, sizeof(*hdr), &iov_iter);
> + if (nbytes != sizeof(*hdr)) {
> vq_err(vq, "Expected %zu bytes for pkt->hdr, got %zu bytes\n",
> - sizeof(pkt->hdr), nbytes);
> - kfree(pkt);
> + sizeof(*hdr), nbytes);
> + kfree_skb(skb);
> return NULL;
> }
>
> - pkt->len = le32_to_cpu(pkt->hdr.len);
> + len = le32_to_cpu(hdr->len);
>
> /* No payload */
> - if (!pkt->len)
> - return pkt;
> + if (!len)
> + return skb;
>
> /* The pkt is too big */
> - if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> - kfree(pkt);
> + if (len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> + kfree_skb(skb);
> return NULL;
> }
>
> - pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
> - if (!pkt->buf) {
> - kfree(pkt);
> - return NULL;
> - }
> + virtio_vsock_skb_rx_put(skb);
>
> - pkt->buf_len = pkt->len;
> -
> - nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
> - if (nbytes != pkt->len) {
> - vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
> - pkt->len, nbytes);
> - virtio_transport_free_pkt(pkt);
> + nbytes = copy_from_iter(skb->data, len, &iov_iter);
> + if (nbytes != len) {
> + vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
> + len, nbytes);
> + kfree_skb(skb);
> return NULL;
> }
>
> - return pkt;
> + return skb;
> }
>
> /* Is there space left for replies to rx packets? */
> @@ -496,7 +468,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
> poll.work);
> struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> dev);
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> int head, pkts = 0, total_len = 0;
> unsigned int out, in;
> bool added = false;
> @@ -511,6 +483,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
>
> vhost_disable_notify(&vsock->dev, vq);
> do {
> + struct virtio_vsock_hdr *hdr;
> + u32 len;
> +
> if (!vhost_vsock_more_replies(vsock)) {
> /* Stop tx until the device processes already
> * pending replies. Leave tx virtqueue
> @@ -532,26 +507,29 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
> break;
> }
>
> - pkt = vhost_vsock_alloc_pkt(vq, out, in);
> - if (!pkt) {
> - vq_err(vq, "Faulted on pkt\n");
> + skb = vhost_vsock_alloc_skb(vq, out, in);
> + if (!skb)
> continue;
> - }
>
> - total_len += sizeof(pkt->hdr) + pkt->len;
> + len = skb->len;
>
> /* Deliver to monitoring devices all received packets */
> - virtio_transport_deliver_tap_pkt(pkt);
> + virtio_transport_deliver_tap_pkt(skb);
> +
> + hdr = vsock_hdr(skb);
>
> /* Only accept correctly addressed packets */
> - if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
> - le64_to_cpu(pkt->hdr.dst_cid) ==
> + if (le64_to_cpu(hdr->src_cid) == vsock->guest_cid &&
> + le64_to_cpu(hdr->dst_cid) ==
> vhost_transport_get_local_cid())
> - virtio_transport_recv_pkt(&vhost_transport, pkt);
> + virtio_transport_recv_pkt(&vhost_transport, skb);
> else
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> +
>
> - vhost_add_used(vq, head, 0);
> + len += sizeof(*hdr);
> + vhost_add_used(vq, head, len);
> + total_len += len;
> added = true;
> } while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
>
> @@ -693,8 +671,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
> VHOST_VSOCK_WEIGHT, true, NULL);
>
> file->private_data = vsock;
> - spin_lock_init(&vsock->send_pkt_list_lock);
> - INIT_LIST_HEAD(&vsock->send_pkt_list);
> + skb_queue_head_init(&vsock->send_pkt_queue);
> vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
> return 0;
>
> @@ -760,16 +737,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
> vhost_vsock_flush(vsock);
> vhost_dev_stop(&vsock->dev);
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - while (!list_empty(&vsock->send_pkt_list)) {
> - struct virtio_vsock_pkt *pkt;
> -
> - pkt = list_first_entry(&vsock->send_pkt_list,
> - struct virtio_vsock_pkt, list);
> - list_del_init(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb_queue_purge(&vsock->send_pkt_queue);
>
> vhost_dev_cleanup(&vsock->dev);
> kfree(vsock->dev.vqs);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 35d7eedb5e8e..17ed01466875 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -4,9 +4,43 @@
>
> #include <uapi/linux/virtio_vsock.h>
> #include <linux/socket.h>
> +#include <vdso/bits.h>
> #include <net/sock.h>
> #include <net/af_vsock.h>
>
> +enum virtio_vsock_metadata_flags {
> + VIRTIO_VSOCK_METADATA_FLAGS_REPLY = BIT(0),
> + VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED = BIT(1),
> +};
> +
> +/* Used only by the virtio/vhost vsock drivers, not related to protocol */
> +struct virtio_vsock_metadata {
> + size_t off;
> + enum virtio_vsock_metadata_flags flags;
> +};
> +
> +#define vsock_hdr(skb) \
> + ((struct virtio_vsock_hdr *) \
> + ((void *)skb->head + sizeof(struct virtio_vsock_metadata)))
> +
> +#define vsock_metadata(skb) \
> + ((struct virtio_vsock_metadata *)skb->head)
> +
> +#define virtio_vsock_skb_reserve(skb) \
> + skb_reserve(skb, \
> + sizeof(struct virtio_vsock_metadata) + \
> + sizeof(struct virtio_vsock_hdr))
> +
> +static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
> +{
> + u32 len;
> +
> + len = le32_to_cpu(vsock_hdr(skb)->len);
> +
> + if (len > 0)
> + skb_put(skb, len);
> +}
> +
> #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
> #define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
> #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
> @@ -35,23 +69,10 @@ struct virtio_vsock_sock {
> u32 last_fwd_cnt;
> u32 rx_bytes;
> u32 buf_alloc;
> - struct list_head rx_queue;
> + struct sk_buff_head rx_queue;
> u32 msg_count;
> };
>
> -struct virtio_vsock_pkt {
> - struct virtio_vsock_hdr hdr;
> - struct list_head list;
> - /* socket refcnt not held, only use for cancellation */
> - struct vsock_sock *vsk;
> - void *buf;
> - u32 buf_len;
> - u32 len;
> - u32 off;
> - bool reply;
> - bool tap_delivered;
> -};
> -
> struct virtio_vsock_pkt_info {
> u32 remote_cid, remote_port;
> struct vsock_sock *vsk;
> @@ -68,7 +89,7 @@ struct virtio_transport {
> struct vsock_transport transport;
>
> /* Takes ownership of the packet */
> - int (*send_pkt)(struct virtio_vsock_pkt *pkt);
> + int (*send_pkt)(struct sk_buff *skb);
> };
>
> ssize_t
> @@ -149,11 +170,10 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> void virtio_transport_destruct(struct vsock_sock *vsk);
>
> void virtio_transport_recv_pkt(struct virtio_transport *t,
> - struct virtio_vsock_pkt *pkt);
> -void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
> -void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
> + struct sk_buff *skb);
> +void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb);
> u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
> void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
> -void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt);
> -
> +void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
> +int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue);
> #endif /* _LINUX_VIRTIO_VSOCK_H */
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index f04abf662ec6..e348b2d09eac 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -748,6 +748,7 @@ static struct sock *__vsock_create(struct net *net,
> vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>
> + sk->sk_allocation = GFP_KERNEL;
> sk->sk_destruct = vsock_sk_destruct;
> sk->sk_backlog_rcv = vsock_queue_rcv_skb;
> sock_reset_flag(sk, SOCK_DONE);
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index ad64f403536a..3bb293fd8607 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -21,6 +21,12 @@
> #include <linux/mutex.h>
> #include <net/af_vsock.h>
>
> +#define VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE \
> + (VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE \
> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) \
> + - sizeof(struct virtio_vsock_hdr) \
> + - sizeof(struct virtio_vsock_metadata))
> +
> static struct workqueue_struct *virtio_vsock_workqueue;
> static struct virtio_vsock __rcu *the_virtio_vsock;
> static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
> @@ -42,8 +48,7 @@ struct virtio_vsock {
> bool tx_run;
>
> struct work_struct send_pkt_work;
> - spinlock_t send_pkt_list_lock;
> - struct list_head send_pkt_list;
> + struct sk_buff_head send_pkt_queue;
>
> atomic_t queued_replies;
>
> @@ -101,41 +106,32 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> vq = vsock->vqs[VSOCK_VQ_TX];
>
> for (;;) {
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> struct scatterlist hdr, buf, *sgs[2];
> int ret, in_sg = 0, out_sg = 0;
> bool reply;
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - if (list_empty(&vsock->send_pkt_list)) {
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> - break;
> - }
> + skb = skb_dequeue(&vsock->send_pkt_queue);
>
> - pkt = list_first_entry(&vsock->send_pkt_list,
> - struct virtio_vsock_pkt, list);
> - list_del_init(&pkt->list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> - virtio_transport_deliver_tap_pkt(pkt);
> + if (!skb)
> + break;
>
> - reply = pkt->reply;
> + virtio_transport_deliver_tap_pkt(skb);
> + reply = vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
>
> - sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
> + sg_init_one(&hdr, vsock_hdr(skb), sizeof(*vsock_hdr(skb)));
> sgs[out_sg++] = &hdr;
> - if (pkt->buf) {
> - sg_init_one(&buf, pkt->buf, pkt->len);
> + if (skb->len > 0) {
> + sg_init_one(&buf, skb->data, skb->len);
> sgs[out_sg++] = &buf;
> }
>
> - ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
> + ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
> /* Usually this means that there is no more space available in
> * the vq
> */
> if (ret < 0) {
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_add(&pkt->list, &vsock->send_pkt_list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb_queue_head(&vsock->send_pkt_queue, skb);
> break;
> }
>
> @@ -163,33 +159,84 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> }
>
> +static inline bool
> +virtio_transport_skbs_can_merge(struct sk_buff *old, struct sk_buff *new)
> +{
> + return (new->len < GOOD_COPY_LEN &&
> + skb_tailroom(old) >= new->len &&
> + vsock_hdr(new)->src_cid == vsock_hdr(old)->src_cid &&
> + vsock_hdr(new)->dst_cid == vsock_hdr(old)->dst_cid &&
> + vsock_hdr(new)->src_port == vsock_hdr(old)->src_port &&
> + vsock_hdr(new)->dst_port == vsock_hdr(old)->dst_port &&
> + vsock_hdr(new)->type == vsock_hdr(old)->type &&
> + vsock_hdr(new)->flags == vsock_hdr(old)->flags &&
> + vsock_hdr(old)->op == VIRTIO_VSOCK_OP_RW &&
> + vsock_hdr(new)->op == VIRTIO_VSOCK_OP_RW);
> +}
> +
> +/**
> + * Merge the two most recent skbs together if possible.
> + *
> + * Caller must hold the queue lock.
> + */
> +static void
> +virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
> +{
> + struct sk_buff *old;
> +
> + spin_lock_bh(&queue->lock);
> + /* In order to reduce skb memory overhead, we merge new packets with
> + * older packets if they pass virtio_transport_skbs_can_merge().
> + */
> + if (skb_queue_empty_lockless(queue)) {
> + __skb_queue_tail(queue, new);
> + goto out;
> + }
> +
> + old = skb_peek_tail(queue);
> +
> + if (!virtio_transport_skbs_can_merge(old, new)) {
> + __skb_queue_tail(queue, new);
> + goto out;
> + }
> +
> + memcpy(skb_put(old, new->len), new->data, new->len);
> + vsock_hdr(old)->len = cpu_to_le32(old->len);
> + vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
> + vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
> + dev_kfree_skb_any(new);
> +
> +out:
> + spin_unlock_bh(&queue->lock);
> +}
> +
> static int
> -virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
> +virtio_transport_send_pkt(struct sk_buff *skb)
> {
> + struct virtio_vsock_hdr *hdr;
> struct virtio_vsock *vsock;
> - int len = pkt->len;
> + int len = skb->len;
> +
> + hdr = vsock_hdr(skb);
>
> rcu_read_lock();
> vsock = rcu_dereference(the_virtio_vsock);
> if (!vsock) {
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> len = -ENODEV;
> goto out_rcu;
> }
>
> - if (le64_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
> - virtio_transport_free_pkt(pkt);
> + if (le64_to_cpu(hdr->dst_cid) == vsock->guest_cid) {
> + kfree_skb(skb);
> len = -ENODEV;
> goto out_rcu;
> }
>
> - if (pkt->reply)
> + if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
> atomic_inc(&vsock->queued_replies);
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_add_tail(&pkt->list, &vsock->send_pkt_list);
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> + virtio_transport_add_to_queue(&vsock->send_pkt_queue, skb);
> queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>
> out_rcu:
> @@ -201,9 +248,7 @@ static int
> virtio_transport_cancel_pkt(struct vsock_sock *vsk)
> {
> struct virtio_vsock *vsock;
> - struct virtio_vsock_pkt *pkt, *n;
> int cnt = 0, ret;
> - LIST_HEAD(freeme);
>
> rcu_read_lock();
> vsock = rcu_dereference(the_virtio_vsock);
> @@ -212,20 +257,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
> goto out_rcu;
> }
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
> - if (pkt->vsk != vsk)
> - continue;
> - list_move(&pkt->list, &freeme);
> - }
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> - list_for_each_entry_safe(pkt, n, &freeme, list) {
> - if (pkt->reply)
> - cnt++;
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> + cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>
> if (cnt) {
> struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> @@ -246,38 +278,34 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>
> static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
> {
> - int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> - struct virtio_vsock_pkt *pkt;
> - struct scatterlist hdr, buf, *sgs[2];
> + struct scatterlist pkt, *sgs[1];
> struct virtqueue *vq;
> int ret;
>
> vq = vsock->vqs[VSOCK_VQ_RX];
>
> do {
> - pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> - if (!pkt)
> - break;
> + struct sk_buff *skb;
> + const size_t len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE -
> + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
> - pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> - if (!pkt->buf) {
> - virtio_transport_free_pkt(pkt);
> + skb = alloc_skb(len, GFP_KERNEL);
> + if (!skb)
> break;
> - }
>
> - pkt->buf_len = buf_len;
> - pkt->len = buf_len;
> + memset(skb->head, 0,
> + sizeof(struct virtio_vsock_metadata) + sizeof(struct virtio_vsock_hdr));
>
> - sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
> - sgs[0] = &hdr;
> + sg_init_one(&pkt, skb->head + sizeof(struct virtio_vsock_metadata),
> + VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE);
> + sgs[0] = &pkt;
>
> - sg_init_one(&buf, pkt->buf, buf_len);
> - sgs[1] = &buf;
> - ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
> - if (ret) {
> - virtio_transport_free_pkt(pkt);
> + ret = virtqueue_add_sgs(vq, sgs, 0, 1, skb, GFP_KERNEL);
> + if (ret < 0) {
> + kfree_skb(skb);
> break;
> }
> +
> vsock->rx_buf_nr++;
> } while (vq->num_free);
> if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
> @@ -299,12 +327,12 @@ static void virtio_transport_tx_work(struct work_struct *work)
> goto out;
>
> do {
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> unsigned int len;
>
> virtqueue_disable_cb(vq);
> - while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
> - virtio_transport_free_pkt(pkt);
> + while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
> + consume_skb(skb);
> added = true;
> }
> } while (!virtqueue_enable_cb(vq));
> @@ -529,7 +557,8 @@ static void virtio_transport_rx_work(struct work_struct *work)
> do {
> virtqueue_disable_cb(vq);
> for (;;) {
> - struct virtio_vsock_pkt *pkt;
> + struct virtio_vsock_hdr *hdr;
> + struct sk_buff *skb;
> unsigned int len;
>
> if (!virtio_transport_more_replies(vsock)) {
> @@ -540,23 +569,24 @@ static void virtio_transport_rx_work(struct work_struct *work)
> goto out;
> }
>
> - pkt = virtqueue_get_buf(vq, &len);
> - if (!pkt) {
> + skb = virtqueue_get_buf(vq, &len);
> + if (!skb)
> break;
> - }
>
> vsock->rx_buf_nr--;
>
> /* Drop short/long packets */
> - if (unlikely(len < sizeof(pkt->hdr) ||
> - len > sizeof(pkt->hdr) + pkt->len)) {
> - virtio_transport_free_pkt(pkt);
> + if (unlikely(len < sizeof(*hdr) ||
> + len > VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE)) {
> + kfree_skb(skb);
> continue;
> }
>
> - pkt->len = len - sizeof(pkt->hdr);
> - virtio_transport_deliver_tap_pkt(pkt);
> - virtio_transport_recv_pkt(&virtio_transport, pkt);
> + hdr = vsock_hdr(skb);
> + virtio_vsock_skb_reserve(skb);
> + virtio_vsock_skb_rx_put(skb);
> + virtio_transport_deliver_tap_pkt(skb);
> + virtio_transport_recv_pkt(&virtio_transport, skb);
> }
> } while (!virtqueue_enable_cb(vq));
>
> @@ -610,7 +640,7 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
> static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
> {
> struct virtio_device *vdev = vsock->vdev;
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
>
> /* Reset all connected sockets when the VQs disappear */
> vsock_for_each_connected_socket(&virtio_transport.transport,
> @@ -637,23 +667,16 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
> virtio_reset_device(vdev);
>
> mutex_lock(&vsock->rx_lock);
> - while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
> - virtio_transport_free_pkt(pkt);
> + while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
> + kfree_skb(skb);
> mutex_unlock(&vsock->rx_lock);
>
> mutex_lock(&vsock->tx_lock);
> - while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
> - virtio_transport_free_pkt(pkt);
> + while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
> + kfree_skb(skb);
> mutex_unlock(&vsock->tx_lock);
>
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - while (!list_empty(&vsock->send_pkt_list)) {
> - pkt = list_first_entry(&vsock->send_pkt_list,
> - struct virtio_vsock_pkt, list);
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb_queue_purge(&vsock->send_pkt_queue);
>
> /* Delete virtqueues and flush outstanding callbacks if any */
> vdev->config->del_vqs(vdev);
> @@ -690,8 +713,7 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> mutex_init(&vsock->tx_lock);
> mutex_init(&vsock->rx_lock);
> mutex_init(&vsock->event_lock);
> - spin_lock_init(&vsock->send_pkt_list_lock);
> - INIT_LIST_HEAD(&vsock->send_pkt_list);
> + skb_queue_head_init(&vsock->send_pkt_queue);
> INIT_WORK(&vsock->rx_work, virtio_transport_rx_work);
> INIT_WORK(&vsock->tx_work, virtio_transport_tx_work);
> INIT_WORK(&vsock->event_work, virtio_transport_event_work);
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index ec2c2afbf0d0..920578597bb9 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -37,53 +37,81 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> return container_of(t, struct virtio_transport, transport);
> }
>
> -static struct virtio_vsock_pkt *
> -virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
> +/* Returns a new packet on success, otherwise returns NULL.
> + *
> + * If NULL is returned, errp is set to a negative errno.
> + */
> +static struct sk_buff *
> +virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> size_t len,
> u32 src_cid,
> u32 src_port,
> u32 dst_cid,
> - u32 dst_port)
> + u32 dst_port,
> + int *errp)
> {
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> + struct virtio_vsock_hdr *hdr;
> + void *payload;
> + const size_t skb_len = sizeof(*hdr) + sizeof(struct virtio_vsock_metadata) + len;
> int err;
>
> - pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> - if (!pkt)
> - return NULL;
> + if (info->vsk) {
> + unsigned int msg_flags = info->msg ? info->msg->msg_flags : 0;
> + struct sock *sk;
>
> - pkt->hdr.type = cpu_to_le16(info->type);
> - pkt->hdr.op = cpu_to_le16(info->op);
> - pkt->hdr.src_cid = cpu_to_le64(src_cid);
> - pkt->hdr.dst_cid = cpu_to_le64(dst_cid);
> - pkt->hdr.src_port = cpu_to_le32(src_port);
> - pkt->hdr.dst_port = cpu_to_le32(dst_port);
> - pkt->hdr.flags = cpu_to_le32(info->flags);
> - pkt->len = len;
> - pkt->hdr.len = cpu_to_le32(len);
> - pkt->reply = info->reply;
> - pkt->vsk = info->vsk;
> + sk = sk_vsock(info->vsk);
> + skb = sock_alloc_send_skb(sk, skb_len,
> + msg_flags & MSG_DONTWAIT, errp);
>
> - if (info->msg && len > 0) {
> - pkt->buf = kmalloc(len, GFP_KERNEL);
> - if (!pkt->buf)
> - goto out_pkt;
> + if (skb)
> + skb->priority = sk->sk_priority;
> + } else {
> + skb = alloc_skb(skb_len, GFP_KERNEL);
> + }
> +
> + if (!skb) {
> + /* If using alloc_skb(), the skb is NULL due to lacking memory.
> + * Otherwise, errp is set by sock_alloc_send_skb().
> + */
> + if (!info->vsk)
> + *errp = -ENOMEM;
> + return NULL;
> + }
>
> - pkt->buf_len = len;
> + memset(skb->head, 0, sizeof(*hdr) + sizeof(struct virtio_vsock_metadata));
> + virtio_vsock_skb_reserve(skb);
> + payload = skb_put(skb, len);
>
> - err = memcpy_from_msg(pkt->buf, info->msg, len);
> - if (err)
> + hdr = vsock_hdr(skb);
> + hdr->type = cpu_to_le16(info->type);
> + hdr->op = cpu_to_le16(info->op);
> + hdr->src_cid = cpu_to_le64(src_cid);
> + hdr->dst_cid = cpu_to_le64(dst_cid);
> + hdr->src_port = cpu_to_le32(src_port);
> + hdr->dst_port = cpu_to_le32(dst_port);
> + hdr->flags = cpu_to_le32(info->flags);
> + hdr->len = cpu_to_le32(len);
> +
> + if (info->msg && len > 0) {
> + err = memcpy_from_msg(payload, info->msg, len);
> + if (err) {
> + *errp = -ENOMEM;
> goto out;
> + }
>
> if (msg_data_left(info->msg) == 0 &&
> info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> - pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> + hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>
> if (info->msg->msg_flags & MSG_EOR)
> - pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> + hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> }
> }
>
> + if (info->reply)
> + vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
> +
> trace_virtio_transport_alloc_pkt(src_cid, src_port,
> dst_cid, dst_port,
> len,
> @@ -91,85 +119,26 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
> info->op,
> info->flags);
>
> - return pkt;
> + return skb;
>
> out:
> - kfree(pkt->buf);
> -out_pkt:
> - kfree(pkt);
> + kfree_skb(skb);
> return NULL;
> }
>
> /* Packet capture */
> static struct sk_buff *virtio_transport_build_skb(void *opaque)
> {
> - struct virtio_vsock_pkt *pkt = opaque;
> - struct af_vsockmon_hdr *hdr;
> - struct sk_buff *skb;
> - size_t payload_len;
> - void *payload_buf;
> -
> - /* A packet could be split to fit the RX buffer, so we can retrieve
> - * the payload length from the header and the buffer pointer taking
> - * care of the offset in the original packet.
> - */
> - payload_len = le32_to_cpu(pkt->hdr.len);
> - payload_buf = pkt->buf + pkt->off;
> -
> - skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + payload_len,
> - GFP_ATOMIC);
> - if (!skb)
> - return NULL;
> -
> - hdr = skb_put(skb, sizeof(*hdr));
> -
> - /* pkt->hdr is little-endian so no need to byteswap here */
> - hdr->src_cid = pkt->hdr.src_cid;
> - hdr->src_port = pkt->hdr.src_port;
> - hdr->dst_cid = pkt->hdr.dst_cid;
> - hdr->dst_port = pkt->hdr.dst_port;
> -
> - hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
> - hdr->len = cpu_to_le16(sizeof(pkt->hdr));
> - memset(hdr->reserved, 0, sizeof(hdr->reserved));
> -
> - switch (le16_to_cpu(pkt->hdr.op)) {
> - case VIRTIO_VSOCK_OP_REQUEST:
> - case VIRTIO_VSOCK_OP_RESPONSE:
> - hdr->op = cpu_to_le16(AF_VSOCK_OP_CONNECT);
> - break;
> - case VIRTIO_VSOCK_OP_RST:
> - case VIRTIO_VSOCK_OP_SHUTDOWN:
> - hdr->op = cpu_to_le16(AF_VSOCK_OP_DISCONNECT);
> - break;
> - case VIRTIO_VSOCK_OP_RW:
> - hdr->op = cpu_to_le16(AF_VSOCK_OP_PAYLOAD);
> - break;
> - case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
> - case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
> - hdr->op = cpu_to_le16(AF_VSOCK_OP_CONTROL);
> - break;
> - default:
> - hdr->op = cpu_to_le16(AF_VSOCK_OP_UNKNOWN);
> - break;
> - }
> -
> - skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
> -
> - if (payload_len) {
> - skb_put_data(skb, payload_buf, payload_len);
> - }
> -
> - return skb;
> + return (struct sk_buff *)opaque;
> }
>
> -void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
> +void virtio_transport_deliver_tap_pkt(struct sk_buff *skb)
> {
> - if (pkt->tap_delivered)
> + if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED)
> return;
>
> - vsock_deliver_tap(virtio_transport_build_skb, pkt);
> - pkt->tap_delivered = true;
> + vsock_deliver_tap(virtio_transport_build_skb, skb);
> + vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
> }
> EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>
> @@ -192,8 +161,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> u32 src_cid, src_port, dst_cid, dst_port;
> const struct virtio_transport *t_ops;
> struct virtio_vsock_sock *vvs;
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> u32 pkt_len = info->pkt_len;
> + int err;
>
> info->type = virtio_transport_get_type(sk_vsock(vsk));
>
> @@ -224,42 +194,47 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> return pkt_len;
>
> - pkt = virtio_transport_alloc_pkt(info, pkt_len,
> + skb = virtio_transport_alloc_skb(info, pkt_len,
> src_cid, src_port,
> - dst_cid, dst_port);
> - if (!pkt) {
> + dst_cid, dst_port,
> + &err);
> + if (!skb) {
> virtio_transport_put_credit(vvs, pkt_len);
> - return -ENOMEM;
> + return err;
> }
>
> - virtio_transport_inc_tx_pkt(vvs, pkt);
> + virtio_transport_inc_tx_pkt(vvs, skb);
> +
> + err = t_ops->send_pkt(skb);
>
> - return t_ops->send_pkt(pkt);
> + return err < 0 ? -ENOMEM : err;
> }
>
> static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> - if (vvs->rx_bytes + pkt->len > vvs->buf_alloc)
> + if (vvs->rx_bytes + skb->len > vvs->buf_alloc)
> return false;
>
> - vvs->rx_bytes += pkt->len;
> + vvs->rx_bytes += skb->len;
> return true;
> }
>
> static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> - vvs->rx_bytes -= pkt->len;
> - vvs->fwd_cnt += pkt->len;
> + vvs->rx_bytes -= skb->len;
> + vvs->fwd_cnt += skb->len;
> }
>
> -void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
> +void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb)
> {
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> +
> spin_lock_bh(&vvs->rx_lock);
> vvs->last_fwd_cnt = vvs->fwd_cnt;
> - pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> - pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
> + hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> + hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
> spin_unlock_bh(&vvs->rx_lock);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
> @@ -303,29 +278,29 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> size_t len)
> {
> struct virtio_vsock_sock *vvs = vsk->trans;
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb, *tmp;
> size_t bytes, total = 0, off;
> int err = -EFAULT;
>
> spin_lock_bh(&vvs->rx_lock);
>
> - list_for_each_entry(pkt, &vvs->rx_queue, list) {
> - off = pkt->off;
> + skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
> + off = vsock_metadata(skb)->off;
>
> if (total == len)
> break;
>
> - while (total < len && off < pkt->len) {
> + while (total < len && off < skb->len) {
> bytes = len - total;
> - if (bytes > pkt->len - off)
> - bytes = pkt->len - off;
> + if (bytes > skb->len - off)
> + bytes = skb->len - off;
>
> /* sk_lock is held by caller so no one else can dequeue.
> * Unlock rx_lock since memcpy_to_msg() may sleep.
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, pkt->buf + off, bytes);
> + err = memcpy_to_msg(msg, skb->data + off, bytes);
> if (err)
> goto out;
>
> @@ -352,37 +327,40 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> size_t len)
> {
> struct virtio_vsock_sock *vvs = vsk->trans;
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> size_t bytes, total = 0;
> u32 free_space;
> int err = -EFAULT;
>
> spin_lock_bh(&vvs->rx_lock);
> - while (total < len && !list_empty(&vvs->rx_queue)) {
> - pkt = list_first_entry(&vvs->rx_queue,
> - struct virtio_vsock_pkt, list);
> + while (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> + skb = __skb_dequeue(&vvs->rx_queue);
>
> bytes = len - total;
> - if (bytes > pkt->len - pkt->off)
> - bytes = pkt->len - pkt->off;
> + if (bytes > skb->len - vsock_metadata(skb)->off)
> + bytes = skb->len - vsock_metadata(skb)->off;
>
> /* sk_lock is held by caller so no one else can dequeue.
> * Unlock rx_lock since memcpy_to_msg() may sleep.
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> + err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, bytes);
> if (err)
> goto out;
>
> spin_lock_bh(&vvs->rx_lock);
>
> total += bytes;
> - pkt->off += bytes;
> - if (pkt->off == pkt->len) {
> - virtio_transport_dec_rx_pkt(vvs, pkt);
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> + vsock_metadata(skb)->off += bytes;
> +
> + WARN_ON(vsock_metadata(skb)->off > skb->len);
> +
> + if (vsock_metadata(skb)->off == skb->len) {
> + virtio_transport_dec_rx_pkt(vvs, skb);
> + consume_skb(skb);
> + } else {
> + __skb_queue_head(&vvs->rx_queue, skb);
> }
> }
>
> @@ -414,7 +392,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> int flags)
> {
> struct virtio_vsock_sock *vvs = vsk->trans;
> - struct virtio_vsock_pkt *pkt;
> + struct sk_buff *skb;
> int dequeued_len = 0;
> size_t user_buf_len = msg_data_left(msg);
> bool msg_ready = false;
> @@ -427,13 +405,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> }
>
> while (!msg_ready) {
> - pkt = list_first_entry(&vvs->rx_queue, struct virtio_vsock_pkt, list);
> + struct virtio_vsock_hdr *hdr;
> +
> + skb = __skb_dequeue(&vvs->rx_queue);
> + hdr = vsock_hdr(skb);
>
> if (dequeued_len >= 0) {
> size_t pkt_len;
> size_t bytes_to_copy;
>
> - pkt_len = (size_t)le32_to_cpu(pkt->hdr.len);
> + pkt_len = (size_t)le32_to_cpu(hdr->len);
> bytes_to_copy = min(user_buf_len, pkt_len);
>
> if (bytes_to_copy) {
> @@ -444,7 +425,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, pkt->buf, bytes_to_copy);
> + err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
> if (err) {
> /* Copy of message failed. Rest of
> * fragments will be freed without copy.
> @@ -461,17 +442,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> dequeued_len += pkt_len;
> }
>
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> msg_ready = true;
> vvs->msg_count--;
>
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR)
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
> msg->msg_flags |= MSG_EOR;
> }
>
> - virtio_transport_dec_rx_pkt(vvs, pkt);
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> + virtio_transport_dec_rx_pkt(vvs, skb);
> + kfree_skb(skb);
> }
>
> spin_unlock_bh(&vvs->rx_lock);
> @@ -609,7 +589,7 @@ int virtio_transport_do_socket_init(struct vsock_sock *vsk,
>
> spin_lock_init(&vvs->rx_lock);
> spin_lock_init(&vvs->tx_lock);
> - INIT_LIST_HEAD(&vvs->rx_queue);
> + skb_queue_head_init(&vvs->rx_queue);
>
> return 0;
> }
> @@ -809,16 +789,16 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
> EXPORT_SYMBOL_GPL(virtio_transport_destruct);
>
> static int virtio_transport_reset(struct vsock_sock *vsk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct virtio_vsock_pkt_info info = {
> .op = VIRTIO_VSOCK_OP_RST,
> - .reply = !!pkt,
> + .reply = !!skb,
> .vsk = vsk,
> };
>
> /* Send RST only if the original pkt is not a RST pkt */
> - if (pkt && le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> + if (skb && le16_to_cpu(vsock_hdr(skb)->op) == VIRTIO_VSOCK_OP_RST)
> return 0;
>
> return virtio_transport_send_pkt_info(vsk, &info);
> @@ -828,29 +808,32 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
> * attempt was made to connect to a socket that does not exist.
> */
> static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> - struct virtio_vsock_pkt *reply;
> + struct sk_buff *reply;
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> struct virtio_vsock_pkt_info info = {
> .op = VIRTIO_VSOCK_OP_RST,
> - .type = le16_to_cpu(pkt->hdr.type),
> + .type = le16_to_cpu(hdr->type),
> .reply = true,
> };
> + int err;
>
> /* Send RST only if the original pkt is not a RST pkt */
> - if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> + if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
> return 0;
>
> - reply = virtio_transport_alloc_pkt(&info, 0,
> - le64_to_cpu(pkt->hdr.dst_cid),
> - le32_to_cpu(pkt->hdr.dst_port),
> - le64_to_cpu(pkt->hdr.src_cid),
> - le32_to_cpu(pkt->hdr.src_port));
> + reply = virtio_transport_alloc_skb(&info, 0,
> + le64_to_cpu(hdr->dst_cid),
> + le32_to_cpu(hdr->dst_port),
> + le64_to_cpu(hdr->src_cid),
> + le32_to_cpu(hdr->src_port),
> + &err);
> if (!reply)
> - return -ENOMEM;
> + return err;
>
> if (!t) {
> - virtio_transport_free_pkt(reply);
> + kfree_skb(reply);
> return -ENOTCONN;
> }
>
> @@ -861,16 +844,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> static void virtio_transport_remove_sock(struct vsock_sock *vsk)
> {
> struct virtio_vsock_sock *vvs = vsk->trans;
> - struct virtio_vsock_pkt *pkt, *tmp;
>
> /* We don't need to take rx_lock, as the socket is closing and we are
> * removing it.
> */
> - list_for_each_entry_safe(pkt, tmp, &vvs->rx_queue, list) {
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> -
> + __skb_queue_purge(&vvs->rx_queue);
> vsock_remove_sock(vsk);
> }
>
> @@ -984,13 +962,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_release);
>
> static int
> virtio_transport_recv_connecting(struct sock *sk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> int err;
> int skerr;
>
> - switch (le16_to_cpu(pkt->hdr.op)) {
> + switch (le16_to_cpu(hdr->op)) {
> case VIRTIO_VSOCK_OP_RESPONSE:
> sk->sk_state = TCP_ESTABLISHED;
> sk->sk_socket->state = SS_CONNECTED;
> @@ -1011,7 +990,7 @@ virtio_transport_recv_connecting(struct sock *sk,
> return 0;
>
> destroy:
> - virtio_transport_reset(vsk, pkt);
> + virtio_transport_reset(vsk, skb);
> sk->sk_state = TCP_CLOSE;
> sk->sk_err = skerr;
> sk_error_report(sk);
> @@ -1020,34 +999,38 @@ virtio_transport_recv_connecting(struct sock *sk,
>
> static void
> virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct virtio_vsock_sock *vvs = vsk->trans;
> + struct virtio_vsock_hdr *hdr;
> bool can_enqueue, free_pkt = false;
> + u32 len;
>
> - pkt->len = le32_to_cpu(pkt->hdr.len);
> - pkt->off = 0;
> + hdr = vsock_hdr(skb);
> + len = le32_to_cpu(hdr->len);
> + vsock_metadata(skb)->off = 0;
>
> spin_lock_bh(&vvs->rx_lock);
>
> - can_enqueue = virtio_transport_inc_rx_pkt(vvs, pkt);
> + can_enqueue = virtio_transport_inc_rx_pkt(vvs, skb);
> if (!can_enqueue) {
> free_pkt = true;
> goto out;
> }
>
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)
> vvs->msg_count++;
>
> /* Try to copy small packets into the buffer of last packet queued,
> * to avoid wasting memory queueing the entire buffer with a small
> * payload.
> */
> - if (pkt->len <= GOOD_COPY_LEN && !list_empty(&vvs->rx_queue)) {
> - struct virtio_vsock_pkt *last_pkt;
> + if (len <= GOOD_COPY_LEN && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> + struct virtio_vsock_hdr *last_hdr;
> + struct sk_buff *last_skb;
>
> - last_pkt = list_last_entry(&vvs->rx_queue,
> - struct virtio_vsock_pkt, list);
> + last_skb = skb_peek_tail(&vvs->rx_queue);
> + last_hdr = vsock_hdr(last_skb);
>
> /* If there is space in the last packet queued, we copy the
> * new packet in its buffer. We avoid this if the last packet
> @@ -1055,35 +1038,35 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> * delimiter of SEQPACKET message, so 'pkt' is the first packet
> * of a new message.
> */
> - if ((pkt->len <= last_pkt->buf_len - last_pkt->len) &&
> - !(le32_to_cpu(last_pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)) {
> - memcpy(last_pkt->buf + last_pkt->len, pkt->buf,
> - pkt->len);
> - last_pkt->len += pkt->len;
> + if (skb->len < skb_tailroom(last_skb) &&
> +		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) &&
> + (vsock_hdr(skb)->type != VIRTIO_VSOCK_TYPE_DGRAM)) {
> + memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
> free_pkt = true;
> - last_pkt->hdr.flags |= pkt->hdr.flags;
> + last_hdr->flags |= hdr->flags;
> goto out;
> }
> }
>
> - list_add_tail(&pkt->list, &vvs->rx_queue);
> + __skb_queue_tail(&vvs->rx_queue, skb);
>
> out:
> spin_unlock_bh(&vvs->rx_lock);
> if (free_pkt)
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> }
>
> static int
> virtio_transport_recv_connected(struct sock *sk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> int err = 0;
>
> - switch (le16_to_cpu(pkt->hdr.op)) {
> + switch (le16_to_cpu(hdr->op)) {
> case VIRTIO_VSOCK_OP_RW:
> - virtio_transport_recv_enqueue(vsk, pkt);
> + virtio_transport_recv_enqueue(vsk, skb);
> sk->sk_data_ready(sk);
> return err;
> case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
> @@ -1093,18 +1076,17 @@ virtio_transport_recv_connected(struct sock *sk,
> sk->sk_write_space(sk);
> break;
> case VIRTIO_VSOCK_OP_SHUTDOWN:
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
> vsk->peer_shutdown |= RCV_SHUTDOWN;
> - if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
> + if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
> vsk->peer_shutdown |= SEND_SHUTDOWN;
> if (vsk->peer_shutdown == SHUTDOWN_MASK &&
> vsock_stream_has_data(vsk) <= 0 &&
> !sock_flag(sk, SOCK_DONE)) {
> (void)virtio_transport_reset(vsk, NULL);
> -
> virtio_transport_do_close(vsk, true);
> }
> - if (le32_to_cpu(pkt->hdr.flags))
> +		if (le32_to_cpu(hdr->flags))
> sk->sk_state_change(sk);
> break;
> case VIRTIO_VSOCK_OP_RST:
> @@ -1115,28 +1097,30 @@ virtio_transport_recv_connected(struct sock *sk,
> break;
> }
>
> - virtio_transport_free_pkt(pkt);
> + kfree_skb(skb);
> return err;
> }
>
> static void
> virtio_transport_recv_disconnecting(struct sock *sk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>
> - if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> + if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
> virtio_transport_do_close(vsk, true);
> }
>
> static int
> virtio_transport_send_response(struct vsock_sock *vsk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> struct virtio_vsock_pkt_info info = {
> .op = VIRTIO_VSOCK_OP_RESPONSE,
> - .remote_cid = le64_to_cpu(pkt->hdr.src_cid),
> - .remote_port = le32_to_cpu(pkt->hdr.src_port),
> + .remote_cid = le64_to_cpu(hdr->src_cid),
> + .remote_port = le32_to_cpu(hdr->src_port),
> .reply = true,
> .vsk = vsk,
> };
> @@ -1145,10 +1129,11 @@ virtio_transport_send_response(struct vsock_sock *vsk,
> }
>
> static bool virtio_transport_space_update(struct sock *sk,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
> struct virtio_vsock_sock *vvs = vsk->trans;
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> bool space_available;
>
> /* Listener sockets are not associated with any transport, so we are
> @@ -1161,8 +1146,8 @@ static bool virtio_transport_space_update(struct sock *sk,
>
> /* buf_alloc and fwd_cnt is always included in the hdr */
> spin_lock_bh(&vvs->tx_lock);
> - vvs->peer_buf_alloc = le32_to_cpu(pkt->hdr.buf_alloc);
> - vvs->peer_fwd_cnt = le32_to_cpu(pkt->hdr.fwd_cnt);
> + vvs->peer_buf_alloc = le32_to_cpu(hdr->buf_alloc);
> + vvs->peer_fwd_cnt = le32_to_cpu(hdr->fwd_cnt);
> space_available = virtio_transport_has_space(vsk);
> spin_unlock_bh(&vvs->tx_lock);
> return space_available;
> @@ -1170,27 +1155,28 @@ static bool virtio_transport_space_update(struct sock *sk,
>
> /* Handle server socket */
> static int
> -virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
> +virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> struct virtio_transport *t)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> struct vsock_sock *vchild;
> struct sock *child;
> int ret;
>
> - if (le16_to_cpu(pkt->hdr.op) != VIRTIO_VSOCK_OP_REQUEST) {
> - virtio_transport_reset_no_sock(t, pkt);
> + if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
> + virtio_transport_reset_no_sock(t, skb);
> return -EINVAL;
> }
>
> if (sk_acceptq_is_full(sk)) {
> - virtio_transport_reset_no_sock(t, pkt);
> + virtio_transport_reset_no_sock(t, skb);
> return -ENOMEM;
> }
>
> child = vsock_create_connected(sk);
> if (!child) {
> - virtio_transport_reset_no_sock(t, pkt);
> + virtio_transport_reset_no_sock(t, skb);
> return -ENOMEM;
> }
>
> @@ -1201,10 +1187,10 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
> child->sk_state = TCP_ESTABLISHED;
>
> vchild = vsock_sk(child);
> - vsock_addr_init(&vchild->local_addr, le64_to_cpu(pkt->hdr.dst_cid),
> - le32_to_cpu(pkt->hdr.dst_port));
> - vsock_addr_init(&vchild->remote_addr, le64_to_cpu(pkt->hdr.src_cid),
> - le32_to_cpu(pkt->hdr.src_port));
> + vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
> + le32_to_cpu(hdr->dst_port));
> + vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
> + le32_to_cpu(hdr->src_port));
>
> ret = vsock_assign_transport(vchild, vsk);
> /* Transport assigned (looking at remote_addr) must be the same
> @@ -1212,17 +1198,17 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
> */
> if (ret || vchild->transport != &t->transport) {
> release_sock(child);
> - virtio_transport_reset_no_sock(t, pkt);
> + virtio_transport_reset_no_sock(t, skb);
> sock_put(child);
> return ret;
> }
>
> - if (virtio_transport_space_update(child, pkt))
> + if (virtio_transport_space_update(child, skb))
> child->sk_write_space(child);
>
> vsock_insert_connected(vchild);
> vsock_enqueue_accept(sk, child);
> - virtio_transport_send_response(vchild, pkt);
> + virtio_transport_send_response(vchild, skb);
>
> release_sock(child);
>
> @@ -1240,29 +1226,30 @@ static bool virtio_transport_valid_type(u16 type)
> * lock.
> */
> void virtio_transport_recv_pkt(struct virtio_transport *t,
> - struct virtio_vsock_pkt *pkt)
> + struct sk_buff *skb)
> {
> struct sockaddr_vm src, dst;
> struct vsock_sock *vsk;
> struct sock *sk;
> bool space_available;
> + struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>
> - vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
> - le32_to_cpu(pkt->hdr.src_port));
> - vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
> - le32_to_cpu(pkt->hdr.dst_port));
> + vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
> + le32_to_cpu(hdr->src_port));
> + vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
> + le32_to_cpu(hdr->dst_port));
>
> trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
> dst.svm_cid, dst.svm_port,
> - le32_to_cpu(pkt->hdr.len),
> - le16_to_cpu(pkt->hdr.type),
> - le16_to_cpu(pkt->hdr.op),
> - le32_to_cpu(pkt->hdr.flags),
> - le32_to_cpu(pkt->hdr.buf_alloc),
> - le32_to_cpu(pkt->hdr.fwd_cnt));
> -
> - if (!virtio_transport_valid_type(le16_to_cpu(pkt->hdr.type))) {
> - (void)virtio_transport_reset_no_sock(t, pkt);
> + le32_to_cpu(hdr->len),
> + le16_to_cpu(hdr->type),
> + le16_to_cpu(hdr->op),
> + le32_to_cpu(hdr->flags),
> + le32_to_cpu(hdr->buf_alloc),
> + le32_to_cpu(hdr->fwd_cnt));
> +
> + if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
> + (void)virtio_transport_reset_no_sock(t, skb);
> goto free_pkt;
> }
>
> @@ -1273,13 +1260,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> if (!sk) {
> sk = vsock_find_bound_socket(&dst);
> if (!sk) {
> - (void)virtio_transport_reset_no_sock(t, pkt);
> + (void)virtio_transport_reset_no_sock(t, skb);
> goto free_pkt;
> }
> }
>
> - if (virtio_transport_get_type(sk) != le16_to_cpu(pkt->hdr.type)) {
> - (void)virtio_transport_reset_no_sock(t, pkt);
> + if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
> + (void)virtio_transport_reset_no_sock(t, skb);
> sock_put(sk);
> goto free_pkt;
> }
> @@ -1290,13 +1277,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>
> /* Check if sk has been closed before lock_sock */
> if (sock_flag(sk, SOCK_DONE)) {
> - (void)virtio_transport_reset_no_sock(t, pkt);
> + (void)virtio_transport_reset_no_sock(t, skb);
> release_sock(sk);
> sock_put(sk);
> goto free_pkt;
> }
>
> - space_available = virtio_transport_space_update(sk, pkt);
> + space_available = virtio_transport_space_update(sk, skb);
>
> /* Update CID in case it has changed after a transport reset event */
> if (vsk->local_addr.svm_cid != VMADDR_CID_ANY)
> @@ -1307,23 +1294,23 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>
> switch (sk->sk_state) {
> case TCP_LISTEN:
> - virtio_transport_recv_listen(sk, pkt, t);
> - virtio_transport_free_pkt(pkt);
> + virtio_transport_recv_listen(sk, skb, t);
> + kfree_skb(skb);
> break;
> case TCP_SYN_SENT:
> - virtio_transport_recv_connecting(sk, pkt);
> - virtio_transport_free_pkt(pkt);
> + virtio_transport_recv_connecting(sk, skb);
> + kfree_skb(skb);
> break;
> case TCP_ESTABLISHED:
> - virtio_transport_recv_connected(sk, pkt);
> + virtio_transport_recv_connected(sk, skb);
> break;
> case TCP_CLOSING:
> - virtio_transport_recv_disconnecting(sk, pkt);
> - virtio_transport_free_pkt(pkt);
> + virtio_transport_recv_disconnecting(sk, skb);
> + kfree_skb(skb);
> break;
> default:
> - (void)virtio_transport_reset_no_sock(t, pkt);
> - virtio_transport_free_pkt(pkt);
> + (void)virtio_transport_reset_no_sock(t, skb);
> + kfree_skb(skb);
> break;
> }
>
> @@ -1336,16 +1323,42 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> return;
>
> free_pkt:
> - virtio_transport_free_pkt(pkt);
> +	kfree_skb(skb);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>
> -void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> +/* Remove skbs found in a queue that have a vsk that matches.
> + *
> + * Each skb is freed.
> + *
> + * Returns the count of skbs that were reply packets.
> + */
> +int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue)
> {
> - kfree(pkt->buf);
> - kfree(pkt);
> + int cnt = 0;
> + struct sk_buff *skb, *tmp;
> + struct sk_buff_head freeme;
> +
> + skb_queue_head_init(&freeme);
> +
> + spin_lock_bh(&queue->lock);
> + skb_queue_walk_safe(queue, skb, tmp) {
> + if (vsock_sk(skb->sk) != vsk)
> + continue;
> +
> + __skb_unlink(skb, queue);
> + skb_queue_tail(&freeme, skb);
> +
> + if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
> + cnt++;
> + }
> + spin_unlock_bh(&queue->lock);
> +
> + skb_queue_purge(&freeme);
> +
> + return cnt;
> }
> -EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> +EXPORT_SYMBOL_GPL(virtio_transport_purge_skbs);
>
> MODULE_LICENSE("GPL v2");
> MODULE_AUTHOR("Asias He");
> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> index 169a8cf65b39..906f7cdff65e 100644
> --- a/net/vmw_vsock/vsock_loopback.c
> +++ b/net/vmw_vsock/vsock_loopback.c
> @@ -16,7 +16,7 @@ struct vsock_loopback {
> struct workqueue_struct *workqueue;
>
> spinlock_t pkt_list_lock; /* protects pkt_list */
> - struct list_head pkt_list;
> + struct sk_buff_head pkt_queue;
> struct work_struct pkt_work;
> };
>
> @@ -27,13 +27,13 @@ static u32 vsock_loopback_get_local_cid(void)
> return VMADDR_CID_LOCAL;
> }
>
> -static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
> +static int vsock_loopback_send_pkt(struct sk_buff *skb)
> {
> struct vsock_loopback *vsock = &the_vsock_loopback;
> - int len = pkt->len;
> + int len = skb->len;
>
> spin_lock_bh(&vsock->pkt_list_lock);
> - list_add_tail(&pkt->list, &vsock->pkt_list);
> + skb_queue_tail(&vsock->pkt_queue, skb);
> spin_unlock_bh(&vsock->pkt_list_lock);
>
> queue_work(vsock->workqueue, &vsock->pkt_work);
> @@ -44,21 +44,8 @@ static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
> static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
> {
> struct vsock_loopback *vsock = &the_vsock_loopback;
> - struct virtio_vsock_pkt *pkt, *n;
> - LIST_HEAD(freeme);
>
> - spin_lock_bh(&vsock->pkt_list_lock);
> - list_for_each_entry_safe(pkt, n, &vsock->pkt_list, list) {
> - if (pkt->vsk != vsk)
> - continue;
> - list_move(&pkt->list, &freeme);
> - }
> - spin_unlock_bh(&vsock->pkt_list_lock);
> -
> - list_for_each_entry_safe(pkt, n, &freeme, list) {
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> + virtio_transport_purge_skbs(vsk, &vsock->pkt_queue);
>
> return 0;
> }
> @@ -121,20 +108,20 @@ static void vsock_loopback_work(struct work_struct *work)
> {
> struct vsock_loopback *vsock =
> container_of(work, struct vsock_loopback, pkt_work);
> - LIST_HEAD(pkts);
> + struct sk_buff_head pkts;
> +
> + skb_queue_head_init(&pkts);
>
> spin_lock_bh(&vsock->pkt_list_lock);
> - list_splice_init(&vsock->pkt_list, &pkts);
> + skb_queue_splice_init(&vsock->pkt_queue, &pkts);
> spin_unlock_bh(&vsock->pkt_list_lock);
>
> - while (!list_empty(&pkts)) {
> - struct virtio_vsock_pkt *pkt;
> + while (!skb_queue_empty(&pkts)) {
> + struct sk_buff *skb;
>
> - pkt = list_first_entry(&pkts, struct virtio_vsock_pkt, list);
> - list_del_init(&pkt->list);
> -
> - virtio_transport_deliver_tap_pkt(pkt);
> - virtio_transport_recv_pkt(&loopback_transport, pkt);
> + skb = skb_dequeue(&pkts);
> + virtio_transport_deliver_tap_pkt(skb);
> + virtio_transport_recv_pkt(&loopback_transport, skb);
> }
> }
>
> @@ -148,7 +135,7 @@ static int __init vsock_loopback_init(void)
> return -ENOMEM;
>
> spin_lock_init(&vsock->pkt_list_lock);
> - INIT_LIST_HEAD(&vsock->pkt_list);
> + skb_queue_head_init(&vsock->pkt_queue);
> INIT_WORK(&vsock->pkt_work, vsock_loopback_work);
>
> ret = vsock_core_register(&loopback_transport.transport,
> @@ -166,19 +153,13 @@ static int __init vsock_loopback_init(void)
> static void __exit vsock_loopback_exit(void)
> {
> struct vsock_loopback *vsock = &the_vsock_loopback;
> - struct virtio_vsock_pkt *pkt;
>
> vsock_core_unregister(&loopback_transport.transport);
>
> flush_work(&vsock->pkt_work);
>
> spin_lock_bh(&vsock->pkt_list_lock);
> - while (!list_empty(&vsock->pkt_list)) {
> - pkt = list_first_entry(&vsock->pkt_list,
> - struct virtio_vsock_pkt, list);
> - list_del(&pkt->list);
> - virtio_transport_free_pkt(pkt);
> - }
> + skb_queue_purge(&vsock->pkt_queue);
> spin_unlock_bh(&vsock->pkt_list_lock);
>
> destroy_workqueue(vsock->workqueue);
> --
> 2.35.1
>
CC'ing [email protected]
On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> In order to support usage of qdisc on vsock traffic, this commit
> introduces a struct net_device to vhost and virtio vsock.
>
> Two new devices are created, vhost-vsock for vhost and virtio-vsock
> for virtio. The devices are attached to the respective transports.
>
> To bypass the usage of the device, the user may "down" the associated
> network interface using common tools. For example, "ip link set dev
> virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> simply using the FIFO logic of the prior implementation.
>
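For readers following along, the bypass described above would look roughly
like this (the interface name "virtio-vsock" is what this series assigns;
it is an assumption of this sketch, and these commands require root):

```shell
# Inspect the vsock net_device created by this series (name assumed).
ip link show dev virtio-vsock

# Bring the interface down to bypass qdisc entirely and fall back to
# the legacy FIFO queuing path.
ip link set dev virtio-vsock down

# Bring it back up to re-enable qdisc-based packet scheduling.
ip link set dev virtio-vsock up
```

Since the device is only used for scheduling, no addressing or routing
configuration is involved; link state alone selects the path.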
> For both hosts and guests, there is one device for all G2H vsock sockets
> and one device for all H2G vsock sockets. This makes sense for guests
> because the driver only supports a single vsock channel (one pair of
> TX/RX virtqueues), so a single device and qdisc suffice. For hosts, this
> may not seem ideal for some workloads. However, it is possible to use a
> multi-queue qdisc, where a given queue is responsible for a range of
> sockets. This seems preferable to having one device per socket, which
> could yield a very large number of devices and qdiscs, all dynamically
> created and destroyed. That dynamism would also require a complex policy
> management daemon, as devices would constantly be spun up and torn down
> as sockets were created and destroyed. To avoid this, one device and
> qdisc also applies to all H2G sockets.
>
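As a sketch of what this enables (device name and qdisc choice are
illustrative, not mandated by the series), a host could address the
transport-layer fairness concern by swapping the default pfifo for a
fair-queuing discipline:

```shell
# Replace the default pfifo with fq_codel so datagram flows cannot
# starve stream flows of queue access (device name assumed).
tc qdisc replace dev virtio-vsock root fq_codel

# Verify the attached qdisc.
tc qdisc show dev virtio-vsock
```

A prio or mq qdisc could be substituted here for the priority-based or
multi-queue scenarios the cover letter mentions.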
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> drivers/vhost/vsock.c | 19 +++-
> include/linux/virtio_vsock.h | 10 +++
> net/vmw_vsock/virtio_transport.c | 19 +++-
> net/vmw_vsock/virtio_transport_common.c | 112 +++++++++++++++++++++++-
> 4 files changed, 152 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index f8601d93d94d..b20ddec2664b 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -927,13 +927,30 @@ static int __init vhost_vsock_init(void)
> VSOCK_TRANSPORT_F_H2G);
> if (ret < 0)
> return ret;
> - return misc_register(&vhost_vsock_misc);
> +
> + ret = virtio_transport_init(&vhost_transport, "vhost-vsock");
> + if (ret < 0)
> + goto out_unregister;
> +
> + ret = misc_register(&vhost_vsock_misc);
> + if (ret < 0)
> + goto out_transport_exit;
> + return ret;
> +
> +out_transport_exit:
> + virtio_transport_exit(&vhost_transport);
> +
> +out_unregister:
> + vsock_core_unregister(&vhost_transport.transport);
> + return ret;
> +
> };
>
> static void __exit vhost_vsock_exit(void)
> {
> misc_deregister(&vhost_vsock_misc);
> vsock_core_unregister(&vhost_transport.transport);
> + virtio_transport_exit(&vhost_transport);
> };
>
> module_init(vhost_vsock_init);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 9a37eddbb87a..5d7e7fbd75f8 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -91,10 +91,20 @@ struct virtio_transport {
> /* This must be the first field */
> struct vsock_transport transport;
>
> + /* Used almost exclusively for qdisc */
> + struct net_device *dev;
> +
> /* Takes ownership of the packet */
> int (*send_pkt)(struct sk_buff *skb);
> };
>
> +int
> +virtio_transport_init(struct virtio_transport *t,
> + const char *name);
> +
> +void
> +virtio_transport_exit(struct virtio_transport *t);
> +
> ssize_t
> virtio_transport_stream_dequeue(struct vsock_sock *vsk,
> struct msghdr *msg,
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 3bb293fd8607..c6212eb38d3c 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -131,7 +131,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> * the vq
> */
> if (ret < 0) {
> - skb_queue_head(&vsock->send_pkt_queue, skb);
> + spin_lock_bh(&vsock->send_pkt_queue.lock);
> + __skb_queue_head(&vsock->send_pkt_queue, skb);
> + spin_unlock_bh(&vsock->send_pkt_queue.lock);
> break;
> }
>
> @@ -676,7 +678,9 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
> kfree_skb(skb);
> mutex_unlock(&vsock->tx_lock);
>
> - skb_queue_purge(&vsock->send_pkt_queue);
> + spin_lock_bh(&vsock->send_pkt_queue.lock);
> + __skb_queue_purge(&vsock->send_pkt_queue);
> + spin_unlock_bh(&vsock->send_pkt_queue.lock);
>
> /* Delete virtqueues and flush outstanding callbacks if any */
> vdev->config->del_vqs(vdev);
> @@ -760,6 +764,8 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
> flush_work(&vsock->event_work);
> flush_work(&vsock->send_pkt_work);
>
> + virtio_transport_exit(&virtio_transport);
> +
> mutex_unlock(&the_virtio_vsock_mutex);
>
> kfree(vsock);
> @@ -844,12 +850,18 @@ static int __init virtio_vsock_init(void)
> if (ret)
> goto out_wq;
>
> - ret = register_virtio_driver(&virtio_vsock_driver);
> + ret = virtio_transport_init(&virtio_transport, "virtio-vsock");
> if (ret)
> goto out_vci;
>
> + ret = register_virtio_driver(&virtio_vsock_driver);
> + if (ret)
> + goto out_transport;
> +
> return 0;
>
> +out_transport:
> + virtio_transport_exit(&virtio_transport);
> out_vci:
> vsock_core_unregister(&virtio_transport.transport);
> out_wq:
> @@ -861,6 +873,7 @@ static void __exit virtio_vsock_exit(void)
> {
> unregister_virtio_driver(&virtio_vsock_driver);
> vsock_core_unregister(&virtio_transport.transport);
> + virtio_transport_exit(&virtio_transport);
> destroy_workqueue(virtio_vsock_workqueue);
> }
>
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index d5780599fe93..bdf16fff054f 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -16,6 +16,7 @@
>
> #include <net/sock.h>
> #include <net/af_vsock.h>
> +#include <net/pkt_sched.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/vsock_virtio_transport_common.h>
> @@ -23,6 +24,93 @@
> /* How long to wait for graceful shutdown of a connection */
> #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
>
> +struct virtio_transport_priv {
> + struct virtio_transport *trans;
> +};
> +
> +static netdev_tx_t virtio_transport_start_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct virtio_transport *t =
> + ((struct virtio_transport_priv *)netdev_priv(dev))->trans;
> + int ret;
> +
> + ret = t->send_pkt(skb);
> + if (unlikely(ret == -ENODEV))
> + return NETDEV_TX_BUSY;
> +
> + return NETDEV_TX_OK;
> +}
> +
> +const struct net_device_ops virtio_transport_netdev_ops = {
> + .ndo_start_xmit = virtio_transport_start_xmit,
> +};
> +
> +static void virtio_transport_setup(struct net_device *dev)
> +{
> + dev->netdev_ops = &virtio_transport_netdev_ops;
> + dev->needs_free_netdev = true;
> + dev->flags = IFF_NOARP;
> + dev->mtu = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> +}
> +
> +static int ifup(struct net_device *dev)
> +{
> + int ret;
> +
> + rtnl_lock();
> + ret = dev_open(dev, NULL) ? -ENOMEM : 0;
> + rtnl_unlock();
> +
> + return ret;
> +}
> +
> +/* virtio_transport_init - initialize a virtio vsock transport layer
> + *
> + * @t: ptr to the virtio transport struct to initialize
> + * @name: the name of the net_device to be created.
> + *
> + * Return 0 on success, otherwise negative errno.
> + */
> +int virtio_transport_init(struct virtio_transport *t, const char *name)
> +{
> + struct virtio_transport_priv *priv;
> + int ret;
> +
> + t->dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN, virtio_transport_setup);
> + if (!t->dev)
> + return -ENOMEM;
> +
> + priv = netdev_priv(t->dev);
> + priv->trans = t;
> +
> + ret = register_netdev(t->dev);
> + if (ret < 0)
> + goto out_free_netdev;
> +
> + ret = ifup(t->dev);
> + if (ret < 0)
> + goto out_unregister_netdev;
> +
> + return 0;
> +
> +out_unregister_netdev:
> + unregister_netdev(t->dev);
> +
> +out_free_netdev:
> + free_netdev(t->dev);
> +
> + return ret;
> +}
> +
> +void virtio_transport_exit(struct virtio_transport *t)
> +{
> + if (t->dev) {
> + unregister_netdev(t->dev);
> + free_netdev(t->dev);
> + }
> +}
> +
> static const struct virtio_transport *
> virtio_transport_get_ops(struct vsock_sock *vsk)
> {
> @@ -147,6 +235,24 @@ static u16 virtio_transport_get_type(struct sock *sk)
> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> }
>
> +/* Return pkt->len on success, otherwise negative errno */
> +static int virtio_transport_send_pkt(const struct virtio_transport *t, struct sk_buff *skb)
> +{
> + int ret;
> + int len = skb->len;
> +
> + if (unlikely(!t->dev || !(t->dev->flags & IFF_UP)))
> + return t->send_pkt(skb);
> +
> + skb->dev = t->dev;
> + ret = dev_queue_xmit(skb);
> +
> + if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN))
> + return len;
> +
> + return -ENOMEM;
> +}
> +
> /* This function can only be used on connecting/connected sockets,
> * since a socket assigned to a transport is required.
> *
> @@ -202,9 +308,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>
> virtio_transport_inc_tx_pkt(vvs, skb);
>
> - err = t_ops->send_pkt(skb);
> -
> - return err < 0 ? -ENOMEM : err;
> + return virtio_transport_send_pkt(t_ops, skb);
> }
>
> static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> @@ -834,7 +938,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> return -ENOTCONN;
> }
>
> - return t->send_pkt(reply);
> + return virtio_transport_send_pkt(t, reply);
> }
>
> /* This function should be called with sk_lock held and SOCK_DONE set */
> --
> 2.35.1
>
On Tue, Aug 16, 2022 at 09:00:45AM +0200, Stefano Garzarella wrote:
> Hi Bobby,
..
>
> Please send next versions of this series as RFC until we have at least an
> agreement on the spec changes.
>
> I think it will be better to agree on the spec before merging the Linux changes.
>
> Thanks,
> Stefano
>
Duly noted, I'll tag it as RFC on the next send.
Best,
Bobby
CC'ing [email protected]
On Mon, Aug 15, 2022 at 10:56:07AM -0700, Bobby Eshleman wrote:
> This commit adds a feature bit for virtio vsock to support datagrams.
>
> Signed-off-by: Jiang Wang <[email protected]>
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> drivers/vhost/vsock.c | 3 ++-
> include/uapi/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport.c | 8 ++++++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index b20ddec2664b..a5d1bdb786fe 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -32,7 +32,8 @@
> enum {
> VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> - (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> + (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> + (1ULL << VIRTIO_VSOCK_F_DGRAM)
> };
>
> enum {
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 64738838bee5..857df3a3a70d 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -40,6 +40,7 @@
>
> /* The feature bitmap for virtio vsock */
> #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
> +#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
>
> struct virtio_vsock_config {
> __le64 guest_cid;
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index c6212eb38d3c..073314312683 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
> struct virtio_vsock {
> struct virtio_device *vdev;
> struct virtqueue *vqs[VSOCK_VQ_MAX];
> + bool has_dgram;
>
> /* Virtqueue processing is deferred to a workqueue */
> struct work_struct tx_work;
> @@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> }
>
> vsock->vdev = vdev;
> -
> vsock->rx_buf_nr = 0;
> vsock->rx_buf_max_nr = 0;
> atomic_set(&vsock->queued_replies, 0);
> @@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> vsock->seqpacket_allow = true;
>
> + if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> + vsock->has_dgram = true;
> +
> vdev->priv = vsock;
>
> ret = virtio_vsock_vqs_init(vsock);
> @@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
> };
>
> static unsigned int features[] = {
> - VIRTIO_VSOCK_F_SEQPACKET
> + VIRTIO_VSOCK_F_SEQPACKET,
> + VIRTIO_VSOCK_F_DGRAM
> };
>
> static struct virtio_driver virtio_vsock_driver = {
> --
> 2.35.1
>
On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> In order to support usage of qdisc on vsock traffic, this commit
> introduces a struct net_device to vhost and virtio vsock.
>
> Two new devices are created, vhost-vsock for vhost and virtio-vsock
> for virtio. The devices are attached to the respective transports.
>
> To bypass the usage of the device, the user may "down" the associated
> network interface using common tools. For example, "ip link set dev
> virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> simply using the FIFO logic of the prior implementation.
Ugh. That's quite a hack. Mark my words, at some point we will decide to
have down mean "down". Besides, multiple net devices with link up tend
to confuse userspace. So you might want to keep it down at all times,
even in the short term.
> For both hosts and guests, there is one device for all G2H vsock sockets
> and one device for all H2G vsock sockets. This makes sense for guests
> because the driver only supports a single vsock channel (one pair of
> TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> seem ideal for some workloads. However, it is possible to use a
> multi-queue qdisc, where a given queue is responsible for a range of
> sockets. This seems to be a better solution than having one device per
> socket, which may yield a very large number of devices and qdiscs, all
> of which are dynamically being created and destroyed. Because of this
> dynamism, it would also require a complex policy management daemon, as
> devices would constantly be spun up and down as sockets were created and
> destroyed. To avoid this, one device and qdisc also applies to all H2G
> sockets.
CC'ing [email protected]
On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> This patch supports dgram in virtio and on the vhost side.
>
> Signed-off-by: Jiang Wang <[email protected]>
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> drivers/vhost/vsock.c | 2 +-
> include/net/af_vsock.h | 2 +
> include/uapi/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/af_vsock.c | 26 +++-
> net/vmw_vsock/virtio_transport.c | 2 +-
> net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> 6 files changed, 186 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index a5d1bdb786fe..3dc72a5647ca 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> int ret;
>
> ret = vsock_core_register(&vhost_transport.transport,
> - VSOCK_TRANSPORT_F_H2G);
> + VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> if (ret < 0)
> return ret;
>
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 1c53c4c4d88f..37e55c81e4df 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -78,6 +78,8 @@ struct vsock_sock {
> s64 vsock_stream_has_data(struct vsock_sock *vsk);
> s64 vsock_stream_has_space(struct vsock_sock *vsk);
> struct sock *vsock_create_connected(struct sock *parent);
> +int vsock_bind_stream(struct vsock_sock *vsk,
> + struct sockaddr_vm *addr);
>
> /**** TRANSPORT ****/
>
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 857df3a3a70d..0975b9c88292 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> enum virtio_vsock_type {
> VIRTIO_VSOCK_TYPE_STREAM = 1,
> VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> + VIRTIO_VSOCK_TYPE_DGRAM = 3,
> };
>
> enum virtio_vsock_op {
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 1893f8aafa48..87e4ae1866d3 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> return 0;
> }
>
> +int vsock_bind_stream(struct vsock_sock *vsk,
> + struct sockaddr_vm *addr)
> +{
> + int retval;
> +
> + spin_lock_bh(&vsock_table_lock);
> + retval = __vsock_bind_connectible(vsk, addr);
> + spin_unlock_bh(&vsock_table_lock);
> +
> + return retval;
> +}
> +EXPORT_SYMBOL(vsock_bind_stream);
> +
> static int __vsock_bind_dgram(struct vsock_sock *vsk,
> struct sockaddr_vm *addr)
> {
> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> }
>
> if (features & VSOCK_TRANSPORT_F_DGRAM) {
> - if (t_dgram) {
> - err = -EBUSY;
> - goto err_busy;
> +		/* TODO: always choose the G2H variant over others, support nesting later */
> + if (features & VSOCK_TRANSPORT_F_G2H) {
> + if (t_dgram)
> + pr_warn("virtio_vsock: t_dgram already set\n");
> + t_dgram = t;
> + }
> +
> + if (!t_dgram) {
> + t_dgram = t;
> }
> - t_dgram = t;
> }
>
> if (features & VSOCK_TRANSPORT_F_LOCAL) {
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 073314312683..d4526ca462d2 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> return -ENOMEM;
>
> ret = vsock_core_register(&virtio_transport.transport,
> - VSOCK_TRANSPORT_F_G2H);
> + VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> if (ret)
> goto out_wq;
>
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index bdf16fff054f..aedb48728677 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>
> static u16 virtio_transport_get_type(struct sock *sk)
> {
> - if (sk->sk_type == SOCK_STREAM)
> + if (sk->sk_type == SOCK_DGRAM)
> + return VIRTIO_VSOCK_TYPE_DGRAM;
> + else if (sk->sk_type == SOCK_STREAM)
> return VIRTIO_VSOCK_TYPE_STREAM;
> else
> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> vvs = vsk->trans;
>
> /* we can send less than pkt_len bytes */
> - if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> - pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> + if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> + pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> + else
> + return 0;
> + }
>
> - /* virtio_transport_get_credit might return less than pkt_len credit */
> - pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> + /* virtio_transport_get_credit might return less than pkt_len credit */
> + pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>
> - /* Do not send zero length OP_RW pkt */
> - if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> - return pkt_len;
> + /* Do not send zero length OP_RW pkt */
> + if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> + return pkt_len;
> + }
>
> skb = virtio_transport_alloc_skb(info, pkt_len,
> src_cid, src_port,
> dst_cid, dst_port,
> &err);
> if (!skb) {
> - virtio_transport_put_credit(vvs, pkt_len);
> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> + virtio_transport_put_credit(vvs, pkt_len);
> return err;
> }
>
> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> }
> EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
>
> +static ssize_t
> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> + struct msghdr *msg, size_t len)
> +{
> + struct virtio_vsock_sock *vvs = vsk->trans;
> + struct sk_buff *skb;
> + size_t total = 0;
> + u32 free_space;
> + int err = -EFAULT;
> +
> + spin_lock_bh(&vvs->rx_lock);
> + if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> + skb = __skb_dequeue(&vvs->rx_queue);
> +
> + total = len;
> + if (total > skb->len - vsock_metadata(skb)->off)
> + total = skb->len - vsock_metadata(skb)->off;
> + else if (total < skb->len - vsock_metadata(skb)->off)
> + msg->msg_flags |= MSG_TRUNC;
> +
> + /* sk_lock is held by caller so no one else can dequeue.
> + * Unlock rx_lock since memcpy_to_msg() may sleep.
> + */
> + spin_unlock_bh(&vvs->rx_lock);
> +
> + err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> + if (err)
> + return err;
> +
> + spin_lock_bh(&vvs->rx_lock);
> +
> + virtio_transport_dec_rx_pkt(vvs, skb);
> + consume_skb(skb);
> + }
> +
> + free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> +
> + spin_unlock_bh(&vvs->rx_lock);
> +
> + if (total > 0 && msg->msg_name) {
> + /* Provide the address of the sender. */
> + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> +
> + vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> + le32_to_cpu(vsock_hdr(skb)->src_port));
> + msg->msg_namelen = sizeof(*vm_addr);
> + }
> + return total;
> +}
> +
> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> +{
> + return virtio_transport_stream_has_data(vsk);
> +}
> +
> int
> virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> struct msghdr *msg,
> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> struct msghdr *msg,
> size_t len, int flags)
> {
> - return -EOPNOTSUPP;
> + struct sock *sk;
> + size_t err = 0;
> + long timeout;
> +
> + DEFINE_WAIT(wait);
> +
> + sk = &vsk->sk;
> + err = 0;
> +
> + if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> + return -EOPNOTSUPP;
> +
> + lock_sock(sk);
> +
> + if (!len)
> + goto out;
> +
> + timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> +
> + while (1) {
> + s64 ready;
> +
> + prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> + ready = virtio_transport_dgram_has_data(vsk);
> +
> + if (ready == 0) {
> + if (timeout == 0) {
> + err = -EAGAIN;
> + finish_wait(sk_sleep(sk), &wait);
> + break;
> + }
> +
> + release_sock(sk);
> + timeout = schedule_timeout(timeout);
> + lock_sock(sk);
> +
> + if (signal_pending(current)) {
> + err = sock_intr_errno(timeout);
> + finish_wait(sk_sleep(sk), &wait);
> + break;
> + } else if (timeout == 0) {
> + err = -EAGAIN;
> + finish_wait(sk_sleep(sk), &wait);
> + break;
> + }
> + } else {
> + finish_wait(sk_sleep(sk), &wait);
> +
> + if (ready < 0) {
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> + break;
> + }
> + }
> +out:
> + release_sock(sk);
> + return err;
> }
> EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
>
> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> struct sockaddr_vm *addr)
> {
> - return -EOPNOTSUPP;
> + return vsock_bind_stream(vsk, addr);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>
> bool virtio_transport_dgram_allow(u32 cid, u32 port)
> {
> - return false;
> + return true;
> }
> EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
>
> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> struct msghdr *msg,
> size_t dgram_len)
> {
> - return -EOPNOTSUPP;
> + struct virtio_vsock_pkt_info info = {
> + .op = VIRTIO_VSOCK_OP_RW,
> + .msg = msg,
> + .pkt_len = dgram_len,
> + .vsk = vsk,
> + .remote_cid = remote_addr->svm_cid,
> + .remote_port = remote_addr->svm_port,
> + };
> +
> + return virtio_transport_send_pkt_info(vsk, &info);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>
> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> int err = 0;
>
> + if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> + virtio_transport_recv_enqueue(vsk, skb);
> + sk->sk_data_ready(sk);
> + return err;
> + }
> +
> switch (le16_to_cpu(hdr->op)) {
> case VIRTIO_VSOCK_OP_RW:
> virtio_transport_recv_enqueue(vsk, skb);
> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> static bool virtio_transport_valid_type(u16 type)
> {
> return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> - (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> + (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> + (type == VIRTIO_VSOCK_TYPE_DGRAM);
> }
>
> /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> goto free_pkt;
> }
>
> + if (sk->sk_type == SOCK_DGRAM) {
> + virtio_transport_recv_connected(sk, skb);
> + goto out;
> + }
> +
> space_available = virtio_transport_space_update(sk, skb);
>
> /* Update CID in case it has changed after a transport reset event */
> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> break;
> }
>
> +out:
> release_sock(sk);
>
> /* Release refcnt obtained when we fetched this socket out of the
> --
> 2.35.1
>
On Tue, 16 Aug 2022 12:38:52 -0400 Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > In order to support usage of qdisc on vsock traffic, this commit
> > introduces a struct net_device to vhost and virtio vsock.
> >
> > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > for virtio. The devices are attached to the respective transports.
> >
> > To bypass the usage of the device, the user may "down" the associated
> > network interface using common tools. For example, "ip link set dev
> > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > simply using the FIFO logic of the prior implementation.
>
> Ugh. That's quite a hack. Mark my words, at some point we will decide to
> have down mean "down". Besides, multiple net devices with link up tend
> to confuse userspace. So you might want to keep it down at all times,
> even in the short term.
Agreed!
From a cursory look (and Documentation/ would be nice...) it feels
very wrong to me. Do you know of any uses of a netdev which would
be semantically similar to what you're doing? Treating netdevs as
building blocks for arbitrary message-passing solutions is something
I dislike quite strongly.
Could you recommend where I can learn more about vsocks?
On Tue, Aug 16, 2022 at 12:38:52PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > In order to support usage of qdisc on vsock traffic, this commit
> > introduces a struct net_device to vhost and virtio vsock.
> >
> > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > for virtio. The devices are attached to the respective transports.
> >
> > To bypass the usage of the device, the user may "down" the associated
> > network interface using common tools. For example, "ip link set dev
> > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > simply using the FIFO logic of the prior implementation.
>
> Ugh. That's quite a hack. Mark my words, at some point we will decide to
> have down mean "down". Besides, multiple net devices with link up tend
> to confuse userspace. So you might want to keep it down at all times,
> even in the short term.
>
I have to admit, this choice was born more of perceived necessity than
of love for the design... but I can explain the pain points that led
to the current state, which I hope will spark some discussion.
When the state is down, dev_queue_xmit() will fail. To avoid this and
preserve the "zero-configuration" guarantee of vsock, I chose to make
transmission work regardless of device state by implementing this
"ignore up/down state" hack.
This is unfortunate because what we are really after here is just packet
scheduling, i.e., qdisc. We don't really need the rest of the
net_device, and I don't think up/down buys us anything of value. The
relationship between qdisc and net_device is so tightly knit together
though, that using qdisc without a net_device doesn't look very
practical (and maybe impossible).
Some alternative routes might be:
1) Default the state to up, and let users disable vsock by downing the
device if they'd like. It still works out-of-the-box, but if users
really want to disable vsock they may.
2) vsock may simply turn the device to state up when a socket is first
used. For instance, the HCI device in net/bluetooth/hci_* uses a
technique where the net_device is turned to up when bind() is called on
any HCI socket (BTPROTO_HCI). It can also be turned up/down via
ioctl().
3) Modify net_device registration to allow us to have an invisible
device that is only known by the kernel. It may default to up and remain
unchanged. The qdisc controls alone may be exposed to userspace,
hopefully via netlink to still work with tc. This is not
currently supported by register_netdevice(), but a series from 2007 was
sent to the ML and tentatively approved in concept, though never merged[1].
4) Currently NETDEV_UP/NETDEV_DOWN commands can't be vetoed.
NETDEV_PRE_UP, however, is used to effectively veto NETDEV_UP
commands[2]. We could introduce NETDEV_PRE_DOWN to support vetoing of
NETDEV_DOWN. This would allow us to install a hook to determine if
we actually want to allow the device to be downed.
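To make option 4 concrete, here is a rough userspace sketch of a vetoable PRE_DOWN hook (hypothetical names; the real mechanism would be a netdev notifier returning NOTIFY_BAD, the way NETDEV_PRE_UP handlers veto NETDEV_UP today):

```python
# Hypothetical simulation of a vetoable NETDEV_PRE_DOWN notifier chain.
# NOTIFY_OK / NOTIFY_BAD mirror the kernel's notifier return codes.
NOTIFY_OK, NOTIFY_BAD = 0x0000, 0x8000

class NetDevice:
    def __init__(self, name):
        self.name = name
        self.up = True

def vsock_pre_down(dev):
    """vsock's hook: veto DOWN so zero-config transmit keeps working."""
    if dev.name.startswith("virtio-vsock"):
        return NOTIFY_BAD
    return NOTIFY_OK

def dev_down(dev, pre_down_chain):
    """Run all PRE_DOWN hooks; any NOTIFY_BAD vetoes the transition."""
    if any(hook(dev) == NOTIFY_BAD for hook in pre_down_chain):
        return False          # vetoed: device stays up
    dev.up = False
    return True
```

With the hook installed, `dev_down(NetDevice("virtio-vsock0"), [vsock_pre_down])` returns False and the device stays up, while unrelated devices go down normally.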
In an ideal world, we could simply pass a set of netdev queues, a
packet, and maybe a blob of state to qdisc and let it work its
scheduling magic...
Any thoughts?
[1]: https://lore.kernel.org/netdev/20070129140958.0cf6880f@freekitty/
[2]: https://lore.kernel.org/all/[email protected]/
Thanks,
Bobby
On Tue, Aug 16, 2022 at 11:07:17AM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 12:38:52 -0400 Michael S. Tsirkin wrote:
> > On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > > In order to support usage of qdisc on vsock traffic, this commit
> > > introduces a struct net_device to vhost and virtio vsock.
> > >
> > > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > > for virtio. The devices are attached to the respective transports.
> > >
> > > To bypass the usage of the device, the user may "down" the associated
> > > network interface using common tools. For example, "ip link set dev
> > > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > > simply using the FIFO logic of the prior implementation.
> >
> > Ugh. That's quite a hack. Mark my words, at some point we will decide to
> > have down mean "down". Besides, multiple net devices with link up tend
> > to confuse userspace. So might want to keep it down at all times
> > even short term.
>
> Agreed!
>
> From a cursory look (and Documentation/ would be nice..) it feels
> very wrong to me. Do you know of any uses of a netdev which would
> be semantically similar to what you're doing? Treating netdevs as
> buildings blocks for arbitrary message passing solutions is something
> I dislike quite strongly.
The big difference between vsock and "arbitrary message passing" is that
vsock is actually constrained by the virtio device that backs it (made
up of virtqueues and the underlying protocol). That virtqueue pair is
acting like the queues on a physical NIC, so it actually makes sense to
manage the queueing of vsock's device like we would manage the queueing
of a real device.
Still, I concede that ignoring the netdev state is probably a bad idea.
That said, I also think that using packet scheduling in vsock is a good
idea, and that ideally we can reuse Linux's already robust library of
packet scheduling algorithms by introducing qdisc somehow.
>
> Could you recommend where I can learn more about vsocks?
I think the spec is probably the best place to start[1].
[1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
Best,
Bobby
On Tue, 16 Aug 2022 07:02:33 +0000 Bobby Eshleman wrote:
> > From a cursory look (and Documentation/ would be nice..) it feels
> > very wrong to me. Do you know of any uses of a netdev which would
> > be semantically similar to what you're doing? Treating netdevs as
> > buildings blocks for arbitrary message passing solutions is something
> > I dislike quite strongly.
>
> The big difference between vsock and "arbitrary message passing" is that
> vsock is actually constrained by the virtio device that backs it (made
> up of virtqueues and the underlying protocol). That virtqueue pair is
> acting like the queues on a physical NIC, so it actually makes sense to
> manage the queuing of vsock's device like we would manage the queueing
> of a real device.
>
> Still, I concede that ignoring the netdev state is a probably bad idea.
>
> That said, I also think that using packet scheduling in vsock is a good
> idea, and that ideally we can reuse Linux's already robust library of
> packet scheduling algorithms by introducing qdisc somehow.
We've been burnt in the past by people doing the "let me just pick
these useful pieces out of netdev" thing. Makes life hard both for
maintainers and users trying to make sense of the interfaces.
What comes to mind if you're just after queuing is that we already
bastardized the CoDel implementation (include/net/codel_impl.h).
If CoDel is good enough for you maybe that's the easiest way?
Although I suspect that you're after fairness not early drops.
Wireless folks use CoDel as a second layer queuing. (CC: Toke)
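For readers unfamiliar with it, the heart of CoDel (the algorithm behind include/net/codel_impl.h) is its control law: once packet sojourn time has exceeded the target for a full interval, it drops a packet and schedules the next drop at interval/sqrt(count), so drops accelerate while the queue stays bad. A minimal sketch of just that law (not the kernel implementation):

```python
import math

INTERVAL_MS = 100.0  # CoDel's default interval (RFC 8289 uses 100 ms)

def next_drop_interval(count):
    """Time until the next drop after `count` consecutive drops.

    The 1/sqrt(count) control law drives throughput toward a
    sqrt-proportional backoff, shrinking the gap between drops
    the longer the standing queue persists.
    """
    return INTERVAL_MS / math.sqrt(count)
```

Successive intervals come out as 100, ~70.7, ~57.7, 50 ms and so on, which is the "early drop" pressure Jakub alludes to, as opposed to the fairness scheduling fq provides.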
> > Could you recommend where I can learn more about vsocks?
>
> I think the spec is probably the best place to start[1].
>
> [1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
Eh, I was hoping it was a side channel of an existing virtio_net
which is not the case. Given the zero-config requirement IDK if
we'll be able to fit this into netdev semantics :(
On Tue, Aug 16, 2022 at 04:07:55PM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 07:02:33 +0000 Bobby Eshleman wrote:
> > > From a cursory look (and Documentation/ would be nice..) it feels
> > > very wrong to me. Do you know of any uses of a netdev which would
> > > be semantically similar to what you're doing? Treating netdevs as
> > > buildings blocks for arbitrary message passing solutions is something
> > > I dislike quite strongly.
> >
> > The big difference between vsock and "arbitrary message passing" is that
> > vsock is actually constrained by the virtio device that backs it (made
> > up of virtqueues and the underlying protocol). That virtqueue pair is
> > acting like the queues on a physical NIC, so it actually makes sense to
> > manage the queuing of vsock's device like we would manage the queueing
> > of a real device.
> >
> > Still, I concede that ignoring the netdev state is a probably bad idea.
> >
> > That said, I also think that using packet scheduling in vsock is a good
> > idea, and that ideally we can reuse Linux's already robust library of
> > packet scheduling algorithms by introducing qdisc somehow.
>
> We've been burnt in the past by people doing the "let me just pick
> these useful pieces out of netdev" thing. Makes life hard both for
> maintainers and users trying to make sense of the interfaces.
>
> What comes to mind if you're just after queuing is that we already
> bastardized the CoDel implementation (include/net/codel_impl.h).
> If CoDel is good enough for you maybe that's the easiest way?
> Although I suspect that you're after fairness not early drops.
> Wireless folks use CoDel as a second layer queuing. (CC: Toke)
>
That is certainly interesting to me. Sitting next to "codel_impl.h" is
"include/net/fq_impl.h", and it looks like it may solve the datagram
flooding issue. The downside to this approach is the baking of a
specific policy into vsock... which I don't exactly love either.
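The reason fq_impl.h looks attractive for the flooding issue can be shown with a toy version of the idea: hash each packet to a per-flow queue and serve flows round-robin, so a flooding datagram flow cannot starve a stream. This is a hypothetical illustration of the structure, not the kernel API:

```python
from collections import OrderedDict, deque

class ToyFq:
    """Hash each packet to a per-flow queue; dequeue flows round-robin."""
    def __init__(self):
        self.flows = OrderedDict()   # flow key -> deque of packets

    def enqueue(self, flow, pkt):
        self.flows.setdefault(flow, deque()).append(pkt)

    def dequeue(self):
        for key in list(self.flows):
            q = self.flows[key]
            if q:
                pkt = q.popleft()
                # Rotate the flow to the back so others get a turn.
                self.flows.move_to_end(key)
                return pkt
        return None
```

With 1000 queued datagrams from a "flood" flow and 3 packets from a "stream" flow, the stream's packets are all served within the first six dequeues instead of waiting behind the flood.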
I'm not seeing many other of these qdisc bastardizations in
include/net; are there any others that you are aware of?
> > > Could you recommend where I can learn more about vsocks?
> >
> > I think the spec is probably the best place to start[1].
> >
> > [1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
>
> Eh, I was hoping it was a side channel of an existing virtio_net
> which is not the case. Given the zero-config requirement IDK if
> we'll be able to fit this into netdev semantics :(
It's certainly possible that it may not fit :/ I feel that it partially
depends on what we mean by zero-config. Is it "no config required to
have a working socket" or is it "no config required, but also no
tuning/policy/etc... supported"?
Best,
Bobby
On Tue, 16 Aug 2022 08:29:04 +0000 Bobby Eshleman wrote:
> > We've been burnt in the past by people doing the "let me just pick
> > these useful pieces out of netdev" thing. Makes life hard both for
> > maintainers and users trying to make sense of the interfaces.
> >
> > What comes to mind if you're just after queuing is that we already
> > bastardized the CoDel implementation (include/net/codel_impl.h).
> > If CoDel is good enough for you maybe that's the easiest way?
> > Although I suspect that you're after fairness not early drops.
> > Wireless folks use CoDel as a second layer queuing. (CC: Toke)
>
> That is certainly interesting to me. Sitting next to "codel_impl.h" is
> "include/net/fq_impl.h", and it looks like it may solve the datagram
> flooding issue. The downside to this approach is the baking of a
> specific policy into vsock... which I don't exactly love either.
>
> I'm not seeing too many other of these qdisc bastardizations in
> include/net, are there any others that you are aware of?
Just what wireless uses (so codel and fq as you found out), nothing
else comes to mind.
> > Eh, I was hoping it was a side channel of an existing virtio_net
> > which is not the case. Given the zero-config requirement IDK if
> > we'll be able to fit this into netdev semantics :(
>
> It's certainly possible that it may not fit :/ I feel that it partially
> depends on what we mean by zero-config. Is it "no config required to
> have a working socket" or is it "no config required, but also no
> tuning/policy/etc... supported"?
The value of tuning vs confusion of a strange netdev floating around
in the system is hard to estimate upfront.
The nice thing about using a built-in fq with no user visible knobs is
that there's no extra uAPI. We can always rip it out and replace later.
And it shouldn't be controversial, making the path to upstream smoother.
On Tue, Aug 16, 2022 at 4:07 PM Jakub Kicinski <[email protected]> wrote:
>
> On Tue, 16 Aug 2022 07:02:33 +0000 Bobby Eshleman wrote:
> > > From a cursory look (and Documentation/ would be nice..) it feels
> > > very wrong to me. Do you know of any uses of a netdev which would
> > > be semantically similar to what you're doing? Treating netdevs as
> > > buildings blocks for arbitrary message passing solutions is something
> > > I dislike quite strongly.
> >
> > The big difference between vsock and "arbitrary message passing" is that
> > vsock is actually constrained by the virtio device that backs it (made
> > up of virtqueues and the underlying protocol). That virtqueue pair is
> > acting like the queues on a physical NIC, so it actually makes sense to
> > manage the queuing of vsock's device like we would manage the queueing
> > of a real device.
> >
> > Still, I concede that ignoring the netdev state is a probably bad idea.
> >
> > That said, I also think that using packet scheduling in vsock is a good
> > idea, and that ideally we can reuse Linux's already robust library of
> > packet scheduling algorithms by introducing qdisc somehow.
>
> We've been burnt in the past by people doing the "let me just pick
> these useful pieces out of netdev" thing. Makes life hard both for
> maintainers and users trying to make sense of the interfaces.
I interpret this in a different way: we just believe "one size does
not fit all",
as most Linux kernel developers do. I am very surprised you don't.
Feel free to suggest any other ways, eventually you will need to
reimplement TC one way or the other.
If you think about it in another way, vsock is networking too, its name
contains a "sock", do I need to say more? :)
>
> What comes to mind if you're just after queuing is that we already
> bastardized the CoDel implementation (include/net/codel_impl.h).
> If CoDel is good enough for you maybe that's the easiest way?
> Although I suspect that you're after fairness not early drops.
> Wireless folks use CoDel as a second layer queuing. (CC: Toke)
What makes you believe CoDel fits all cases? If it really does, you
probably have to convince Toke to give up his idea on XDP map
as it would no longer make any sense. I don't see you raise such
an argument there... What makes you treat this differently with XDP
map? I am very curious about your thought process here. ;-)
Thanks.
On 16.08.2022 05:32, Bobby Eshleman wrote:
> CC'ing [email protected]
>
> On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
>> This patch supports dgram in virtio and on the vhost side.
Hello,
sorry, I don't understand how this maintains message boundaries. Or is
that unnecessary for SOCK_DGRAM?
Thanks
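For context on the boundary question: SOCK_DGRAM semantics mean each recvmsg() returns at most one sendmsg()'s worth of data (truncating, with MSG_TRUNC set, if the user buffer is short), so boundaries fall out of the one-skb-per-datagram queuing rather than any explicit framing in the header. The same guarantee can be observed with a plain UDP socket pair (an analogy, not vsock itself):

```python
import socket

# Loopback UDP pair: datagram boundaries are preserved per recv().
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

tx.sendto(b"first message", rx.getsockname())
tx.sendto(b"second", rx.getsockname())

# Each recv() returns exactly one datagram, never a concatenation,
# even though the buffer could hold both.
first = rx.recv(4096)
second = rx.recv(4096)
tx.close()
rx.close()
```

A stream socket in the same situation could legally return b"first messagesecond" from a single read; the datagram type may not.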
>>
>> Signed-off-by: Jiang Wang <[email protected]>
>> Signed-off-by: Bobby Eshleman <[email protected]>
>> ---
>> drivers/vhost/vsock.c | 2 +-
>> include/net/af_vsock.h | 2 +
>> include/uapi/linux/virtio_vsock.h | 1 +
>> net/vmw_vsock/af_vsock.c | 26 +++-
>> net/vmw_vsock/virtio_transport.c | 2 +-
>> net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
>> 6 files changed, 186 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index a5d1bdb786fe..3dc72a5647ca 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
>> int ret;
>>
>> ret = vsock_core_register(&vhost_transport.transport,
>> - VSOCK_TRANSPORT_F_H2G);
>> + VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
>> if (ret < 0)
>> return ret;
>>
>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> index 1c53c4c4d88f..37e55c81e4df 100644
>> --- a/include/net/af_vsock.h
>> +++ b/include/net/af_vsock.h
>> @@ -78,6 +78,8 @@ struct vsock_sock {
>> s64 vsock_stream_has_data(struct vsock_sock *vsk);
>> s64 vsock_stream_has_space(struct vsock_sock *vsk);
>> struct sock *vsock_create_connected(struct sock *parent);
>> +int vsock_bind_stream(struct vsock_sock *vsk,
>> + struct sockaddr_vm *addr);
>>
>> /**** TRANSPORT ****/
>>
>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> index 857df3a3a70d..0975b9c88292 100644
>> --- a/include/uapi/linux/virtio_vsock.h
>> +++ b/include/uapi/linux/virtio_vsock.h
>> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
>> enum virtio_vsock_type {
>> VIRTIO_VSOCK_TYPE_STREAM = 1,
>> VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
>> + VIRTIO_VSOCK_TYPE_DGRAM = 3,
>> };
>>
>> enum virtio_vsock_op {
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index 1893f8aafa48..87e4ae1866d3 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>> return 0;
>> }
>>
>> +int vsock_bind_stream(struct vsock_sock *vsk,
>> + struct sockaddr_vm *addr)
>> +{
>> + int retval;
>> +
>> + spin_lock_bh(&vsock_table_lock);
>> + retval = __vsock_bind_connectible(vsk, addr);
>> + spin_unlock_bh(&vsock_table_lock);
>> +
>> + return retval;
>> +}
>> +EXPORT_SYMBOL(vsock_bind_stream);
>> +
>> static int __vsock_bind_dgram(struct vsock_sock *vsk,
>> struct sockaddr_vm *addr)
>> {
>> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>> }
>>
>> if (features & VSOCK_TRANSPORT_F_DGRAM) {
>> - if (t_dgram) {
>> - err = -EBUSY;
>> - goto err_busy;
>> + /* TODO: always chose the G2H variant over others, support nesting later */
>> + if (features & VSOCK_TRANSPORT_F_G2H) {
>> + if (t_dgram)
>> + pr_warn("virtio_vsock: t_dgram already set\n");
>> + t_dgram = t;
>> + }
>> +
>> + if (!t_dgram) {
>> + t_dgram = t;
>> }
>> - t_dgram = t;
>> }
>>
>> if (features & VSOCK_TRANSPORT_F_LOCAL) {
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 073314312683..d4526ca462d2 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
>> return -ENOMEM;
>>
>> ret = vsock_core_register(&virtio_transport.transport,
>> - VSOCK_TRANSPORT_F_G2H);
>> + VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
>> if (ret)
>> goto out_wq;
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index bdf16fff054f..aedb48728677 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>>
>> static u16 virtio_transport_get_type(struct sock *sk)
>> {
>> - if (sk->sk_type == SOCK_STREAM)
>> + if (sk->sk_type == SOCK_DGRAM)
>> + return VIRTIO_VSOCK_TYPE_DGRAM;
>> + else if (sk->sk_type == SOCK_STREAM)
>> return VIRTIO_VSOCK_TYPE_STREAM;
>> else
>> return VIRTIO_VSOCK_TYPE_SEQPACKET;
>> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>> vvs = vsk->trans;
>>
>> /* we can send less than pkt_len bytes */
>> - if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
>> - pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>> + if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
>> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
>> + pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>> + else
>> + return 0;
>> + }
>>
>> - /* virtio_transport_get_credit might return less than pkt_len credit */
>> - pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
>> + /* virtio_transport_get_credit might return less than pkt_len credit */
>> + pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>>
>> - /* Do not send zero length OP_RW pkt */
>> - if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>> - return pkt_len;
>> + /* Do not send zero length OP_RW pkt */
>> + if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>> + return pkt_len;
>> + }
>>
>> skb = virtio_transport_alloc_skb(info, pkt_len,
>> src_cid, src_port,
>> dst_cid, dst_port,
>> &err);
>> if (!skb) {
>> - virtio_transport_put_credit(vvs, pkt_len);
>> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
>> + virtio_transport_put_credit(vvs, pkt_len);
>> return err;
>> }
>>
>> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
>> }
>> EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
>>
>> +static ssize_t
>> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
>> + struct msghdr *msg, size_t len)
>> +{
>> + struct virtio_vsock_sock *vvs = vsk->trans;
>> + struct sk_buff *skb;
>> + size_t total = 0;
>> + u32 free_space;
>> + int err = -EFAULT;
>> +
>> + spin_lock_bh(&vvs->rx_lock);
>> + if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
>> + skb = __skb_dequeue(&vvs->rx_queue);
>> +
>> + total = len;
>> + if (total > skb->len - vsock_metadata(skb)->off)
>> + total = skb->len - vsock_metadata(skb)->off;
>> + else if (total < skb->len - vsock_metadata(skb)->off)
>> + msg->msg_flags |= MSG_TRUNC;
>> +
>> + /* sk_lock is held by caller so no one else can dequeue.
>> + * Unlock rx_lock since memcpy_to_msg() may sleep.
>> + */
>> + spin_unlock_bh(&vvs->rx_lock);
>> +
>> + err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
>> + if (err)
>> + return err;
>> +
>> + spin_lock_bh(&vvs->rx_lock);
>> +
>> + virtio_transport_dec_rx_pkt(vvs, skb);
>> + consume_skb(skb);
>> + }
>> +
>> + free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
>> +
>> + spin_unlock_bh(&vvs->rx_lock);
>> +
>> + if (total > 0 && msg->msg_name) {
>> + /* Provide the address of the sender. */
>> + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
>> +
>> + vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
>> + le32_to_cpu(vsock_hdr(skb)->src_port));
>> + msg->msg_namelen = sizeof(*vm_addr);
>> + }
>> + return total;
>> +}
>> +
>> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
>> +{
>> + return virtio_transport_stream_has_data(vsk);
>> +}
>> +
>> int
>> virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
>> struct msghdr *msg,
>> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
>> struct msghdr *msg,
>> size_t len, int flags)
>> {
>> - return -EOPNOTSUPP;
>> + struct sock *sk;
>> + size_t err = 0;
>> + long timeout;
>> +
>> + DEFINE_WAIT(wait);
>> +
>> + sk = &vsk->sk;
>> + err = 0;
>> +
>> + if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
>> + return -EOPNOTSUPP;
>> +
>> + lock_sock(sk);
>> +
>> + if (!len)
>> + goto out;
>> +
>> + timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
>> +
>> + while (1) {
>> + s64 ready;
>> +
>> + prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>> + ready = virtio_transport_dgram_has_data(vsk);
>> +
>> + if (ready == 0) {
>> + if (timeout == 0) {
>> + err = -EAGAIN;
>> + finish_wait(sk_sleep(sk), &wait);
>> + break;
>> + }
>> +
>> + release_sock(sk);
>> + timeout = schedule_timeout(timeout);
>> + lock_sock(sk);
>> +
>> + if (signal_pending(current)) {
>> + err = sock_intr_errno(timeout);
>> + finish_wait(sk_sleep(sk), &wait);
>> + break;
>> + } else if (timeout == 0) {
>> + err = -EAGAIN;
>> + finish_wait(sk_sleep(sk), &wait);
>> + break;
>> + }
>> + } else {
>> + finish_wait(sk_sleep(sk), &wait);
>> +
>> + if (ready < 0) {
>> + err = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
>> + break;
>> + }
>> + }
>> +out:
>> + release_sock(sk);
>> + return err;
>> }
>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
>>
>> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>> struct sockaddr_vm *addr)
>> {
>> - return -EOPNOTSUPP;
>> + return vsock_bind_stream(vsk, addr);
>> }
>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>>
>> bool virtio_transport_dgram_allow(u32 cid, u32 port)
>> {
>> - return false;
>> + return true;
>> }
>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
>>
>> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>> struct msghdr *msg,
>> size_t dgram_len)
>> {
>> - return -EOPNOTSUPP;
>> + struct virtio_vsock_pkt_info info = {
>> + .op = VIRTIO_VSOCK_OP_RW,
>> + .msg = msg,
>> + .pkt_len = dgram_len,
>> + .vsk = vsk,
>> + .remote_cid = remote_addr->svm_cid,
>> + .remote_port = remote_addr->svm_port,
>> + };
>> +
>> + return virtio_transport_send_pkt_info(vsk, &info);
>> }
>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>>
>> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
>> struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>> int err = 0;
>>
>> + if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
>> + virtio_transport_recv_enqueue(vsk, skb);
>> + sk->sk_data_ready(sk);
>> + return err;
>> + }
>> +
>> switch (le16_to_cpu(hdr->op)) {
>> case VIRTIO_VSOCK_OP_RW:
>> virtio_transport_recv_enqueue(vsk, skb);
>> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>> static bool virtio_transport_valid_type(u16 type)
>> {
>> return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
>> - (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
>> + (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
>> + (type == VIRTIO_VSOCK_TYPE_DGRAM);
>> }
>>
>> /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
>> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>> goto free_pkt;
>> }
>>
>> + if (sk->sk_type == SOCK_DGRAM) {
>> + virtio_transport_recv_connected(sk, skb);
>> + goto out;
>> + }
>> +
>> space_available = virtio_transport_space_update(sk, skb);
>>
>> /* Update CID in case it has changed after a transport reset event */
>> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>> break;
>> }
>>
>> +out:
>> release_sock(sk);
>>
>> /* Release refcnt obtained when we fetched this socket out of the
>> --
>> 2.35.1
>>
>
On 17.08.2022 08:01, Arseniy Krasnov wrote:
> On 16.08.2022 05:32, Bobby Eshleman wrote:
>> CC'ing [email protected]
>>
>> On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
>>> This patch supports dgram in virtio and on the vhost side.
> Hello,
>
> sorry, i don't understand, how this maintains message boundaries? Or it
> is unnecessary for SOCK_DGRAM?
>
> Thanks
>>> + break;
>>> + }
>>> + }
>>> +out:
>>> + release_sock(sk);
>>> + return err;
>>> }
>>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
^^^
Maybe this generic data-waiting logic should live in af_vsock.c, as it does for stream/seqpacket?
That way, another transport that supports SOCK_DGRAM could reuse it.
>>>
>>> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>>> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>>> struct sockaddr_vm *addr)
>>> {
>>> - return -EOPNOTSUPP;
>>> + return vsock_bind_stream(vsk, addr);
>>> }
>>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>>>
>>> bool virtio_transport_dgram_allow(u32 cid, u32 port)
>>> {
>>> - return false;
>>> + return true;
>>> }
>>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
>>>
>>> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>>> struct msghdr *msg,
>>> size_t dgram_len)
>>> {
>>> - return -EOPNOTSUPP;
>>> + struct virtio_vsock_pkt_info info = {
>>> + .op = VIRTIO_VSOCK_OP_RW,
>>> + .msg = msg,
>>> + .pkt_len = dgram_len,
>>> + .vsk = vsk,
>>> + .remote_cid = remote_addr->svm_cid,
>>> + .remote_port = remote_addr->svm_port,
>>> + };
>>> +
>>> + return virtio_transport_send_pkt_info(vsk, &info);
>>> }
>>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>>>
>>> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
>>> struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>>> int err = 0;
>>>
>>> + if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
>>> + virtio_transport_recv_enqueue(vsk, skb);
>>> + sk->sk_data_ready(sk);
>>> + return err;
>>> + }
>>> +
>>> switch (le16_to_cpu(hdr->op)) {
>>> case VIRTIO_VSOCK_OP_RW:
>>> virtio_transport_recv_enqueue(vsk, skb);
>>> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>>> static bool virtio_transport_valid_type(u16 type)
>>> {
>>> return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
>>> - (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
>>> + (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
>>> + (type == VIRTIO_VSOCK_TYPE_DGRAM);
>>> }
>>>
>>> /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
>>> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>>> goto free_pkt;
>>> }
>>>
>>> + if (sk->sk_type == SOCK_DGRAM) {
>>> + virtio_transport_recv_connected(sk, skb);
>>> + goto out;
>>> + }
>>> +
>>> space_available = virtio_transport_space_update(sk, skb);
>>>
>>> /* Update CID in case it has changed after a transport reset event */
>>> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>>> break;
>>> }
>>>
>>> +out:
>>> release_sock(sk);
>>>
>>> /* Release refcnt obtained when we fetched this socket out of the
>>> --
>>> 2.35.1
>>>
>>
>
On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> Hey everybody,
>
> This series introduces datagrams, packet scheduling, and sk_buff usage
> to virtio vsock.
>
> The usage of struct sk_buff benefits users by a) preparing vsock to use
> other related systems that require sk_buff, such as sockmap and qdisc,
> b) supporting basic congestion control via sock_alloc_send_skb, and c)
> reducing copying when delivering packets to TAP.
>
> The socket layer no longer forces errors to be -ENOMEM, as typically
> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> messages are being sent with option MSG_DONTWAIT.
>
> The datagram work is based off previous patches by Jiang Wang[1].
>
> The introduction of datagrams creates a transport layer fairness issue
> where datagrams may freely starve streams of queue access. This happens
> because, unlike streams, datagrams lack the transactions necessary for
> calculating credits and throttling.
>
> Previous proposals introduce changes to the spec to add an additional
> virtqueue pair for datagrams[1]. Although this solution works, using
> Linux's qdisc for packet scheduling leverages already existing systems,
> avoids the need to change the virtio specification, and gives additional
> capabilities. The usage of SFQ or fq_codel, for example, may solve the
> transport layer starvation problem. It is easy to imagine other use
> cases as well. For example, services of varying importance may be
> assigned different priorities, and qdisc will apply appropriate
> priority-based scheduling. By default, the system default pfifo qdisc is
> used. The qdisc may be bypassed and legacy queuing is resumed by simply
> setting the virtio-vsock%d network device to state DOWN. This technique
> still allows vsock to work with zero-configuration.
The basic question to answer, then, is this: with a net device, qdisc,
etc. in the picture, how is this different from virtio-net?
Why do you still want to use vsock?
> In summary, this series introduces these major changes to vsock:
>
> - virtio vsock supports datagrams
> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
> - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
> which applies the throttling threshold sk_sndbuf.
> - The vsock socket layer supports returning errors other than -ENOMEM.
> - This is used to return -EAGAIN when the sk_sndbuf threshold is
> reached.
> - virtio vsock uses a net_device, through which qdisc may be used.
> - qdisc allows scheduling policies to be applied to vsock flows.
> - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
> it may avoid datagrams from flooding out stream flows. The benefit
> to this is that additional virtqueues are not needed for datagrams.
> - The net_device and qdisc is bypassed by simply setting the
> net_device state to DOWN.
>
> [1]: https://lore.kernel.org/all/[email protected]/
>
> Bobby Eshleman (5):
> vsock: replace virtio_vsock_pkt with sk_buff
> vsock: return errors other than -ENOMEM to socket
> vsock: add netdev to vhost/virtio vsock
> virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> virtio/vsock: add support for dgram
>
> Jiang Wang (1):
> vsock_test: add tests for vsock dgram
>
> drivers/vhost/vsock.c | 238 ++++----
> include/linux/virtio_vsock.h | 73 ++-
> include/net/af_vsock.h | 2 +
> include/uapi/linux/virtio_vsock.h | 2 +
> net/vmw_vsock/af_vsock.c | 30 +-
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 237 +++++---
> net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> net/vmw_vsock/vmci_transport.c | 9 +-
> net/vmw_vsock/vsock_loopback.c | 51 +-
> tools/testing/vsock/util.c | 105 ++++
> tools/testing/vsock/util.h | 4 +
> tools/testing/vsock/vsock_test.c | 195 ++++++
> 13 files changed, 1176 insertions(+), 543 deletions(-)
>
> --
> 2.35.1
On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > On 16.08.2022 05:32, Bobby Eshleman wrote:
> >> CC'ing [email protected]
> >>
> >> On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> >>> This patch supports dgram in virtio and on the vhost side.
> > Hello,
> >
> > sorry, I don't understand how this maintains message boundaries? Or is
> > it unnecessary for SOCK_DGRAM?
> >
> > Thanks
> >>>
> >>> Signed-off-by: Jiang Wang <[email protected]>
> >>> Signed-off-by: Bobby Eshleman <[email protected]>
> >>> ---
> >>> drivers/vhost/vsock.c | 2 +-
> >>> include/net/af_vsock.h | 2 +
> >>> include/uapi/linux/virtio_vsock.h | 1 +
> >>> net/vmw_vsock/af_vsock.c | 26 +++-
> >>> net/vmw_vsock/virtio_transport.c | 2 +-
> >>> net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> >>> 6 files changed, 186 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >>> index a5d1bdb786fe..3dc72a5647ca 100644
> >>> --- a/drivers/vhost/vsock.c
> >>> +++ b/drivers/vhost/vsock.c
> >>> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> >>> int ret;
> >>>
> >>> ret = vsock_core_register(&vhost_transport.transport,
> >>> - VSOCK_TRANSPORT_F_H2G);
> >>> + VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> >>> if (ret < 0)
> >>> return ret;
> >>>
> >>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >>> index 1c53c4c4d88f..37e55c81e4df 100644
> >>> --- a/include/net/af_vsock.h
> >>> +++ b/include/net/af_vsock.h
> >>> @@ -78,6 +78,8 @@ struct vsock_sock {
> >>> s64 vsock_stream_has_data(struct vsock_sock *vsk);
> >>> s64 vsock_stream_has_space(struct vsock_sock *vsk);
> >>> struct sock *vsock_create_connected(struct sock *parent);
> >>> +int vsock_bind_stream(struct vsock_sock *vsk,
> >>> + struct sockaddr_vm *addr);
> >>>
> >>> /**** TRANSPORT ****/
> >>>
> >>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> >>> index 857df3a3a70d..0975b9c88292 100644
> >>> --- a/include/uapi/linux/virtio_vsock.h
> >>> +++ b/include/uapi/linux/virtio_vsock.h
> >>> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> >>> enum virtio_vsock_type {
> >>> VIRTIO_VSOCK_TYPE_STREAM = 1,
> >>> VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> >>> + VIRTIO_VSOCK_TYPE_DGRAM = 3,
> >>> };
> >>>
> >>> enum virtio_vsock_op {
> >>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >>> index 1893f8aafa48..87e4ae1866d3 100644
> >>> --- a/net/vmw_vsock/af_vsock.c
> >>> +++ b/net/vmw_vsock/af_vsock.c
> >>> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> >>> return 0;
> >>> }
> >>>
> >>> +int vsock_bind_stream(struct vsock_sock *vsk,
> >>> + struct sockaddr_vm *addr)
> >>> +{
> >>> + int retval;
> >>> +
> >>> + spin_lock_bh(&vsock_table_lock);
> >>> + retval = __vsock_bind_connectible(vsk, addr);
> >>> + spin_unlock_bh(&vsock_table_lock);
> >>> +
> >>> + return retval;
> >>> +}
> >>> +EXPORT_SYMBOL(vsock_bind_stream);
> >>> +
> >>> static int __vsock_bind_dgram(struct vsock_sock *vsk,
> >>> struct sockaddr_vm *addr)
> >>> {
> >>> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> >>> }
> >>>
> >>> if (features & VSOCK_TRANSPORT_F_DGRAM) {
> >>> - if (t_dgram) {
> >>> - err = -EBUSY;
> >>> - goto err_busy;
>>> + /* TODO: always choose the G2H variant over others, support nesting later */
> >>> + if (features & VSOCK_TRANSPORT_F_G2H) {
> >>> + if (t_dgram)
> >>> + pr_warn("virtio_vsock: t_dgram already set\n");
> >>> + t_dgram = t;
> >>> + }
> >>> +
> >>> + if (!t_dgram) {
> >>> + t_dgram = t;
> >>> }
> >>> - t_dgram = t;
> >>> }
> >>>
> >>> if (features & VSOCK_TRANSPORT_F_LOCAL) {
> >>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >>> index 073314312683..d4526ca462d2 100644
> >>> --- a/net/vmw_vsock/virtio_transport.c
> >>> +++ b/net/vmw_vsock/virtio_transport.c
> >>> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> >>> return -ENOMEM;
> >>>
> >>> ret = vsock_core_register(&virtio_transport.transport,
> >>> - VSOCK_TRANSPORT_F_G2H);
> >>> + VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> >>> if (ret)
> >>> goto out_wq;
> >>>
> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>> index bdf16fff054f..aedb48728677 100644
> >>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> >>>
> >>> static u16 virtio_transport_get_type(struct sock *sk)
> >>> {
> >>> - if (sk->sk_type == SOCK_STREAM)
> >>> + if (sk->sk_type == SOCK_DGRAM)
> >>> + return VIRTIO_VSOCK_TYPE_DGRAM;
> >>> + else if (sk->sk_type == SOCK_STREAM)
> >>> return VIRTIO_VSOCK_TYPE_STREAM;
> >>> else
> >>> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >>> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>> vvs = vsk->trans;
> >>>
> >>> /* we can send less than pkt_len bytes */
> >>> - if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >>> - pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>> + if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> >>> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >>> + pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>> + else
> >>> + return 0;
> >>> + }
> >>>
> >>> - /* virtio_transport_get_credit might return less than pkt_len credit */
> >>> - pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >>> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> >>> + /* virtio_transport_get_credit might return less than pkt_len credit */
> >>> + pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >>>
> >>> - /* Do not send zero length OP_RW pkt */
> >>> - if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>> - return pkt_len;
> >>> + /* Do not send zero length OP_RW pkt */
> >>> + if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>> + return pkt_len;
> >>> + }
> >>>
> >>> skb = virtio_transport_alloc_skb(info, pkt_len,
> >>> src_cid, src_port,
> >>> dst_cid, dst_port,
> >>> &err);
> >>> if (!skb) {
> >>> - virtio_transport_put_credit(vvs, pkt_len);
> >>> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >>> + virtio_transport_put_credit(vvs, pkt_len);
> >>> return err;
> >>> }
> >>>
> >>> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> >>> }
> >>> EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> >>>
> >>> +static ssize_t
> >>> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> >>> + struct msghdr *msg, size_t len)
> >>> +{
> >>> + struct virtio_vsock_sock *vvs = vsk->trans;
> >>> + struct sk_buff *skb;
> >>> + size_t total = 0;
> >>> + u32 free_space;
> >>> + int err = -EFAULT;
> >>> +
> >>> + spin_lock_bh(&vvs->rx_lock);
> >>> + if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> >>> + skb = __skb_dequeue(&vvs->rx_queue);
> >>> +
> >>> + total = len;
> >>> + if (total > skb->len - vsock_metadata(skb)->off)
> >>> + total = skb->len - vsock_metadata(skb)->off;
> >>> + else if (total < skb->len - vsock_metadata(skb)->off)
> >>> + msg->msg_flags |= MSG_TRUNC;
> >>> +
> >>> + /* sk_lock is held by caller so no one else can dequeue.
> >>> + * Unlock rx_lock since memcpy_to_msg() may sleep.
> >>> + */
> >>> + spin_unlock_bh(&vvs->rx_lock);
> >>> +
> >>> + err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> >>> + if (err)
> >>> + return err;
> >>> +
> >>> + spin_lock_bh(&vvs->rx_lock);
> >>> +
> >>> + virtio_transport_dec_rx_pkt(vvs, skb);
> >>> + consume_skb(skb);
> >>> + }
> >>> +
> >>> + free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> >>> +
> >>> + spin_unlock_bh(&vvs->rx_lock);
> >>> +
> >>> + if (total > 0 && msg->msg_name) {
> >>> + /* Provide the address of the sender. */
> >>> + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> >>> +
> >>> + vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> >>> + le32_to_cpu(vsock_hdr(skb)->src_port));
> >>> + msg->msg_namelen = sizeof(*vm_addr);
> >>> + }
> >>> + return total;
> >>> +}
> >>> +
> >>> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> >>> +{
> >>> + return virtio_transport_stream_has_data(vsk);
> >>> +}
> >>> +
> >>> int
> >>> virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> >>> struct msghdr *msg,
> >>> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> >>> struct msghdr *msg,
> >>> size_t len, int flags)
> >>> {
> >>> - return -EOPNOTSUPP;
> >>> + struct sock *sk;
> >>> + size_t err = 0;
> >>> + long timeout;
> >>> +
> >>> + DEFINE_WAIT(wait);
> >>> +
> >>> + sk = &vsk->sk;
> >>> + err = 0;
> >>> +
> >>> + if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> >>> + return -EOPNOTSUPP;
> >>> +
> >>> + lock_sock(sk);
> >>> +
> >>> + if (!len)
> >>> + goto out;
> >>> +
> >>> + timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> >>> +
> >>> + while (1) {
> >>> + s64 ready;
> >>> +
> >>> + prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> >>> + ready = virtio_transport_dgram_has_data(vsk);
> >>> +
> >>> + if (ready == 0) {
> >>> + if (timeout == 0) {
> >>> + err = -EAGAIN;
> >>> + finish_wait(sk_sleep(sk), &wait);
> >>> + break;
> >>> + }
> >>> +
> >>> + release_sock(sk);
> >>> + timeout = schedule_timeout(timeout);
> >>> + lock_sock(sk);
> >>> +
> >>> + if (signal_pending(current)) {
> >>> + err = sock_intr_errno(timeout);
> >>> + finish_wait(sk_sleep(sk), &wait);
> >>> + break;
> >>> + } else if (timeout == 0) {
> >>> + err = -EAGAIN;
> >>> + finish_wait(sk_sleep(sk), &wait);
> >>> + break;
> >>> + }
> >>> + } else {
> >>> + finish_wait(sk_sleep(sk), &wait);
> >>> +
> >>> + if (ready < 0) {
> >>> + err = -ENOMEM;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> >>> + break;
> >>> + }
> >>> + }
> >>> +out:
> >>> + release_sock(sk);
> >>> + return err;
> >>> }
> >>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> ^^^
> Maybe this generic data-waiting logic should live in af_vsock.c, as it does for stream/seqpacket?
> In this way, another transport that supports SOCK_DGRAM could reuse it.
I think that is a great idea. I'll test that change for v2.
Thanks.
> >>>
> >>> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> >>> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> >>> struct sockaddr_vm *addr)
> >>> {
> >>> - return -EOPNOTSUPP;
> >>> + return vsock_bind_stream(vsk, addr);
> >>> }
> >>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> >>>
> >>> bool virtio_transport_dgram_allow(u32 cid, u32 port)
> >>> {
> >>> - return false;
> >>> + return true;
> >>> }
> >>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> >>>
> >>> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> >>> struct msghdr *msg,
> >>> size_t dgram_len)
> >>> {
> >>> - return -EOPNOTSUPP;
> >>> + struct virtio_vsock_pkt_info info = {
> >>> + .op = VIRTIO_VSOCK_OP_RW,
> >>> + .msg = msg,
> >>> + .pkt_len = dgram_len,
> >>> + .vsk = vsk,
> >>> + .remote_cid = remote_addr->svm_cid,
> >>> + .remote_port = remote_addr->svm_port,
> >>> + };
> >>> +
> >>> + return virtio_transport_send_pkt_info(vsk, &info);
> >>> }
> >>> EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> >>>
> >>> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> >>> struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> >>> int err = 0;
> >>>
> >>> + if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> >>> + virtio_transport_recv_enqueue(vsk, skb);
> >>> + sk->sk_data_ready(sk);
> >>> + return err;
> >>> + }
> >>> +
> >>> switch (le16_to_cpu(hdr->op)) {
> >>> case VIRTIO_VSOCK_OP_RW:
> >>> virtio_transport_recv_enqueue(vsk, skb);
> >>> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> >>> static bool virtio_transport_valid_type(u16 type)
> >>> {
> >>> return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> >>> - (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> >>> + (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> >>> + (type == VIRTIO_VSOCK_TYPE_DGRAM);
> >>> }
> >>>
> >>> /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> >>> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >>> goto free_pkt;
> >>> }
> >>>
> >>> + if (sk->sk_type == SOCK_DGRAM) {
> >>> + virtio_transport_recv_connected(sk, skb);
> >>> + goto out;
> >>> + }
> >>> +
> >>> space_available = virtio_transport_space_update(sk, skb);
> >>>
> >>> /* Update CID in case it has changed after a transport reset event */
> >>> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >>> break;
> >>> }
> >>>
> >>> +out:
> >>> release_sock(sk);
> >>>
> >>> /* Release refcnt obtained when we fetched this socket out of the
> >>> --
> >>> 2.35.1
> >>>
> >>
> >
>
On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > The basic question to answer then is this: with a net device qdisc
> > etc in the picture, how is this different from virtio net then?
> > Why do you still want to use vsock?
> >
>
> When using virtio-net, users looking for inter-VM communication are
> required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> etc... and then finally when you have a network, you can open a socket
> on an IP address and port. This is the configuration that vsock avoids.
> For vsock, we just need a CID and a port, but no network configuration.
Surely when you mention DNS you are going overboard? vsock doesn't
remove the need for DNS as much as it does not support it.
--
MST
On Wed, Aug 17, 2022 at 02:54:33AM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> > Hey everybody,
> >
> > This series introduces datagrams, packet scheduling, and sk_buff usage
> > to virtio vsock.
> >
> > The usage of struct sk_buff benefits users by a) preparing vsock to use
> > other related systems that require sk_buff, such as sockmap and qdisc,
> > b) supporting basic congestion control via sock_alloc_send_skb, and c)
> > reducing copying when delivering packets to TAP.
> >
> > The socket layer no longer forces errors to be -ENOMEM, as typically
> > userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> > messages are being sent with option MSG_DONTWAIT.
> >
> > The datagram work is based off previous patches by Jiang Wang[1].
> >
> > The introduction of datagrams creates a transport layer fairness issue
> > where datagrams may freely starve streams of queue access. This happens
> > because, unlike streams, datagrams lack the transactions necessary for
> > calculating credits and throttling.
> >
> > Previous proposals introduce changes to the spec to add an additional
> > virtqueue pair for datagrams[1]. Although this solution works, using
> > Linux's qdisc for packet scheduling leverages already existing systems,
> > avoids the need to change the virtio specification, and gives additional
> > capabilities. The usage of SFQ or fq_codel, for example, may solve the
> > transport layer starvation problem. It is easy to imagine other use
> > cases as well. For example, services of varying importance may be
> > assigned different priorities, and qdisc will apply appropriate
> > priority-based scheduling. By default, the system default pfifo qdisc is
> > used. The qdisc may be bypassed and legacy queuing is resumed by simply
> > setting the virtio-vsock%d network device to state DOWN. This technique
> > still allows vsock to work with zero-configuration.
>
> The basic question to answer then is this: with a net device qdisc
> etc in the picture, how is this different from virtio net then?
> Why do you still want to use vsock?
>
When using virtio-net, users looking for inter-VM communication have to
set up bridges and TAPs, allocate IP addresses or configure DNS, and so
on; only once the network exists can they open a socket on an IP address
and port. This is the configuration that vsock avoids: vsock needs just
a CID and a port, with no network configuration.
This benefit still exists after introducing a netdev to vsock. The major
added benefit is that when many vsock flows run in parallel and you
observe issues such as starvation and tail latency caused by pure FIFO
queuing, there is now a mechanism to fix them. You might recall such an
issue discussed here[1].
[1]: https://gitlab.com/vsock/vsock/-/issues/1
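To make the qdisc point concrete, here is a hypothetical configuration fragment. It assumes the netdev created by this series follows the virtio-vsock%d template and comes up as virtio-vsock0; only standard tooling is used:

```shell
# Replace the default pfifo with fq_codel for per-flow fairness
# (device name is an assumption based on the virtio-vsock%d template).
tc qdisc replace dev virtio-vsock0 root fq_codel

# Inspect the qdisc currently in use.
tc qdisc show dev virtio-vsock0

# Bypass the netdev/qdisc path entirely and fall back to legacy queuing.
ip link set dev virtio-vsock0 down
```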
> > In summary, this series introduces these major changes to vsock:
> >
> > - virtio vsock supports datagrams
> > - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
> > - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
> > which applies the throttling threshold sk_sndbuf.
> > - The vsock socket layer supports returning errors other than -ENOMEM.
> > - This is used to return -EAGAIN when the sk_sndbuf threshold is
> > reached.
> > - virtio vsock uses a net_device, through which qdisc may be used.
> > - qdisc allows scheduling policies to be applied to vsock flows.
> > - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
> > it may avoid datagrams from flooding out stream flows. The benefit
> > to this is that additional virtqueues are not needed for datagrams.
> > - The net_device and qdisc is bypassed by simply setting the
> > net_device state to DOWN.
> >
> > [1]: https://lore.kernel.org/all/[email protected]/
> >
> > Bobby Eshleman (5):
> > vsock: replace virtio_vsock_pkt with sk_buff
> > vsock: return errors other than -ENOMEM to socket
> > vsock: add netdev to vhost/virtio vsock
> > virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> > virtio/vsock: add support for dgram
> >
> > Jiang Wang (1):
> > vsock_test: add tests for vsock dgram
> >
> > drivers/vhost/vsock.c | 238 ++++----
> > include/linux/virtio_vsock.h | 73 ++-
> > include/net/af_vsock.h | 2 +
> > include/uapi/linux/virtio_vsock.h | 2 +
> > net/vmw_vsock/af_vsock.c | 30 +-
> > net/vmw_vsock/hyperv_transport.c | 2 +-
> > net/vmw_vsock/virtio_transport.c | 237 +++++---
> > net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> > net/vmw_vsock/vmci_transport.c | 9 +-
> > net/vmw_vsock/vsock_loopback.c | 51 +-
> > tools/testing/vsock/util.c | 105 ++++
> > tools/testing/vsock/util.h | 4 +
> > tools/testing/vsock/vsock_test.c | 195 ++++++
> > 13 files changed, 1176 insertions(+), 543 deletions(-)
> >
> > --
> > 2.35.1
>
On Wed, Aug 17, 2022 at 05:01:00AM +0000, Arseniy Krasnov wrote:
> On 16.08.2022 05:32, Bobby Eshleman wrote:
> > CC'ing [email protected]
> >
> > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> >> This patch supports dgram in virtio and on the vhost side.
> Hello,
>
> sorry, I don't understand how this maintains message boundaries? Or is
> it unnecessary for SOCK_DGRAM?
>
> Thanks
If I understand your question correctly: the payload length is included
in the header, so a receiver always knows that header start + header
size + payload length marks the message boundary.
> >>
> >> Signed-off-by: Jiang Wang <[email protected]>
> >> Signed-off-by: Bobby Eshleman <[email protected]>
> >> ---
> >> drivers/vhost/vsock.c | 2 +-
> >> include/net/af_vsock.h | 2 +
> >> include/uapi/linux/virtio_vsock.h | 1 +
> >> net/vmw_vsock/af_vsock.c | 26 +++-
> >> net/vmw_vsock/virtio_transport.c | 2 +-
> >> net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> >> 6 files changed, 186 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >> index a5d1bdb786fe..3dc72a5647ca 100644
> >> --- a/drivers/vhost/vsock.c
> >> +++ b/drivers/vhost/vsock.c
> >> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> >> int ret;
> >>
> >> ret = vsock_core_register(&vhost_transport.transport,
> >> - VSOCK_TRANSPORT_F_H2G);
> >> + VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> >> if (ret < 0)
> >> return ret;
> >>
> >> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >> index 1c53c4c4d88f..37e55c81e4df 100644
> >> --- a/include/net/af_vsock.h
> >> +++ b/include/net/af_vsock.h
> >> @@ -78,6 +78,8 @@ struct vsock_sock {
> >> s64 vsock_stream_has_data(struct vsock_sock *vsk);
> >> s64 vsock_stream_has_space(struct vsock_sock *vsk);
> >> struct sock *vsock_create_connected(struct sock *parent);
> >> +int vsock_bind_stream(struct vsock_sock *vsk,
> >> + struct sockaddr_vm *addr);
> >>
> >> /**** TRANSPORT ****/
> >>
> >> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> >> index 857df3a3a70d..0975b9c88292 100644
> >> --- a/include/uapi/linux/virtio_vsock.h
> >> +++ b/include/uapi/linux/virtio_vsock.h
> >> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> >> enum virtio_vsock_type {
> >> VIRTIO_VSOCK_TYPE_STREAM = 1,
> >> VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> >> + VIRTIO_VSOCK_TYPE_DGRAM = 3,
> >> };
> >>
> >> enum virtio_vsock_op {
> >> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >> index 1893f8aafa48..87e4ae1866d3 100644
> >> --- a/net/vmw_vsock/af_vsock.c
> >> +++ b/net/vmw_vsock/af_vsock.c
> >> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> >> return 0;
> >> }
> >>
> >> +int vsock_bind_stream(struct vsock_sock *vsk,
> >> + struct sockaddr_vm *addr)
> >> +{
> >> + int retval;
> >> +
> >> + spin_lock_bh(&vsock_table_lock);
> >> + retval = __vsock_bind_connectible(vsk, addr);
> >> + spin_unlock_bh(&vsock_table_lock);
> >> +
> >> + return retval;
> >> +}
> >> +EXPORT_SYMBOL(vsock_bind_stream);
> >> +
> >> static int __vsock_bind_dgram(struct vsock_sock *vsk,
> >> struct sockaddr_vm *addr)
> >> {
> >> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> >> }
> >>
> >> if (features & VSOCK_TRANSPORT_F_DGRAM) {
> >> - if (t_dgram) {
> >> - err = -EBUSY;
> >> - goto err_busy;
> >> + /* TODO: always choose the G2H variant over others, support nesting later */
> >> + if (features & VSOCK_TRANSPORT_F_G2H) {
> >> + if (t_dgram)
> >> + pr_warn("virtio_vsock: t_dgram already set\n");
> >> + t_dgram = t;
> >> + }
> >> +
> >> + if (!t_dgram) {
> >> + t_dgram = t;
> >> }
> >> - t_dgram = t;
> >> }
> >>
> >> if (features & VSOCK_TRANSPORT_F_LOCAL) {
> >> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >> index 073314312683..d4526ca462d2 100644
> >> --- a/net/vmw_vsock/virtio_transport.c
> >> +++ b/net/vmw_vsock/virtio_transport.c
> >> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> >> return -ENOMEM;
> >>
> >> ret = vsock_core_register(&virtio_transport.transport,
> >> - VSOCK_TRANSPORT_F_G2H);
> >> + VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> >> if (ret)
> >> goto out_wq;
> >>
> >> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >> index bdf16fff054f..aedb48728677 100644
> >> --- a/net/vmw_vsock/virtio_transport_common.c
> >> +++ b/net/vmw_vsock/virtio_transport_common.c
> >> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> >>
> >> static u16 virtio_transport_get_type(struct sock *sk)
> >> {
> >> - if (sk->sk_type == SOCK_STREAM)
> >> + if (sk->sk_type == SOCK_DGRAM)
> >> + return VIRTIO_VSOCK_TYPE_DGRAM;
> >> + else if (sk->sk_type == SOCK_STREAM)
> >> return VIRTIO_VSOCK_TYPE_STREAM;
> >> else
> >> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >> vvs = vsk->trans;
> >>
> >> /* we can send less than pkt_len bytes */
> >> - if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >> - pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >> + if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> >> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >> + pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >> + else
> >> + return 0;
> >> + }
> >>
> >> - /* virtio_transport_get_credit might return less than pkt_len credit */
> >> - pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> >> + /* virtio_transport_get_credit might return less than pkt_len credit */
> >> + pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >>
> >> - /* Do not send zero length OP_RW pkt */
> >> - if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >> - return pkt_len;
> >> + /* Do not send zero length OP_RW pkt */
> >> + if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >> + return pkt_len;
> >> + }
> >>
> >> skb = virtio_transport_alloc_skb(info, pkt_len,
> >> src_cid, src_port,
> >> dst_cid, dst_port,
> >> &err);
> >> if (!skb) {
> >> - virtio_transport_put_credit(vvs, pkt_len);
> >> + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >> + virtio_transport_put_credit(vvs, pkt_len);
> >> return err;
> >> }
> >>
> >> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> >> }
> >> EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> >>
> >> +static ssize_t
> >> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> >> + struct msghdr *msg, size_t len)
> >> +{
> >> + struct virtio_vsock_sock *vvs = vsk->trans;
> >> + struct sk_buff *skb;
> >> + size_t total = 0;
> >> + u32 free_space;
> >> + int err = -EFAULT;
> >> +
> >> + spin_lock_bh(&vvs->rx_lock);
> >> + if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> >> + skb = __skb_dequeue(&vvs->rx_queue);
> >> +
> >> + total = len;
> >> + if (total > skb->len - vsock_metadata(skb)->off)
> >> + total = skb->len - vsock_metadata(skb)->off;
> >> + else if (total < skb->len - vsock_metadata(skb)->off)
> >> + msg->msg_flags |= MSG_TRUNC;
> >> +
> >> + /* sk_lock is held by caller so no one else can dequeue.
> >> + * Unlock rx_lock since memcpy_to_msg() may sleep.
> >> + */
> >> + spin_unlock_bh(&vvs->rx_lock);
> >> +
> >> + err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> >> + if (err)
> >> + return err;
> >> +
> >> + spin_lock_bh(&vvs->rx_lock);
> >> +
> >> + virtio_transport_dec_rx_pkt(vvs, skb);
> >> + consume_skb(skb);
> >> + }
> >> +
> >> + free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> >> +
> >> + spin_unlock_bh(&vvs->rx_lock);
> >> +
> >> + if (total > 0 && msg->msg_name) {
> >> + /* Provide the address of the sender. */
> >> + DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> >> +
> >> + vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> >> + le32_to_cpu(vsock_hdr(skb)->src_port));
> >> + msg->msg_namelen = sizeof(*vm_addr);
> >> + }
> >> + return total;
> >> +}
> >> +
> >> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> >> +{
> >> + return virtio_transport_stream_has_data(vsk);
> >> +}
> >> +
> >> int
> >> virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> >> struct msghdr *msg,
> >> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> >> struct msghdr *msg,
> >> size_t len, int flags)
> >> {
> >> - return -EOPNOTSUPP;
> >> + struct sock *sk;
> >> + size_t err = 0;
> >> + long timeout;
> >> +
> >> + DEFINE_WAIT(wait);
> >> +
> >> + sk = &vsk->sk;
> >> + err = 0;
> >> +
> >> + if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> >> + return -EOPNOTSUPP;
> >> +
> >> + lock_sock(sk);
> >> +
> >> + if (!len)
> >> + goto out;
> >> +
> >> + timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> >> +
> >> + while (1) {
> >> + s64 ready;
> >> +
> >> + prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> >> + ready = virtio_transport_dgram_has_data(vsk);
> >> +
> >> + if (ready == 0) {
> >> + if (timeout == 0) {
> >> + err = -EAGAIN;
> >> + finish_wait(sk_sleep(sk), &wait);
> >> + break;
> >> + }
> >> +
> >> + release_sock(sk);
> >> + timeout = schedule_timeout(timeout);
> >> + lock_sock(sk);
> >> +
> >> + if (signal_pending(current)) {
> >> + err = sock_intr_errno(timeout);
> >> + finish_wait(sk_sleep(sk), &wait);
> >> + break;
> >> + } else if (timeout == 0) {
> >> + err = -EAGAIN;
> >> + finish_wait(sk_sleep(sk), &wait);
> >> + break;
> >> + }
> >> + } else {
> >> + finish_wait(sk_sleep(sk), &wait);
> >> +
> >> + if (ready < 0) {
> >> + err = -ENOMEM;
> >> + goto out;
> >> + }
> >> +
> >> + err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> >> + break;
> >> + }
> >> + }
> >> +out:
> >> + release_sock(sk);
> >> + return err;
> >> }
> >> EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> >>
> >> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> >> int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> >> struct sockaddr_vm *addr)
> >> {
> >> - return -EOPNOTSUPP;
> >> + return vsock_bind_stream(vsk, addr);
> >> }
> >> EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> >>
> >> bool virtio_transport_dgram_allow(u32 cid, u32 port)
> >> {
> >> - return false;
> >> + return true;
> >> }
> >> EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> >>
> >> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> >> struct msghdr *msg,
> >> size_t dgram_len)
> >> {
> >> - return -EOPNOTSUPP;
> >> + struct virtio_vsock_pkt_info info = {
> >> + .op = VIRTIO_VSOCK_OP_RW,
> >> + .msg = msg,
> >> + .pkt_len = dgram_len,
> >> + .vsk = vsk,
> >> + .remote_cid = remote_addr->svm_cid,
> >> + .remote_port = remote_addr->svm_port,
> >> + };
> >> +
> >> + return virtio_transport_send_pkt_info(vsk, &info);
> >> }
> >> EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> >>
> >> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> >> struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> >> int err = 0;
> >>
> >> + if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> >> + virtio_transport_recv_enqueue(vsk, skb);
> >> + sk->sk_data_ready(sk);
> >> + return err;
> >> + }
> >> +
> >> switch (le16_to_cpu(hdr->op)) {
> >> case VIRTIO_VSOCK_OP_RW:
> >> virtio_transport_recv_enqueue(vsk, skb);
> >> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> >> static bool virtio_transport_valid_type(u16 type)
> >> {
> >> return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> >> - (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> >> + (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> >> + (type == VIRTIO_VSOCK_TYPE_DGRAM);
> >> }
> >>
> >> /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> >> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >> goto free_pkt;
> >> }
> >>
> >> + if (sk->sk_type == SOCK_DGRAM) {
> >> + virtio_transport_recv_connected(sk, skb);
> >> + goto out;
> >> + }
> >> +
> >> space_available = virtio_transport_space_update(sk, skb);
> >>
> >> /* Update CID in case it has changed after a transport reset event */
> >> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >> break;
> >> }
> >>
> >> +out:
> >> release_sock(sk);
> >>
> >> /* Release refcnt obtained when we fetched this socket out of the
> >> --
> >> 2.35.1
> >>
> >
>
On Tue, Aug 16, 2022 at 10:50:55AM +0000, Bobby Eshleman wrote:
> > > > Eh, I was hoping it was a side channel of an existing virtio_net
> > > > which is not the case. Given the zero-config requirement IDK if
> > > > we'll be able to fit this into netdev semantics :(
> > >
> > > It's certainly possible that it may not fit :/ I feel that it partially
> > > depends on what we mean by zero-config. Is it "no config required to
> > > have a working socket" or is it "no config required, but also no
> > > tuning/policy/etc... supported"?
> >
> > The value of tuning vs confusion of a strange netdev floating around
> > in the system is hard to estimate upfront.
>
> I think "a strange netdev floating around" is a total
> mischaracterization... vsock is a networking device and it supports
> vsock networks. Sure, it is a virtual device and the routing is done in
> host software, but the same is true for virtio-net and VM-to-VM vlan.
>
> This patch actually uses netdev for its intended purpose: to support and
> manage the transmission of packets via a network device to a network.
>
> Furthermore, it actually prepares vsock to eliminate a "strange" use of
> a netdev. The netdev in vsockmon isn't even used to transmit
> packets, it's "floating around" for no other reason than it is needed to
> support packet capture, which vsock couldn't support because it didn't
> have a netdev.
>
> Something smells when we are required to build workaround kernel modules
> that use netdev for siphoning packets off to userspace, when we could
> instead be using netdev for its intended purpose and get the same and
> more benefit.
So what happens when userspace inevitably attempts to bind a raw
packet socket to this device? Assign it an IP? Set up some firewall
rules?
These things all need to be addressed before merging since they affect UAPI.
> >
> > The nice thing about using a built-in fq with no user visible knobs is
> > that there's no extra uAPI. We can always rip it out and replace later.
> > And it shouldn't be controversial, making the path to upstream smoother.
>
> The issue is that after pulling in fq for one kind of flow management,
> then as users observe other flow issues, we will need to re-implement
> pfifo, and then TBF, and then we need to build an interface to let users
> select one, and to choose queue sizes... and then after a while we've
> needlessly re-implemented huge chunks of the tc system.
>
> I don't see any good reason to restrict vsock users to using suboptimal
> and rigid queuing.
>
> Thanks.
On Tue, Aug 16, 2022 at 06:15:28PM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 08:29:04 +0000 Bobby Eshleman wrote:
> > > We've been burnt in the past by people doing the "let me just pick
> > > these useful pieces out of netdev" thing. Makes life hard both for
> > > maintainers and users trying to make sense of the interfaces.
> > >
> > > What comes to mind if you're just after queuing is that we already
> > > bastardized the CoDel implementation (include/net/codel_impl.h).
> > > If CoDel is good enough for you maybe that's the easiest way?
> > > Although I suspect that you're after fairness not early drops.
> > > Wireless folks use CoDel as a second layer queuing. (CC: Toke)
> >
> > That is certainly interesting to me. Sitting next to "codel_impl.h" is
> > "include/net/fq_impl.h", and it looks like it may solve the datagram
> > flooding issue. The downside to this approach is the baking of a
> > specific policy into vsock... which I don't exactly love either.
> >
> > I'm not seeing too many other of these qdisc bastardizations in
> > include/net, are there any others that you are aware of?
>
> Just what wireless uses (so codel and fq as you found out), nothing
> else comes to mind.
>
> > > Eh, I was hoping it was a side channel of an existing virtio_net
> > > which is not the case. Given the zero-config requirement IDK if
> > > we'll be able to fit this into netdev semantics :(
> >
> > It's certainly possible that it may not fit :/ I feel that it partially
> > depends on what we mean by zero-config. Is it "no config required to
> > have a working socket" or is it "no config required, but also no
> > tuning/policy/etc... supported"?
>
> The value of tuning vs confusion of a strange netdev floating around
> in the system is hard to estimate upfront.
I think "a strange netdev floating around" is a total
mischaracterization... vsock is a networking device and it supports
vsock networks. Sure, it is a virtual device and the routing is done in
host software, but the same is true for virtio-net and VM-to-VM vlan.
This patch actually uses netdev for its intended purpose: to support and
manage the transmission of packets via a network device to a network.
Furthermore, it actually prepares vsock to eliminate a "strange" use of
a netdev. The netdev in vsockmon isn't even used to transmit
packets, it's "floating around" for no other reason than it is needed to
support packet capture, which vsock couldn't support because it didn't
have a netdev.
Something smells when we are required to build workaround kernel modules
that use netdev for siphoning packets off to userspace, when we could
instead be using netdev for its intended purpose and get the same and
more benefit.
>
> The nice thing about using a built-in fq with no user visible knobs is
> that there's no extra uAPI. We can always rip it out and replace later.
> And it shouldn't be controversial, making the path to upstream smoother.
The issue is that after pulling in fq for one kind of flow management,
then as users observe other flow issues, we will need to re-implement
pfifo, and then TBF, and then we need to build an interface to let users
select one, and to choose queue sizes... and then after a while we've
needlessly re-implemented huge chunks of the tc system.
I don't see any good reason to restrict vsock users to using suboptimal
and rigid queuing.
Thanks.
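For concreteness, the flexibility being argued for here is just the standard tc workflow. A sketch, assuming the series' virtio-vsock%d device comes up as virtio-vsock0 (the device name and the chosen qdiscs below are illustrative, not something the series mandates):

```shell
# Inspect the default qdisc attached to the vsock netdev.
tc qdisc show dev virtio-vsock0

# Replace it with fq_codel, e.g. to keep datagram floods from
# starving stream flows:
tc qdisc replace dev virtio-vsock0 root fq_codel

# Or give services of varying importance their own bands:
tc qdisc replace dev virtio-vsock0 root handle 1: prio

# Or bypass qdisc entirely and fall back to legacy queuing:
ip link set dev virtio-vsock0 down
```

No new uAPI is involved; these are the existing tc/iproute2 interfaces applied to the new device.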
On Wed, Aug 17, 2022 at 01:02:52PM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > > The basic question to answer then is this: with a net device qdisc
> > > etc in the picture, how is this different from virtio net then?
> > > Why do you still want to use vsock?
> > >
> >
> > When using virtio-net, users looking for inter-VM communication are
> > required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> > etc... and then finally when you have a network, you can open a socket
> > on an IP address and port. This is the configuration that vsock avoids.
> > For vsock, we just need a CID and a port, but no network configuration.
>
> Surely when you mention DNS you are going overboard? vsock doesn't
> remove the need for DNS as much as it does not support it.
>
Oops, s/DNS/dhcp.
On Tue, Aug 16, 2022 at 11:08:26AM +0000, Bobby Eshleman wrote:
> On Wed, Aug 17, 2022 at 01:02:52PM -0400, Michael S. Tsirkin wrote:
> > On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > > > The basic question to answer then is this: with a net device qdisc
> > > > etc in the picture, how is this different from virtio net then?
> > > > Why do you still want to use vsock?
> > > >
> > >
> > > When using virtio-net, users looking for inter-VM communication are
> > > required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> > > etc... and then finally when you have a network, you can open a socket
> > > on an IP address and port. This is the configuration that vsock avoids.
> > > For vsock, we just need a CID and a port, but no network configuration.
> >
> > Surely when you mention DNS you are going overboard? vsock doesn't
> > remove the need for DNS as much as it does not support it.
> >
>
> Oops, s/DNS/dhcp.
That too.
--
MST
On Wed, Aug 17, 2022 at 01:53:32PM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 16, 2022 at 11:08:26AM +0000, Bobby Eshleman wrote:
> > On Wed, Aug 17, 2022 at 01:02:52PM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > > > > The basic question to answer then is this: with a net device qdisc
> > > > > etc in the picture, how is this different from virtio net then?
> > > > > Why do you still want to use vsock?
> > > > >
> > > >
> > > > When using virtio-net, users looking for inter-VM communication are
> > > > required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> > > > etc... and then finally when you have a network, you can open a socket
> > > > on an IP address and port. This is the configuration that vsock avoids.
> > > > For vsock, we just need a CID and a port, but no network configuration.
> > >
> > > Surely when you mention DNS you are going overboard? vsock doesn't
> > > remove the need for DNS as much as it does not support it.
> > >
> >
> > Oops, s/DNS/dhcp.
>
> That too.
>
Sure, setting up dhcp would be overboard for just inter-VM comms.
It is fair to mention that vsock CIDs also need to be managed /
allocated somehow.
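For contrast, the configuration gap being discussed is visible from the socket API itself: a vsock endpoint is addressed by nothing more than a (CID, port) pair. A minimal sketch, noting that the port number is arbitrary and actually opening the socket requires a vsock-capable Linux kernel, which is why that part is only shown in comments:

```python
import socket

# vsock addressing is just (CID, port): no bridge, TAP, IP address,
# or DHCP setup. CID 2 is the well-known host CID from the vsock
# uapi headers.
VMADDR_CID_HOST = 2

def vsock_addr(cid: int, port: int):
    """Build the sockaddr tuple an AF_VSOCK socket expects."""
    return (cid, port)

addr = vsock_addr(VMADDR_CID_HOST, 1234)

# Sending requires a vsock-capable kernel and a listener, so it is
# sketched here rather than executed:
#   s = socket.socket(socket.AF_VSOCK, socket.SOCK_DGRAM)
#   s.sendto(b"ping", addr)
```

The CID plays the role an IP address would, which is why CID allocation is the one piece of management that remains.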
On 2022/8/17 14:54, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>> Hey everybody,
>>
>> This series introduces datagrams, packet scheduling, and sk_buff usage
>> to virtio vsock.
>>
>> The usage of struct sk_buff benefits users by a) preparing vsock to use
>> other related systems that require sk_buff, such as sockmap and qdisc,
>> b) supporting basic congestion control via sock_alloc_send_skb, and c)
>> reducing copying when delivering packets to TAP.
>>
>> The socket layer no longer forces errors to be -ENOMEM, as typically
>> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
>> messages are being sent with option MSG_DONTWAIT.
>>
>> The datagram work is based off previous patches by Jiang Wang[1].
>>
>> The introduction of datagrams creates a transport layer fairness issue
>> where datagrams may freely starve streams of queue access. This happens
>> because, unlike streams, datagrams lack the transactions necessary for
>> calculating credits and throttling.
>>
>> Previous proposals introduce changes to the spec to add an additional
>> virtqueue pair for datagrams[1]. Although this solution works, using
>> Linux's qdisc for packet scheduling leverages already existing systems,
>> avoids the need to change the virtio specification, and gives additional
>> capabilities. The usage of SFQ or fq_codel, for example, may solve the
>> transport layer starvation problem. It is easy to imagine other use
>> cases as well. For example, services of varying importance may be
>> assigned different priorities, and qdisc will apply appropriate
>> priority-based scheduling. By default, the system default pfifo qdisc is
>> used. The qdisc may be bypassed and legacy queuing is resumed by simply
>> setting the virtio-vsock%d network device to state DOWN. This technique
>> still allows vsock to work with zero-configuration.
> The basic question to answer then is this: with a net device qdisc
> etc in the picture, how is this different from virtio net then?
> Why do you still want to use vsock?
Or maybe it's time to revisit an old idea[1] to unify at least the
driver part (e.g. using the virtio-net driver for vsock, so we get all
the features that vsock is lacking now)?
Thanks
[1]
https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039783.html
>
>> In summary, this series introduces these major changes to vsock:
>>
>> - virtio vsock supports datagrams
>> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
>> - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
>> which applies the throttling threshold sk_sndbuf.
>> - The vsock socket layer supports returning errors other than -ENOMEM.
>> - This is used to return -EAGAIN when the sk_sndbuf threshold is
>> reached.
>> - virtio vsock uses a net_device, through which qdisc may be used.
>> - qdisc allows scheduling policies to be applied to vsock flows.
>> - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
>> it may avoid datagrams from flooding out stream flows. The benefit
>> to this is that additional virtqueues are not needed for datagrams.
>> - The net_device and qdisc is bypassed by simply setting the
>> net_device state to DOWN.
>>
>> [1]: https://lore.kernel.org/all/[email protected]/
>>
>> Bobby Eshleman (5):
>> vsock: replace virtio_vsock_pkt with sk_buff
>> vsock: return errors other than -ENOMEM to socket
>> vsock: add netdev to vhost/virtio vsock
>> virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>> virtio/vsock: add support for dgram
>>
>> Jiang Wang (1):
>> vsock_test: add tests for vsock dgram
>>
>> drivers/vhost/vsock.c | 238 ++++----
>> include/linux/virtio_vsock.h | 73 ++-
>> include/net/af_vsock.h | 2 +
>> include/uapi/linux/virtio_vsock.h | 2 +
>> net/vmw_vsock/af_vsock.c | 30 +-
>> net/vmw_vsock/hyperv_transport.c | 2 +-
>> net/vmw_vsock/virtio_transport.c | 237 +++++---
>> net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
>> net/vmw_vsock/vmci_transport.c | 9 +-
>> net/vmw_vsock/vsock_loopback.c | 51 +-
>> tools/testing/vsock/util.c | 105 ++++
>> tools/testing/vsock/util.h | 4 +
>> tools/testing/vsock/vsock_test.c | 195 ++++++
>> 13 files changed, 1176 insertions(+), 543 deletions(-)
>>
>> --
>> 2.35.1
On 2022/8/18 01:20, Michael S. Tsirkin wrote:
> On Tue, Aug 16, 2022 at 10:50:55AM +0000, Bobby Eshleman wrote:
>>>>> Eh, I was hoping it was a side channel of an existing virtio_net
>>>>> which is not the case. Given the zero-config requirement IDK if
>>>>> we'll be able to fit this into netdev semantics :(
>>>> It's certainly possible that it may not fit :/ I feel that it partially
>>>> depends on what we mean by zero-config. Is it "no config required to
>>>> have a working socket" or is it "no config required, but also no
>>>> tuning/policy/etc... supported"?
>>> The value of tuning vs confusion of a strange netdev floating around
>>> in the system is hard to estimate upfront.
>> I think "a strange netdev floating around" is a total
>> mischaracterization... vsock is a networking device and it supports
>> vsock networks. Sure, it is a virtual device and the routing is done in
>> host software, but the same is true for virtio-net and VM-to-VM vlan.
>>
>> This patch actually uses netdev for its intended purpose: to support and
>> manage the transmission of packets via a network device to a network.
>>
>> Furthermore, it actually prepares vsock to eliminate a "strange" use of
>> a netdev. The netdev in vsockmon isn't even used to transmit
>> packets, it's "floating around" for no other reason than it is needed to
>> support packet capture, which vsock couldn't support because it didn't
>> have a netdev.
>>
>> Something smells when we are required to build workaround kernel modules
>> that use netdev for siphoning packets off to userspace, when we could
>> instead be using netdev for its intended purpose and get the same and
>> more benefit.
> So what happens when userspace inevitably attempts to bind a raw
> packet socket to this device? Assign it an IP? Set up some firewall
> rules?
>
> These things all need to be addressed before merging since they affect UAPI.
It's possible if we:
1) extend virtio-net to have vsock queues
2) present a vsock device on top of virtio-net via e.g. the auxiliary bus
Then raw sockets still work at the ethernet level while vsock works too.
The value is sharing code between the two types of devices (queues).
Thanks
>
>>> The nice thing about using a built-in fq with no user visible knobs is
>>> that there's no extra uAPI. We can always rip it out and replace later.
>>> And it shouldn't be controversial, making the path to upstream smoother.
>> The issue is that after pulling in fq for one kind of flow management,
>> then as users observe other flow issues, we will need to re-implement
>> pfifo, and then TBF, and then we need to build an interface to let users
>> select one, and to choose queue sizes... and then after a while we've
>> needlessly re-implemented huge chunks of the tc system.
>>
>> I don't see any good reason to restrict vsock users to using suboptimal
>> and rigid queuing.
>>
>> Thanks.
On Tue, 2022-08-16 at 09:57 +0000, Bobby Eshleman wrote:
> On Wed, Aug 17, 2022 at 05:01:00AM +0000, Arseniy Krasnov wrote:
> > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > CC'ing [email protected]
> > >
> > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> > > > This patch supports dgram in virtio and on the vhost side.
> > Hello,
> >
> > sorry, I don't understand how this maintains message boundaries. Or is
> > it unnecessary for SOCK_DGRAM?
> >
> > Thanks
>
> If I understand your question, the length is included in the header, so
> receivers always know that header start + header length + payload length
> marks the message boundary.
I mean, consider the following case: the host sends a 5kb packet to the
guest. The guest uses 4kb virtio rx buffers, so in drivers/vhost/vsock.c
this 5kb packet (i.e. its payload) will be placed into 2 virtio rx
buffers: 4kb in the first buffer and the remaining 1kb in the second. Is
it guaranteed that the receiver gets the whole 5kb of data in a single
read()/recv() system call?
Thanks
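The scenario being asked about can be sketched in userspace terms: if the header's length field advertises the full datagram size, a receiver can keep consuming rx buffer fragments until that length is satisfied. This is only an illustration of the boundary logic under discussion (the function and its arguments are hypothetical, not part of the series):

```python
def reassemble(header_len, buffers):
    """Collect payload fragments until the length from the header is met.

    header_len: payload length advertised in the packet header.
    buffers: iterable of rx buffer payload chunks, in arrival order.
    """
    out = bytearray()
    for chunk in buffers:
        need = header_len - len(out)
        if need <= 0:
            break
        # Take only what this datagram still needs from the buffer.
        out += chunk[:need]
    if len(out) != header_len:
        raise ValueError("short datagram: missing fragments")
    return bytes(out)

# The 5kb-datagram case: a full 4kb rx buffer plus a 1kb remainder.
frags = [b"a" * 4096, b"b" * 1024]
msg = reassemble(5120, frags)
```

Whether the kernel side actually performs this reassembly before waking the reader is exactly the open question in this subthread.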
>
> > > > Signed-off-by: Jiang Wang <[email protected]>
> > > > Signed-off-by: Bobby Eshleman <[email protected]>
> > > > ---
> > > > drivers/vhost/vsock.c | 2 +-
> > > > include/net/af_vsock.h | 2 +
> > > > include/uapi/linux/virtio_vsock.h | 1 +
> > > > net/vmw_vsock/af_vsock.c | 26 +++-
> > > > net/vmw_vsock/virtio_transport.c | 2 +-
> > > > net/vmw_vsock/virtio_transport_common.c | 173
> > > > ++++++++++++++++++++++--
> > > > 6 files changed, 186 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > --- a/drivers/vhost/vsock.c
> > > > +++ b/drivers/vhost/vsock.c
> > > > @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> > > > int ret;
> > > >
> > > > ret = vsock_core_register(&vhost_transport.transport,
> > > > - VSOCK_TRANSPORT_F_H2G);
> > > > + VSOCK_TRANSPORT_F_H2G |
> > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > if (ret < 0)
> > > > return ret;
> > > >
> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > --- a/include/net/af_vsock.h
> > > > +++ b/include/net/af_vsock.h
> > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > > s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > > s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > > struct sock *vsock_create_connected(struct sock *parent);
> > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > + struct sockaddr_vm *addr);
> > > >
> > > > /**** TRANSPORT ****/
> > > >
> > > > diff --git a/include/uapi/linux/virtio_vsock.h
> > > > b/include/uapi/linux/virtio_vsock.h
> > > > index 857df3a3a70d..0975b9c88292 100644
> > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > > enum virtio_vsock_type {
> > > > VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > > VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > + VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > > };
> > > >
> > > > enum virtio_vsock_op {
> > > > diff --git a/net/vmw_vsock/af_vsock.c
> > > > b/net/vmw_vsock/af_vsock.c
> > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > --- a/net/vmw_vsock/af_vsock.c
> > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct
> > > > vsock_sock *vsk,
> > > > return 0;
> > > > }
> > > >
> > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > + struct sockaddr_vm *addr)
> > > > +{
> > > > + int retval;
> > > > +
> > > > + spin_lock_bh(&vsock_table_lock);
> > > > + retval = __vsock_bind_connectible(vsk, addr);
> > > > + spin_unlock_bh(&vsock_table_lock);
> > > > +
> > > > + return retval;
> > > > +}
> > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > +
> > > > static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > > struct sockaddr_vm *addr)
> > > > {
> > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct
> > > > vsock_transport *t, int features)
> > > > }
> > > >
> > > > if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > - if (t_dgram) {
> > > > - err = -EBUSY;
> > > > - goto err_busy;
> > > > + /* TODO: always chose the G2H variant over
> > > > others, support nesting later */
> > > > + if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > + if (t_dgram)
> > > > + pr_warn("virtio_vsock: t_dgram
> > > > already set\n");
> > > > + t_dgram = t;
> > > > + }
> > > > +
> > > > + if (!t_dgram) {
> > > > + t_dgram = t;
> > > > }
> > > > - t_dgram = t;
> > > > }
> > > >
> > > > if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > diff --git a/net/vmw_vsock/virtio_transport.c
> > > > b/net/vmw_vsock/virtio_transport.c
> > > > index 073314312683..d4526ca462d2 100644
> > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> > > > return -ENOMEM;
> > > >
> > > > ret = vsock_core_register(&virtio_transport.transport,
> > > > - VSOCK_TRANSPORT_F_G2H);
> > > > + VSOCK_TRANSPORT_F_G2H |
> > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > if (ret)
> > > > goto out_wq;
> > > >
> > > > diff --git a/net/vmw_vsock/virtio_transport_common.c
> > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > index bdf16fff054f..aedb48728677 100644
> > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > @@ -229,7 +229,9 @@
> > > > EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > >
> > > > static u16 virtio_transport_get_type(struct sock *sk)
> > > > {
> > > > - if (sk->sk_type == SOCK_STREAM)
> > > > + if (sk->sk_type == SOCK_DGRAM)
> > > > + return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > + else if (sk->sk_type == SOCK_STREAM)
> > > > return VIRTIO_VSOCK_TYPE_STREAM;
> > > > else
> > > > return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > @@ -287,22 +289,29 @@ static int
> > > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > > vvs = vsk->trans;
> > > >
> > > > /* we can send less than pkt_len bytes */
> > > > - if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > - pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > + if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > > > + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > + pkt_len =
> > > > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > + else
> > > > + return 0;
> > > > + }
> > > >
> > > > - /* virtio_transport_get_credit might return less than
> > > > pkt_len credit */
> > > > - pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > + /* virtio_transport_get_credit might return
> > > > less than pkt_len credit */
> > > > + pkt_len = virtio_transport_get_credit(vvs,
> > > > pkt_len);
> > > >
> > > > - /* Do not send zero length OP_RW pkt */
> > > > - if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > - return pkt_len;
> > > > + /* Do not send zero length OP_RW pkt */
> > > > + if (pkt_len == 0 && info->op ==
> > > > VIRTIO_VSOCK_OP_RW)
> > > > + return pkt_len;
> > > > + }
> > > >
> > > > skb = virtio_transport_alloc_skb(info, pkt_len,
> > > > src_cid, src_port,
> > > > dst_cid, dst_port,
> > > > &err);
> > > > if (!skb) {
> > > > - virtio_transport_put_credit(vvs, pkt_len);
> > > > + if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > + virtio_transport_put_credit(vvs,
> > > > pkt_len);
> > > > return err;
> > > > }
> > > >
> > > > @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> > > >
> > > > +static ssize_t
> > > > +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> > > > +				  struct msghdr *msg, size_t len)
> > > > +{
> > > > +	struct virtio_vsock_sock *vvs = vsk->trans;
> > > > +	struct sk_buff *skb;
> > > > +	size_t total = 0;
> > > > +	u32 free_space;
> > > > +	int err = -EFAULT;
> > > > +
> > > > +	spin_lock_bh(&vvs->rx_lock);
> > > > +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> > > > +		skb = __skb_dequeue(&vvs->rx_queue);
> > > > +
> > > > +		total = len;
> > > > +		if (total > skb->len - vsock_metadata(skb)->off)
> > > > +			total = skb->len - vsock_metadata(skb)->off;
> > > > +		else if (total < skb->len - vsock_metadata(skb)->off)
> > > > +			msg->msg_flags |= MSG_TRUNC;
> > > > +
> > > > +		/* sk_lock is held by caller so no one else can dequeue.
> > > > +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
> > > > +		 */
> > > > +		spin_unlock_bh(&vvs->rx_lock);
> > > > +
> > > > +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> > > > +		if (err)
> > > > +			return err;
> > > > +
> > > > +		spin_lock_bh(&vvs->rx_lock);
> > > > +
> > > > +		virtio_transport_dec_rx_pkt(vvs, skb);
> > > > +		consume_skb(skb);
> > > > +	}
> > > > +
> > > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> > > > +
> > > > +	spin_unlock_bh(&vvs->rx_lock);
> > > > +
> > > > +	if (total > 0 && msg->msg_name) {
> > > > +		/* Provide the address of the sender. */
> > > > +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > > > +
> > > > +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> > > > +				le32_to_cpu(vsock_hdr(skb)->src_port));
> > > > +		msg->msg_namelen = sizeof(*vm_addr);
> > > > +	}
> > > > +	return total;
> > > > +}
> > > > +
> > > > +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> > > > +{
> > > > +	return virtio_transport_stream_has_data(vsk);
> > > > +}
> > > > +
> > > >  int
> > > >  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> > > >  				   struct msghdr *msg,
> > > > @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> > > >  			       struct msghdr *msg,
> > > >  			       size_t len, int flags)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	struct sock *sk;
> > > > +	size_t err = 0;
> > > > +	long timeout;
> > > > +
> > > > +	DEFINE_WAIT(wait);
> > > > +
> > > > +	sk = &vsk->sk;
> > > > +	err = 0;
> > > > +
> > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	lock_sock(sk);
> > > > +
> > > > +	if (!len)
> > > > +		goto out;
> > > > +
> > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > +
> > > > +	while (1) {
> > > > +		s64 ready;
> > > > +
> > > > +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > +
> > > > +		if (ready == 0) {
> > > > +			if (timeout == 0) {
> > > > +				err = -EAGAIN;
> > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > +				break;
> > > > +			}
> > > > +
> > > > +			release_sock(sk);
> > > > +			timeout = schedule_timeout(timeout);
> > > > +			lock_sock(sk);
> > > > +
> > > > +			if (signal_pending(current)) {
> > > > +				err = sock_intr_errno(timeout);
> > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > +				break;
> > > > +			} else if (timeout == 0) {
> > > > +				err = -EAGAIN;
> > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > +				break;
> > > > +			}
> > > > +		} else {
> > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > +
> > > > +			if (ready < 0) {
> > > > +				err = -ENOMEM;
> > > > +				goto out;
> > > > +			}
> > > > +
> > > > +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +out:
> > > > +	release_sock(sk);
> > > > +	return err;
> > > >  }
> > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > > >
> > > > @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > >  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > >  				struct sockaddr_vm *addr)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	return vsock_bind_stream(vsk, addr);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > >
> > > >  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > >  {
> > > > -	return false;
> > > > +	return true;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > >
> > > > @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> > > >  			       struct msghdr *msg,
> > > >  			       size_t dgram_len)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	struct virtio_vsock_pkt_info info = {
> > > > +		.op = VIRTIO_VSOCK_OP_RW,
> > > > +		.msg = msg,
> > > > +		.pkt_len = dgram_len,
> > > > +		.vsk = vsk,
> > > > +		.remote_cid = remote_addr->svm_cid,
> > > > +		.remote_port = remote_addr->svm_port,
> > > > +	};
> > > > +
> > > > +	return virtio_transport_send_pkt_info(vsk, &info);
> > > >  }
> > > >
> > > > @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> > > >  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > >  	int err = 0;
> > > >
> > > > +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > +		virtio_transport_recv_enqueue(vsk, skb);
> > > > +		sk->sk_data_ready(sk);
> > > > +		return err;
> > > > +	}
> > > > +
> > > >  	switch (le16_to_cpu(hdr->op)) {
> > > >  	case VIRTIO_VSOCK_OP_RW:
> > > >  		virtio_transport_recv_enqueue(vsk, skb);
> > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> > > >  static bool virtio_transport_valid_type(u16 type)
> > > >  {
> > > >  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > >  }
> > > >
> > > >  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> > > > @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> > > >  		goto free_pkt;
> > > >  	}
> > > >
> > > > +	if (sk->sk_type == SOCK_DGRAM) {
> > > > +		virtio_transport_recv_connected(sk, skb);
> > > > +		goto out;
> > > > +	}
> > > > +
> > > >  	space_available = virtio_transport_space_update(sk, skb);
> > > >
> > > >  	/* Update CID in case it has changed after a transport reset event */
> > > > @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> > > >  		break;
> > > >  	}
> > > >
> > > > +out:
> > > >  	release_sock(sk);
> > > >
> > > >  	/* Release refcnt obtained when we fetched this socket out of the
> > > > --
> > > > 2.35.1
> > > >
> > >
> > > ---------------------------------------------------------------
> > > ------
> > > To unsubscribe, e-mail:
> > > [email protected]
> > > For additional commands, e-mail:
> > > [email protected]
> > >
On Tue, 2022-08-16 at 09:58 +0000, Bobby Eshleman wrote:
> On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> > On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > > CC'ing [email protected]
> > > >
> > > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> > > > > This patch supports dgram in virtio and on the vhost side.
> > > Hello,
> > >
> > > Sorry, I don't understand how this maintains message boundaries. Or is
> > > that unnecessary for SOCK_DGRAM?
> > >
> > > Thanks
> > > > > Signed-off-by: Jiang Wang <[email protected]>
> > > > > Signed-off-by: Bobby Eshleman <[email protected]>
> > > > > ---
> > > > >  drivers/vhost/vsock.c                   |   2 +-
> > > > >  include/net/af_vsock.h                  |   2 +
> > > > >  include/uapi/linux/virtio_vsock.h       |   1 +
> > > > >  net/vmw_vsock/af_vsock.c                |  26 +++-
> > > > >  net/vmw_vsock/virtio_transport.c        |   2 +-
> > > > >  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> > > > >  6 files changed, 186 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > > --- a/drivers/vhost/vsock.c
> > > > > +++ b/drivers/vhost/vsock.c
> > > > > @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> > > > >  	int ret;
> > > > >
> > > > >  	ret = vsock_core_register(&vhost_transport.transport,
> > > > > -				  VSOCK_TRANSPORT_F_H2G);
> > > > > +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> > > > >  	if (ret < 0)
> > > > >  		return ret;
> > > > >
> > > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > > --- a/include/net/af_vsock.h
> > > > > +++ b/include/net/af_vsock.h
> > > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > > >  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > > >  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > > >  struct sock *vsock_create_connected(struct sock *parent);
> > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > +		      struct sockaddr_vm *addr);
> > > > >
> > > > >  /**** TRANSPORT ****/
> > > > >
> > > > > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> > > > > index 857df3a3a70d..0975b9c88292 100644
> > > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > > >  enum virtio_vsock_type {
> > > > >  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > > >  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > > +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > > >  };
> > > > >
> > > > >  enum virtio_vsock_op {
> > > > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > > --- a/net/vmw_vsock/af_vsock.c
> > > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > > @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > +		      struct sockaddr_vm *addr)
> > > > > +{
> > > > > +	int retval;
> > > > > +
> > > > > +	spin_lock_bh(&vsock_table_lock);
> > > > > +	retval = __vsock_bind_connectible(vsk, addr);
> > > > > +	spin_unlock_bh(&vsock_table_lock);
> > > > > +
> > > > > +	return retval;
> > > > > +}
> > > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > > +
> > > > >  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > > >  			      struct sockaddr_vm *addr)
> > > > >  {
> > > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> > > > >  	}
> > > > >
> > > > >  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > > -		if (t_dgram) {
> > > > > -			err = -EBUSY;
> > > > > -			goto err_busy;
> > > > > +		/* TODO: always chose the G2H variant over others, support nesting later */
> > > > > +		if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > > +			if (t_dgram)
> > > > > +				pr_warn("virtio_vsock: t_dgram already set\n");
> > > > > +			t_dgram = t;
> > > > > +		}
> > > > > +
> > > > > +		if (!t_dgram) {
> > > > > +			t_dgram = t;
> > > > >  		}
> > > > > -		t_dgram = t;
> > > > >  	}
> > > > >
> > > > >  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > > > > index 073314312683..d4526ca462d2 100644
> > > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > > @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> > > > >  		return -ENOMEM;
> > > > >
> > > > >  	ret = vsock_core_register(&virtio_transport.transport,
> > > > > -				  VSOCK_TRANSPORT_F_G2H);
> > > > > +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> > > > >  	if (ret)
> > > > >  		goto out_wq;
> > > > >
> > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > > > > index bdf16fff054f..aedb48728677 100644
> > > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > > @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > > >
> > > > >  static u16 virtio_transport_get_type(struct sock *sk)
> > > > >  {
> > > > > -	if (sk->sk_type == SOCK_STREAM)
> > > > > +	if (sk->sk_type == SOCK_DGRAM)
> > > > > +		return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > > +	else if (sk->sk_type == SOCK_STREAM)
> > > > >  		return VIRTIO_VSOCK_TYPE_STREAM;
> > > > >  	else
> > > > >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > > [snip: virtio_transport_send_pkt_info and datagram dequeue helper hunks, quoted in full earlier in the thread]
> > > > > @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> > > > >  			       struct msghdr *msg,
> > > > >  			       size_t len, int flags)
> > > > >  {
> > > > > -	return -EOPNOTSUPP;
> > > > > +	struct sock *sk;
> > > > > +	size_t err = 0;
> > > > > +	long timeout;
> > > > > +
> > > > > +	DEFINE_WAIT(wait);
> > > > > +
> > > > > +	sk = &vsk->sk;
> > > > > +	err = 0;
> > > > > +
> > > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	lock_sock(sk);
> > > > > +
> > > > > +	if (!len)
> > > > > +		goto out;
> > > > > +
> > > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > > +
> > > > > +	while (1) {
> > > > > +		s64 ready;
> > > > > +
> > > > > +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> > > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > > +
> > > > > +		if (ready == 0) {
> > > > > +			if (timeout == 0) {
> > > > > +				err = -EAGAIN;
> > > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > > +				break;
> > > > > +			}
> > > > > +
> > > > > +			release_sock(sk);
> > > > > +			timeout = schedule_timeout(timeout);
> > > > > +			lock_sock(sk);
> > > > > +
> > > > > +			if (signal_pending(current)) {
> > > > > +				err = sock_intr_errno(timeout);
> > > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > > +				break;
> > > > > +			} else if (timeout == 0) {
> > > > > +				err = -EAGAIN;
> > > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > > +				break;
> > > > > +			}
> > > > > +		} else {
> > > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > > +
> > > > > +			if (ready < 0) {
> > > > > +				err = -ENOMEM;
> > > > > +				goto out;
> > > > > +			}
> > > > > +
> > > > > +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +out:
> > > > > +	release_sock(sk);
> > > > > +	return err;
> > > > >  }
> > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > ^^^
> > Maybe this generic data-waiting logic should be in af_vsock.c, as it is
> > for stream/seqpacket? That way, another transport that supports
> > SOCK_DGRAM could reuse it.
>
> I think that is a great idea. I'll test that change for v2.
>
> Thanks.
Also for v2, I tested your patchset a little (writing it all here rather
than spreading it over several mails):
1) The seqpacket test in vsock_test.c fails (it looks like an MSG_EOR
flag issue).
2) I can't rmmod the modules after testing with the following config:
CONFIG_VSOCKETS=m
CONFIG_VIRTIO_VSOCKETS=m
CONFIG_VIRTIO_VSOCKETS_COMMON=m
CONFIG_VHOST=m
CONFIG_VHOST_VSOCK=m
The guest is shut down, but rmmod still fails.
3) virtio_transport_init + virtio_transport_exit seem to need
EXPORT_SYMBOL_GPL(), because both are used from another module.
4) I tried to send a 5 KB (or 20 KB, it doesn't matter) piece of data,
but got a kernel panic, first in the guest and later in the host.
Thank you
On Thu, Aug 18, 2022 at 08:35:48AM +0000, Arseniy Krasnov wrote:
> On Tue, 2022-08-16 at 09:58 +0000, Bobby Eshleman wrote:
> > On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> > > On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > > > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > > > CC'ing [email protected]
> > > > >
> > > > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> > > > > > This patch supports dgram in virtio and on the vhost side.
> > > > Hello,
> > > >
> > > > Sorry, I don't understand how this maintains message boundaries. Or is
> > > > that unnecessary for SOCK_DGRAM?
> > > >
> > > > Thanks
> > > > > > [snip: full patch quoted earlier in the thread]
> > > > > > + goto out;
> > > > > > +
> > > > > > + timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > > > +
> > > > > > + while (1) {
> > > > > > + s64 ready;
> > > > > > +
> > > > > > + prepare_to_wait(sk_sleep(sk), &wait,
> > > > > > TASK_INTERRUPTIBLE);
> > > > > > + ready = virtio_transport_dgram_has_data(vsk);
> > > > > > +
> > > > > > + if (ready == 0) {
> > > > > > + if (timeout == 0) {
> > > > > > + err = -EAGAIN;
> > > > > > + finish_wait(sk_sleep(sk),
> > > > > > &wait);
> > > > > > + break;
> > > > > > + }
> > > > > > +
> > > > > > + release_sock(sk);
> > > > > > + timeout = schedule_timeout(timeout);
> > > > > > + lock_sock(sk);
> > > > > > +
> > > > > > + if (signal_pending(current)) {
> > > > > > + err = sock_intr_errno(timeout);
> > > > > > + finish_wait(sk_sleep(sk),
> > > > > > &wait);
> > > > > > + break;
> > > > > > + } else if (timeout == 0) {
> > > > > > + err = -EAGAIN;
> > > > > > + finish_wait(sk_sleep(sk),
> > > > > > &wait);
> > > > > > + break;
> > > > > > + }
> > > > > > + } else {
> > > > > > + finish_wait(sk_sleep(sk), &wait);
> > > > > > +
> > > > > > + if (ready < 0) {
> > > > > > + err = -ENOMEM;
> > > > > > + goto out;
> > > > > > + }
> > > > > > +
> > > > > > + err =
> > > > > > virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > > > + break;
> > > > > > + }
> > > > > > + }
> > > > > > +out:
> > > > > > + release_sock(sk);
> > > > > > + return err;
> > > > > > }
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > > ^^^
> > > May be, this generic data waiting logic should be in af_vsock.c, as
> > > for stream/seqpacket?
> > > In this way, another transport which supports SOCK_DGRAM could
> > > reuse it.
> >
> > I think that is a great idea. I'll test that change for v2.
> >
> > Thanks.
>
> Also for v2, i tested Your patchset a little bit(write here to not
> spread over all mails):
> 1) seqpacket test in vsock_test.c fails(seems MSG_EOR flag issue)
I will investigate.
> 2) i can't do rmmod with the following config(after testing):
> CONFIG_VSOCKETS=m
> CONFIG_VIRTIO_VSOCKETS=m
> CONFIG_VIRTIO_VSOCKETS_COMMON=m
> CONFIG_VHOST=m
> CONFIG_VHOST_VSOCK=m
> Guest is shutdown, but rmmod fails.
> 3) virtio_transport_init + virtio_transport_exit seems must be
> under EXPORT_SYMBOL_GPL(), because both used in another module.
Definitely, will fix.
> 4) I tried to send 5kb(or 20kb not matter) piece of data, but got
> kernel panic both in guest and later in host.
>
Thanks for catching that. I can reproduce it intermittently, but only
for seqpacket. Did you happen to see this for other socket types as
well?
Thanks
> Thank You
> >
> > > > > >
> > > > > > @@ -819,13 +942,13 @@
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > > > > int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > > > > struct sockaddr_vm *addr)
> > > > > > {
> > > > > > - return -EOPNOTSUPP;
> > > > > > + return vsock_bind_stream(vsk, addr);
> > > > > > }
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > > > >
> > > > > > bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > > > > {
> > > > > > - return false;
> > > > > > + return true;
> > > > > > }
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > > > >
> > > > > > @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct
> > > > > > vsock_sock *vsk,
> > > > > > struct msghdr *msg,
> > > > > > size_t dgram_len)
> > > > > > {
> > > > > > - return -EOPNOTSUPP;
> > > > > > + struct virtio_vsock_pkt_info info = {
> > > > > > + .op = VIRTIO_VSOCK_OP_RW,
> > > > > > + .msg = msg,
> > > > > > + .pkt_len = dgram_len,
> > > > > > + .vsk = vsk,
> > > > > > + .remote_cid = remote_addr->svm_cid,
> > > > > > + .remote_port = remote_addr->svm_port,
> > > > > > + };
> > > > > > +
> > > > > > + return virtio_transport_send_pkt_info(vsk, &info);
> > > > > > }
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > > > > >
> > > > > > @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct
> > > > > > sock *sk,
> > > > > > struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > > > > int err = 0;
> > > > > >
> > > > > > + if (le16_to_cpu(vsock_hdr(skb)->type) ==
> > > > > > VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > + virtio_transport_recv_enqueue(vsk, skb);
> > > > > > + sk->sk_data_ready(sk);
> > > > > > + return err;
> > > > > > + }
> > > > > > +
> > > > > > switch (le16_to_cpu(hdr->op)) {
> > > > > > case VIRTIO_VSOCK_OP_RW:
> > > > > > virtio_transport_recv_enqueue(vsk, skb);
> > > > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct
> > > > > > sock *sk, struct sk_buff *skb,
> > > > > > static bool virtio_transport_valid_type(u16 type)
> > > > > > {
> > > > > > return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > > > - (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > > > + (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > > > + (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > > > > }
> > > > > >
> > > > > > /* We are under the virtio-vsock's vsock->rx_lock or vhost-
> > > > > > vsock's vq->mutex
> > > > > > @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct
> > > > > > virtio_transport *t,
> > > > > > goto free_pkt;
> > > > > > }
> > > > > >
> > > > > > + if (sk->sk_type == SOCK_DGRAM) {
> > > > > > + virtio_transport_recv_connected(sk, skb);
> > > > > > + goto out;
> > > > > > + }
> > > > > > +
> > > > > > space_available = virtio_transport_space_update(sk,
> > > > > > skb);
> > > > > >
> > > > > > /* Update CID in case it has changed after a transport
> > > > > > reset event */
> > > > > > @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct
> > > > > > virtio_transport *t,
> > > > > > break;
> > > > > > }
> > > > > >
> > > > > > +out:
> > > > > > release_sock(sk);
> > > > > >
> > > > > > /* Release refcnt obtained when we fetched this socket
> > > > > > out of the
> > > > > > --
> > > > > > 2.35.1
> > > > > >
> > > > >
> > > > > -------------------------------------------------------------
> > > > > --------
> > > > > To unsubscribe, e-mail:
> > > > > [email protected]
> > > > > For additional commands, e-mail:
> > > > > [email protected]
> > > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
On Tue, 2022-08-16 at 20:52 +0000, Bobby Eshleman wrote:
> On Thu, Aug 18, 2022 at 08:35:48AM +0000, Arseniy Krasnov wrote:
> > On Tue, 2022-08-16 at 09:58 +0000, Bobby Eshleman wrote:
> > > On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> > > > On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > > > > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > > > > CC'ing [email protected]
> > > > > >
> > > > > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman
> > > > > > wrote:
> > > > > > > This patch supports dgram in virtio and on the vhost
> > > > > > > side.
> > > > > Hello,
> > > > >
> > > > > Sorry, I don't understand: how does this maintain message
> > > > > boundaries? Or is that unnecessary for SOCK_DGRAM?
> > > > >
> > > > > Thanks
> > > > > > > Signed-off-by: Jiang Wang <[email protected]>
> > > > > > > Signed-off-by: Bobby Eshleman <
> > > > > > > [email protected]>
> > > > > > > ---
> > > > > > >  drivers/vhost/vsock.c                   |   2 +-
> > > > > > >  include/net/af_vsock.h                  |   2 +
> > > > > > >  include/uapi/linux/virtio_vsock.h       |   1 +
> > > > > > >  net/vmw_vsock/af_vsock.c                |  26 +++-
> > > > > > >  net/vmw_vsock/virtio_transport.c        |   2 +-
> > > > > > >  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> > > > > > >  6 files changed, 186 insertions(+), 20 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vhost/vsock.c
> > > > > > > b/drivers/vhost/vsock.c
> > > > > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > > > > --- a/drivers/vhost/vsock.c
> > > > > > > +++ b/drivers/vhost/vsock.c
> > > > > > > @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> > > > > > >  	int ret;
> > > > > > >  
> > > > > > >  	ret = vsock_core_register(&vhost_transport.transport,
> > > > > > > -				  VSOCK_TRANSPORT_F_H2G);
> > > > > > > +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> > > > > > >  	if (ret < 0)
> > > > > > >  		return ret;
> > > > > > >
> > > > > > > diff --git a/include/net/af_vsock.h
> > > > > > > b/include/net/af_vsock.h
> > > > > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > > > > --- a/include/net/af_vsock.h
> > > > > > > +++ b/include/net/af_vsock.h
> > > > > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > > > > > s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > > > > > s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > > > > > struct sock *vsock_create_connected(struct sock
> > > > > > > *parent);
> > > > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > > > + struct sockaddr_vm *addr);
> > > > > > >
> > > > > > > /**** TRANSPORT ****/
> > > > > > >
> > > > > > > diff --git a/include/uapi/linux/virtio_vsock.h
> > > > > > > b/include/uapi/linux/virtio_vsock.h
> > > > > > > index 857df3a3a70d..0975b9c88292 100644
> > > > > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > > > > > enum virtio_vsock_type {
> > > > > > > VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > > > > > VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > > > > + VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > > > > > };
> > > > > > >
> > > > > > > enum virtio_vsock_op {
> > > > > > > diff --git a/net/vmw_vsock/af_vsock.c
> > > > > > > b/net/vmw_vsock/af_vsock.c
> > > > > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > > > > --- a/net/vmw_vsock/af_vsock.c
> > > > > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > > > > @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > > > + struct sockaddr_vm *addr)
> > > > > > > +{
> > > > > > > + int retval;
> > > > > > > +
> > > > > > > + spin_lock_bh(&vsock_table_lock);
> > > > > > > + retval = __vsock_bind_connectible(vsk, addr);
> > > > > > > + spin_unlock_bh(&vsock_table_lock);
> > > > > > > +
> > > > > > > + return retval;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > > > > +
> > > > > > > static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > > > > > struct sockaddr_vm *addr)
> > > > > > > {
> > > > > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > > > > -		if (t_dgram) {
> > > > > > > -			err = -EBUSY;
> > > > > > > -			goto err_busy;
> > > > > > > +		/* TODO: always chose the G2H variant over others, support nesting later */
> > > > > > > +		if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > > > > +			if (t_dgram)
> > > > > > > +				pr_warn("virtio_vsock: t_dgram already set\n");
> > > > > > > +			t_dgram = t;
> > > > > > > +		}
> > > > > > > +
> > > > > > > +		if (!t_dgram) {
> > > > > > > +			t_dgram = t;
> > > > > > >  		}
> > > > > > > -		t_dgram = t;
> > > > > > >  	}
> > > > > > >
> > > > > > > if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > > > > diff --git a/net/vmw_vsock/virtio_transport.c
> > > > > > > b/net/vmw_vsock/virtio_transport.c
> > > > > > > index 073314312683..d4526ca462d2 100644
> > > > > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > > > > @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> > > > > > >  		return -ENOMEM;
> > > > > > >  
> > > > > > >  	ret = vsock_core_register(&virtio_transport.transport,
> > > > > > > -				  VSOCK_TRANSPORT_F_G2H);
> > > > > > > +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> > > > > > >  	if (ret)
> > > > > > >  		goto out_wq;
> > > > > > >
> > > > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c
> > > > > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > > > > index bdf16fff054f..aedb48728677 100644
> > > > > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > > > > @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > > > > >  
> > > > > > >  static u16 virtio_transport_get_type(struct sock *sk)
> > > > > > >  {
> > > > > > > -	if (sk->sk_type == SOCK_STREAM)
> > > > > > > +	if (sk->sk_type == SOCK_DGRAM)
> > > > > > > +		return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > > > > +	else if (sk->sk_type == SOCK_STREAM)
> > > > > > >  		return VIRTIO_VSOCK_TYPE_STREAM;
> > > > > > >  	else
> > > > > > >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > > > > @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > > > > >  	vvs = vsk->trans;
> > > > > > >  
> > > > > > >  	/* we can send less than pkt_len bytes */
> > > > > > > -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > > > > -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > > > +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > > > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > > > +			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > > > +		else
> > > > > > > +			return 0;
> > > > > > > +	}
> > > > > > >  
> > > > > > > -	/* virtio_transport_get_credit might return less than pkt_len credit */
> > > > > > > -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > > > > +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > > +		/* virtio_transport_get_credit might return less than pkt_len credit */
> > > > > > > +		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > > > >  
> > > > > > > -	/* Do not send zero length OP_RW pkt */
> > > > > > > -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > > > > -		return pkt_len;
> > > > > > > +		/* Do not send zero length OP_RW pkt */
> > > > > > > +		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > > > > +			return pkt_len;
> > > > > > > +	}
> > > > > > >  
> > > > > > >  	skb = virtio_transport_alloc_skb(info, pkt_len,
> > > > > > >  					 src_cid, src_port,
> > > > > > >  					 dst_cid, dst_port,
> > > > > > >  					 &err);
> > > > > > >  	if (!skb) {
> > > > > > > -		virtio_transport_put_credit(vvs, pkt_len);
> > > > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > > > +			virtio_transport_put_credit(vvs, pkt_len);
> > > > > > >  		return err;
> > > > > > >  	}
> > > > > > >
> > > > > > > @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> > > > > > >  
> > > > > > > +static ssize_t
> > > > > > > +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> > > > > > > +				  struct msghdr *msg, size_t len)
> > > > > > > +{
> > > > > > > +	struct virtio_vsock_sock *vvs = vsk->trans;
> > > > > > > +	struct sk_buff *skb;
> > > > > > > +	size_t total = 0;
> > > > > > > +	u32 free_space;
> > > > > > > +	int err = -EFAULT;
> > > > > > > +
> > > > > > > +	spin_lock_bh(&vvs->rx_lock);
> > > > > > > +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> > > > > > > +		skb = __skb_dequeue(&vvs->rx_queue);
> > > > > > > +
> > > > > > > +		total = len;
> > > > > > > +		if (total > skb->len - vsock_metadata(skb)->off)
> > > > > > > +			total = skb->len - vsock_metadata(skb)->off;
> > > > > > > +		else if (total < skb->len - vsock_metadata(skb)->off)
> > > > > > > +			msg->msg_flags |= MSG_TRUNC;
> > > > > > > +
> > > > > > > +		/* sk_lock is held by caller so no one else can dequeue.
> > > > > > > +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
> > > > > > > +		 */
> > > > > > > +		spin_unlock_bh(&vvs->rx_lock);
> > > > > > > +
> > > > > > > +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> > > > > > > +		if (err)
> > > > > > > +			return err;
> > > > > > > +
> > > > > > > +		spin_lock_bh(&vvs->rx_lock);
> > > > > > > +
> > > > > > > +		virtio_transport_dec_rx_pkt(vvs, skb);
> > > > > > > +		consume_skb(skb);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> > > > > > > +
> > > > > > > +	spin_unlock_bh(&vvs->rx_lock);
> > > > > > > +
> > > > > > > +	if (total > 0 && msg->msg_name) {
> > > > > > > +		/* Provide the address of the sender. */
> > > > > > > +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > > > > > > +
> > > > > > > +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> > > > > > > +				le32_to_cpu(vsock_hdr(skb)->src_port));
> > > > > > > +		msg->msg_namelen = sizeof(*vm_addr);
> > > > > > > +	}
> > > > > > > +	return total;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> > > > > > > +{
> > > > > > > +	return virtio_transport_stream_has_data(vsk);
> > > > > > > +}
> > > > > > > +
> > > > > > >  int
> > > > > > >  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> > > > > > >  				   struct msghdr *msg,
> > > > > > > @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> > > > > > >  			       struct msghdr *msg,
> > > > > > >  			       size_t len, int flags)
> > > > > > >  {
> > > > > > > -	return -EOPNOTSUPP;
> > > > > > > +	struct sock *sk;
> > > > > > > +	size_t err = 0;
> > > > > > > +	long timeout;
> > > > > > > +
> > > > > > > +	DEFINE_WAIT(wait);
> > > > > > > +
> > > > > > > +	sk = &vsk->sk;
> > > > > > > +	err = 0;
> > > > > > > +
> > > > > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> > > > > > > +		return -EOPNOTSUPP;
> > > > > > > +
> > > > > > > +	lock_sock(sk);
> > > > > > > +
> > > > > > > +	if (!len)
> > > > > > > +		goto out;
> > > > > > > +
> > > > > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > > > > +
> > > > > > > +	while (1) {
> > > > > > > +		s64 ready;
> > > > > > > +
> > > > > > > +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> > > > > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > > > > +
> > > > > > > +		if (ready == 0) {
> > > > > > > +			if (timeout == 0) {
> > > > > > > +				err = -EAGAIN;
> > > > > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +
> > > > > > > +			release_sock(sk);
> > > > > > > +			timeout = schedule_timeout(timeout);
> > > > > > > +			lock_sock(sk);
> > > > > > > +
> > > > > > > +			if (signal_pending(current)) {
> > > > > > > +				err = sock_intr_errno(timeout);
> > > > > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > > > > +				break;
> > > > > > > +			} else if (timeout == 0) {
> > > > > > > +				err = -EAGAIN;
> > > > > > > +				finish_wait(sk_sleep(sk), &wait);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +		} else {
> > > > > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > > > > +
> > > > > > > +			if (ready < 0) {
> > > > > > > +				err = -ENOMEM;
> > > > > > > +				goto out;
> > > > > > > +			}
> > > > > > > +
> > > > > > > +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +out:
> > > > > > > +	release_sock(sk);
> > > > > > > +	return err;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > > > ^^^
> > > > May be, this generic data waiting logic should be in
> > > > af_vsock.c, as
> > > > for stream/seqpacket?
> > > > In this way, another transport which supports SOCK_DGRAM could
> > > > reuse it.
> > >
> > > I think that is a great idea. I'll test that change for v2.
> > >
> > > Thanks.
> >
> > Also for v2, I tested your patchset a little bit (writing here so as not to
> > spread it over all the mails):
> > 1) The seqpacket test in vsock_test.c fails (seems to be a MSG_EOR flag issue)
>
> I will investigate.
>
> > 2) I can't do rmmod with the following config (after testing):
> > CONFIG_VSOCKETS=m
> > CONFIG_VIRTIO_VSOCKETS=m
> > CONFIG_VIRTIO_VSOCKETS_COMMON=m
> > CONFIG_VHOST=m
> > CONFIG_VHOST_VSOCK=m
> > Guest is shutdown, but rmmod fails.
> > 3) virtio_transport_init + virtio_transport_exit seem like they must be
> > exported with EXPORT_SYMBOL_GPL(), because both are used in another module.
>
> Definitely, will fix.
>
> > 4) I tried to send a 5kb (or 20kb, it doesn't matter) piece of data, but
> > got a kernel panic both in the guest and later in the host.
> >
>
> Thanks for catching that. I can reproduce it intermittently, but only
> for seqpacket. Did you happen to see this for other socket types as
> well?
>
> Thanks
I got this for SOCK_DGRAM; I didn't test seqpacket or stream.
Thanks, Arseniy
>
> > Thank You
> > > > > > >
> > > > > > > @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > > > > > int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > > > > > struct sockaddr_vm *addr)
> > > > > > > {
> > > > > > > - return -EOPNOTSUPP;
> > > > > > > + return vsock_bind_stream(vsk, addr);
> > > > > > > }
> > > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > > > > >
> > > > > > > bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > > > > > {
> > > > > > > - return false;
> > > > > > > + return true;
> > > > > > > }
> > > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > > > > >
> > > > > > > @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> > > > > > > struct msghdr *msg,
> > > > > > > size_t dgram_len)
> > > > > > > {
> > > > > > > - return -EOPNOTSUPP;
> > > > > > > + struct virtio_vsock_pkt_info info = {
> > > > > > > + .op = VIRTIO_VSOCK_OP_RW,
> > > > > > > + .msg = msg,
> > > > > > > + .pkt_len = dgram_len,
> > > > > > > + .vsk = vsk,
> > > > > > > + .remote_cid = remote_addr->svm_cid,
> > > > > > > + .remote_port = remote_addr->svm_port,
> > > > > > > + };
> > > > > > > +
> > > > > > > + return virtio_transport_send_pkt_info(vsk, &info);
> > > > > > > }
> > > > > > > EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > > > > > >
> > > > > > > @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> > > > > > >  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > > > > >  	int err = 0;
> > > > > > >  
> > > > > > > +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > > +		virtio_transport_recv_enqueue(vsk, skb);
> > > > > > > +		sk->sk_data_ready(sk);
> > > > > > > +		return err;
> > > > > > > +	}
> > > > > > > +
> > > > > > > switch (le16_to_cpu(hdr->op)) {
> > > > > > > case VIRTIO_VSOCK_OP_RW:
> > > > > > > virtio_transport_recv_enqueue(vsk, skb);
> > > > > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> > > > > > >  static bool virtio_transport_valid_type(u16 type)
> > > > > > >  {
> > > > > > >  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > > > > -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > > > > +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > > > > +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > > > > > }
> > > > > > >
> > > > > > >  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> > > > > > > @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> > > > > > >  		goto free_pkt;
> > > > > > >  	}
> > > > > > >  
> > > > > > > +	if (sk->sk_type == SOCK_DGRAM) {
> > > > > > > +		virtio_transport_recv_connected(sk, skb);
> > > > > > > +		goto out;
> > > > > > > +	}
> > > > > > > +
> > > > > > >  	space_available = virtio_transport_space_update(sk, skb);
> > > > > > >  
> > > > > > >  	/* Update CID in case it has changed after a transport reset event */
> > > > > > > @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> > > > > > >  		break;
> > > > > > >  	}
> > > > > > >  
> > > > > > > +out:
> > > > > > >  	release_sock(sk);
> > > > > > >  
> > > > > > >  	/* Release refcnt obtained when we fetched this socket out of the
> > > > > > > --
> > > > > > > 2.35.1
> > > > > > >
> > > > > >
> > >
On Thu, Aug 18, 2022 at 12:28:48PM +0800, Jason Wang wrote:
>
>On 2022/8/17 14:54, Michael S. Tsirkin wrote:
>>On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>>>Hey everybody,
>>>
>>>This series introduces datagrams, packet scheduling, and sk_buff usage
>>>to virtio vsock.
>>>
>>>The usage of struct sk_buff benefits users by a) preparing vsock to use
>>>other related systems that require sk_buff, such as sockmap and qdisc,
>>>b) supporting basic congestion control via sock_alloc_send_skb, and c)
>>>reducing copying when delivering packets to TAP.
>>>
>>>The socket layer no longer forces errors to be -ENOMEM, as typically
>>>userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
>>>messages are being sent with option MSG_DONTWAIT.
>>>
>>>The datagram work is based off previous patches by Jiang Wang[1].
>>>
>>>The introduction of datagrams creates a transport layer fairness issue
>>>where datagrams may freely starve streams of queue access. This happens
>>>because, unlike streams, datagrams lack the transactions necessary for
>>>calculating credits and throttling.
>>>
>>>Previous proposals introduce changes to the spec to add an additional
>>>virtqueue pair for datagrams[1]. Although this solution works, using
>>>Linux's qdisc for packet scheduling leverages already existing systems,
>>>avoids the need to change the virtio specification, and gives additional
>>>capabilities. The usage of SFQ or fq_codel, for example, may solve the
>>>transport layer starvation problem. It is easy to imagine other use
>>>cases as well. For example, services of varying importance may be
>>>assigned different priorities, and qdisc will apply appropriate
>>>priority-based scheduling. By default, the system default pfifo qdisc is
>>>used. The qdisc may be bypassed and legacy queuing is resumed by simply
>>>setting the virtio-vsock%d network device to state DOWN. This technique
>>>still allows vsock to work with zero-configuration.
>>The basic question to answer then is this: with a net device qdisc
>>etc in the picture, how is this different from virtio net then?
>>Why do you still want to use vsock?
>
>
>Or maybe it's time to revisit an old idea[1] to unify at least the
>driver part (e.g. using the virtio-net driver for vsock, so we can have
>all the features that vsock is lacking now)?
Sorry for coming late to the discussion!
This would be great, though the last time I looked at it, I found
it quite complicated. The main problem is trying to avoid all the
net-specific stuff (MTU, ethernet header, HW offloading, etc.).
Maybe we could start thinking about this idea by adding a new transport
to vsock (e.g. virtio-net-vsock) completely separate from what we have
now.
Thanks,
Stefano
On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> In order to support usage of qdisc on vsock traffic, this commit
> introduces a struct net_device to vhost and virtio vsock.
>
> Two new devices are created, vhost-vsock for vhost and virtio-vsock
> for virtio. The devices are attached to the respective transports.
>
> To bypass the usage of the device, the user may "down" the associated
> network interface using common tools. For example, "ip link set dev
> virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> simply using the FIFO logic of the prior implementation.
>
> For both hosts and guests, there is one device for all G2H vsock sockets
> and one device for all H2G vsock sockets. This makes sense for guests
> because the driver only supports a single vsock channel (one pair of
> TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> seem ideal for some workloads. However, it is possible to use a
> multi-queue qdisc, where a given queue is responsible for a range of
> sockets. This seems to be a better solution than having one device per
> socket, which may yield a very large number of devices and qdiscs, all
> of which are dynamically being created and destroyed. Because of this
> dynamism, it would also require a complex policy management daemon, as
> devices would constantly be spun up and down as sockets were created and
> destroyed. To avoid this, one device and qdisc also applies to all H2G
> sockets.
>
> Signed-off-by: Bobby Eshleman <[email protected]>
I've been thinking about this generally. vsock currently
assumes reliability, but with qdisc can't we get
packet drops, e.g. depending on the queueing?
What prevents user from configuring such a discipline?
One thing people like about vsock is that it's very hard
to break H2G communication even with misconfigured
networking.
--
MST
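To make the concern concrete: nothing in the series appears to prevent a lossy discipline from being attached to the proposed device. A hypothetical configuration might look like this (the device name is assumed from the cover letter's "virtio-vsock%d"; the loss rate is arbitrary):

```shell
# Hypothetical: attach a deliberately lossy netem discipline to the
# vsock net_device proposed in this series; vsock traffic scheduled
# through it could then be dropped, violating the reliability that
# vsock users currently assume.
tc qdisc replace dev virtio-vsock0 root netem loss 10%

# Restoring the default pfifo removes the loss:
tc qdisc replace dev virtio-vsock0 root pfifo

# Or, per the cover letter, downing the interface bypasses qdisc
# entirely and falls back to the legacy FIFO path:
ip link set dev virtio-vsock0 down
```

Whether such configurations should be rejected, or simply documented as unsupported, seems like the open question here.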
Hi Bobby,
If you are attending Linux Foundation conferences in Dublin, Ireland
next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
Stefano Garzarella and others to discuss this patch series.
Using netdev and sk_buff is a big change to vsock. Discussing your
requirements and the future direction of vsock in person could help.
If you won't be in Dublin, don't worry. You can schedule a video call if
you feel it would be helpful to discuss these topics.
Stefan
On Tue, Sep 06, 2022 at 06:58:32AM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > In order to support usage of qdisc on vsock traffic, this commit
> > introduces a struct net_device to vhost and virtio vsock.
> >
> > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > for virtio. The devices are attached to the respective transports.
> >
> > To bypass the usage of the device, the user may "down" the associated
> > network interface using common tools. For example, "ip link set dev
> > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > simply using the FIFO logic of the prior implementation.
> >
> > For both hosts and guests, there is one device for all G2H vsock sockets
> > and one device for all H2G vsock sockets. This makes sense for guests
> > because the driver only supports a single vsock channel (one pair of
> > TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> > seem ideal for some workloads. However, it is possible to use a
> > multi-queue qdisc, where a given queue is responsible for a range of
> > sockets. This seems to be a better solution than having one device per
> > socket, which may yield a very large number of devices and qdiscs, all
> > of which are dynamically being created and destroyed. Because of this
> > dynamism, it would also require a complex policy management daemon, as
> > devices would constantly be spun up and down as sockets were created and
> > destroyed. To avoid this, one device and qdisc also applies to all H2G
> > sockets.
> >
> > Signed-off-by: Bobby Eshleman <[email protected]>
>
>
> I've been thinking about this generally. vsock currently
> assumes reliability, but with qdisc can't we get
> packet drops e.g. depending on the queueing?
>
> What prevents user from configuring such a discipline?
> One thing people like about vsock is that it's very hard
> to break H2G communication even with misconfigured
> networking.
>
If the qdisc decides to discard a packet, dev_queue_xmit() returns
NET_XMIT_CN. This v1 lets that happen quietly, but v2 could return an
error to the user (-ENOMEM or maybe -ENOBUFS) when it does, similar to
what vsock currently does when it is unable to enqueue a packet.
The user could still, for example, choose the noop qdisc. Assuming the
v2 change mentioned above, their sendmsg() calls would then return
errors, similar to how choosing the wrong CID yields an error when
connecting a socket.
Best,
Bobby
On Tue, Sep 06, 2022 at 09:26:33AM -0400, Stefan Hajnoczi wrote:
> Hi Bobby,
> If you are attending Linux Foundation conferences in Dublin, Ireland
> next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
> Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
> Stefano Garzarella and others to discuss this patch series.
>
> Using netdev and sk_buff is a big change to vsock. Discussing your
> requirements and the future direction of vsock in person could help.
>
> If you won't be in Dublin, don't worry. You can schedule a video call if
> you feel it would be helpful to discuss these topics.
>
> Stefan
Hey Stefan,
That sounds like a great idea! I was unable to make the Dublin trip work
so I think a video call would be best, of course if okay with everyone.
Thanks,
Bobby
On Thu, Aug 18, 2022 at 02:39:32PM +0000, Bobby Eshleman wrote:
>On Tue, Sep 06, 2022 at 09:26:33AM -0400, Stefan Hajnoczi wrote:
>> Hi Bobby,
>> If you are attending Linux Foundation conferences in Dublin, Ireland
>> next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
>> Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
>> Stefano Garzarella and others to discuss this patch series.
>>
>> Using netdev and sk_buff is a big change to vsock. Discussing your
>> requirements and the future direction of vsock in person could help.
>>
>> If you won't be in Dublin, don't worry. You can schedule a video call if
>> you feel it would be helpful to discuss these topics.
>>
>> Stefan
>
>Hey Stefan,
>
>That sounds like a great idea!
Yep, I agree!
>I was unable to make the Dublin trip work
>so I think a video call would be best, of course if okay with everyone.
It will work for me, but I'll be a bit busy in the next 2 weeks:
From Sep 12 to Sep 14 I'll be at KVM Forum, so it may be difficult to
arrange, but we can try.
Sep 15 I'm not available.
Sep 16 I'm traveling, but early in my morning, so I should be available.
From Sep 19 to Sep 23 I'll be mostly off, but I can try to find some
slots if needed.
From Sep 26 I'm back and fully available.
Let's see if others are available and try to find a slot :-)
Thanks,
Stefano
On Thu, Aug 18, 2022 at 02:39:32PM +0000, Bobby Eshleman wrote:
>On Tue, Sep 06, 2022 at 09:26:33AM -0400, Stefan Hajnoczi wrote:
>> Hi Bobby,
>> If you are attending Linux Foundation conferences in Dublin, Ireland
>> next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
>> Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
>> Stefano Garzarella and others to discuss this patch series.
>>
>> Using netdev and sk_buff is a big change to vsock. Discussing your
>> requirements and the future direction of vsock in person could help.
>>
>> If you won't be in Dublin, don't worry. You can schedule a video call if
>> you feel it would be helpful to discuss these topics.
>>
>> Stefan
>
>Hey Stefan,
>
>That sounds like a great idea! I was unable to make the Dublin trip work
>so I think a video call would be best, of course if okay with everyone.
Looking better at the KVM forum sched, I found 1h slot for Sep 15 at
16:30 UTC.
Could this work for you?
It would be nice to also have HyperV and VMCI people in the call and
anyone else who is interested of course.
@Dexuan @Bryan @Vishnu can you attend?
@MST @Jason @Stefan if you can be there that would be great, we could
connect together from Dublin.
Thanks,
Stefano
Hey Stefano, thanks for sending this out.
On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
>
> Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> UTC.
>
> Could this work for you?
Unfortunately, I can't make this time slot.
My schedule also opens up a lot the week of the 26th, especially between
16:00 and 19:00 UTC, as well as after 22:00 UTC.
Best,
Bobby
On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <[email protected]> wrote:
>
> Hey Stefano, thanks for sending this out.
>
> On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> >
> > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > UTC.
> >
> > Could this work for you?
>
> Unfortunately, I can't make this time slot.
No problem at all!
>
> My schedule also opens up a lot the week of the 26th, especially between
> 16:00 and 19:00 UTC, as well as after 22:00 UTC.
Great, that week works for me too.
What about Sep 27 @ 16:00 UTC?
Thanks,
Stefano
On Mon, Sep 12, 2022 at 08:12:58PM +0200, Stefano Garzarella wrote:
> On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <[email protected]> wrote:
> >
> > Hey Stefano, thanks for sending this out.
> >
> > On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> > >
> > > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > > UTC.
> > >
> > > Could this work for you?
> >
> > Unfortunately, I can't make this time slot.
>
> No problem at all!
>
> >
> > My schedule also opens up a lot the week of the 26th, especially between
> > 16:00 and 19:00 UTC, as well as after 22:00 UTC.
>
> Great, that week works for me too.
> What about Sep 27 @ 16:00 UTC?
>
That time works for me!
Thanks,
Bobby
On Mon, Sep 12, 2022 at 8:28 PM Bobby Eshleman <[email protected]> wrote:
>
> On Mon, Sep 12, 2022 at 08:12:58PM +0200, Stefano Garzarella wrote:
> > On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <[email protected]> wrote:
> > >
> > > Hey Stefano, thanks for sending this out.
> > >
> > > On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> > > >
> > > > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > > > UTC.
> > > >
> > > > Could this work for you?
> > >
> > > Unfortunately, I can't make this time slot.
> >
> > No problem at all!
> >
> > >
> > > My schedule also opens up a lot the week of the 26th, especially between
> > > 16:00 and 19:00 UTC, as well as after 22:00 UTC.
> >
> > Great, that week works for me too.
> > What about Sep 27 @ 16:00 UTC?
> >
>
> That time works for me!
Great! I sent you an invitation.
For others that want to join the discussion, we will meet Sep 27 @
16:00 UTC at this room: https://meet.google.com/fxi-vuzr-jjb
Thanks,
Stefano
On Fri, Sep 16, 2022 at 05:51:22AM +0200, Stefano Garzarella wrote:
> On Mon, Sep 12, 2022 at 8:28 PM Bobby Eshleman <[email protected]> wrote:
> >
> > On Mon, Sep 12, 2022 at 08:12:58PM +0200, Stefano Garzarella wrote:
> > > On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <[email protected]> wrote:
> > > >
> > > > Hey Stefano, thanks for sending this out.
> > > >
> > > > On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> > > > >
> > > > > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > > > > UTC.
> > > > >
> > > > > Could this work for you?
> > > >
> > > > Unfortunately, I can't make this time slot.
> > >
> > > No problem at all!
> > >
> > > >
> > > > My schedule also opens up a lot the week of the 26th, especially between
> > > > 16:00 and 19:00 UTC, as well as after 22:00 UTC.
> > >
> > > Great, that week works for me too.
> > > What about Sep 27 @ 16:00 UTC?
> > >
> >
> > That time works for me!
>
> Great! I sent you an invitation.
>
Awesome, see you then!
Thanks,
Bobby
On Mon, Aug 15, 2022 at 10:56:07AM -0700, Bobby Eshleman wrote:
>This commit adds a feature bit for virtio vsock to support datagrams.
>
>Signed-off-by: Jiang Wang <[email protected]>
>Signed-off-by: Bobby Eshleman <[email protected]>
>---
> drivers/vhost/vsock.c | 3 ++-
> include/uapi/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport.c | 8 ++++++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index b20ddec2664b..a5d1bdb786fe 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -32,7 +32,8 @@
> enum {
> VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
>- (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
>+ (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
>+ (1ULL << VIRTIO_VSOCK_F_DGRAM)
> };
>
> enum {
>diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>index 64738838bee5..857df3a3a70d 100644
>--- a/include/uapi/linux/virtio_vsock.h
>+++ b/include/uapi/linux/virtio_vsock.h
>@@ -40,6 +40,7 @@
>
> /* The feature bitmap for virtio vsock */
> #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
>+#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
We already allocated bit 2 for F_NO_IMPLIED_STREAM , so we should use 3:
https://github.com/oasis-tcs/virtio-spec/blob/26ed30ccb049fd51d6e20aad3de2807d678edb3a/virtio-vsock.tex#L22
(I'll send patches to implement F_STREAM and F_NO_IMPLIED_STREAM
negotiation soon).
As long as it's RFC it's fine to introduce F_DGRAM, but we should first
change virtio-spec before merging this series.
About the patch, we should only negotiate the new feature when we really
have DGRAM support. So, it's better to move this patch after adding
support for datagram.
Thanks,
Stefano
>
> struct virtio_vsock_config {
> __le64 guest_cid;
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index c6212eb38d3c..073314312683 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /*
>forward declaration */
> struct virtio_vsock {
> struct virtio_device *vdev;
> struct virtqueue *vqs[VSOCK_VQ_MAX];
>+ bool has_dgram;
>
> /* Virtqueue processing is deferred to a workqueue */
> struct work_struct tx_work;
>@@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> }
>
> vsock->vdev = vdev;
>-
> vsock->rx_buf_nr = 0;
> vsock->rx_buf_max_nr = 0;
> atomic_set(&vsock->queued_replies, 0);
>@@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> vsock->seqpacket_allow = true;
>
>+ if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
>+ vsock->has_dgram = true;
>+
> vdev->priv = vsock;
>
> ret = virtio_vsock_vqs_init(vsock);
>@@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
> };
>
> static unsigned int features[] = {
>- VIRTIO_VSOCK_F_SEQPACKET
>+ VIRTIO_VSOCK_F_SEQPACKET,
>+ VIRTIO_VSOCK_F_DGRAM
> };
>
> static struct virtio_driver virtio_vsock_driver = {
>--
>2.35.1
>
Hi,
On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>Hey everybody,
>
>This series introduces datagrams, packet scheduling, and sk_buff usage
>to virtio vsock.
Just a reminder for those who are interested, tomorrow Sep 27 @ 16:00
UTC we will discuss more about the next steps for this series in this
room: https://meet.google.com/fxi-vuzr-jjb
(I'll try to record it and take notes that we will share)
Bobby, thank you so much for working on this! It would be great to solve
the fairness issue and support datagram!
I took a look at the series, left some comments in the individual
patches, and added some advice here that we could pick up tomorrow:
- it would be nice to run benchmarks (e.g., iperf-vsock, uperf, etc.) to
see how much the changes cost (e.g. sk_buff use)
- we should take care also of other transports (i.e. vmci, hyperv), the
uAPI should be as close as possible regardless of the transport
About the use of netdev, it seems the most controversial point and I
understand Jakub and Michael's concerns. Tomorrow would be great if you
can update us if you have found any way to avoid it, just reusing a
packet scheduler somehow.
It would be great if we could make it available for all transports (I'm
not asking you to implement it for all, but to have a generic api that
others can use).
But we can talk about that tomorrow!
Thanks,
Stefano
>
>The usage of struct sk_buff benefits users by a) preparing vsock to use
>other related systems that require sk_buff, such as sockmap and qdisc,
>b) supporting basic congestion control via sock_alloc_send_skb, and c)
>reducing copying when delivering packets to TAP.
>
>The socket layer no longer forces errors to be -ENOMEM, as typically
>userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
>messages are being sent with option MSG_DONTWAIT.
>
>The datagram work is based off previous patches by Jiang Wang[1].
>
>The introduction of datagrams creates a transport layer fairness issue
>where datagrams may freely starve streams of queue access. This happens
>because, unlike streams, datagrams lack the transactions necessary for
>calculating credits and throttling.
>
>Previous proposals introduce changes to the spec to add an additional
>virtqueue pair for datagrams[1]. Although this solution works, using
>Linux's qdisc for packet scheduling leverages already existing systems,
>avoids the need to change the virtio specification, and gives additional
>capabilities. The usage of SFQ or fq_codel, for example, may solve the
>transport layer starvation problem. It is easy to imagine other use
>cases as well. For example, services of varying importance may be
>assigned different priorities, and qdisc will apply appropriate
>priority-based scheduling. By default, the system default pfifo qdisc is
>used. The qdisc may be bypassed and legacy queuing is resumed by simply
>setting the virtio-vsock%d network device to state DOWN. This technique
>still allows vsock to work with zero-configuration.
>
>In summary, this series introduces these major changes to vsock:
>
>- virtio vsock supports datagrams
>- virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
> - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
> which applies the throttling threshold sk_sndbuf.
>- The vsock socket layer supports returning errors other than -ENOMEM.
> - This is used to return -EAGAIN when the sk_sndbuf threshold is
> reached.
>- virtio vsock uses a net_device, through which qdisc may be used.
> - qdisc allows scheduling policies to be applied to vsock flows.
> - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
> they may prevent datagrams from flooding out stream flows. The benefit
> of this is that additional virtqueues are not needed for datagrams.
> - The net_device and qdisc is bypassed by simply setting the
> net_device state to DOWN.
>
>[1]: https://lore.kernel.org/all/[email protected]/
>
>Bobby Eshleman (5):
> vsock: replace virtio_vsock_pkt with sk_buff
> vsock: return errors other than -ENOMEM to socket
> vsock: add netdev to vhost/virtio vsock
> virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> virtio/vsock: add support for dgram
>
>Jiang Wang (1):
> vsock_test: add tests for vsock dgram
>
> drivers/vhost/vsock.c | 238 ++++----
> include/linux/virtio_vsock.h | 73 ++-
> include/net/af_vsock.h | 2 +
> include/uapi/linux/virtio_vsock.h | 2 +
> net/vmw_vsock/af_vsock.c | 30 +-
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 237 +++++---
> net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> net/vmw_vsock/vmci_transport.c | 9 +-
> net/vmw_vsock/vsock_loopback.c | 51 +-
> tools/testing/vsock/util.c | 105 ++++
> tools/testing/vsock/util.h | 4 +
> tools/testing/vsock/vsock_test.c | 195 ++++++
> 13 files changed, 1176 insertions(+), 543 deletions(-)
>
>--
>2.35.1
>
On Mon, Sep 26, 2022 at 03:42:19PM +0200, Stefano Garzarella wrote:
> Hi,
>
> On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> > Hey everybody,
> >
> > This series introduces datagrams, packet scheduling, and sk_buff usage
> > to virtio vsock.
>
> Just a reminder for those who are interested, tomorrow Sep 27 @ 16:00 UTC we
> will discuss more about the next steps for this series in this room:
> https://meet.google.com/fxi-vuzr-jjb
> (I'll try to record it and take notes that we will share)
>
> Bobby, thank you so much for working on this! It would be great to solve the
> fairness issue and support datagram!
>
I appreciate that, thanks!
> I took a look at the series, left some comments in the individual patches,
> and add some advice here that we could pick up tomorrow:
> - it would be nice to run benchmarks (e.g., iperf-vsock, uperf, etc.) to
> see how much the changes cost (e.g. sk_buff use)
> - we should take care also of other transports (i.e. vmci, hyperv), the
> uAPI should be as close as possible regardless of the transport
>
Duly noted. I have some measurements with uperf, I'll put the data
together and send that out here.
Regarding the uAPI topic, I'll save that topic for our conversation
tomorrow as I think the netdev topic will weigh on it.
> About the use of netdev, it seems the most controversial point and I
> understand Jakub and Michael's concerns. Tomorrow would be great if you can
> update us if you have found any way to avoid it, just reusing a packet
> scheduler somehow.
> It would be great if we could make it available for all transports (I'm not
> asking you to implement it for all, but to have a generic api that others
> can use).
>
> But we can talk about that tomorrow!
Sounds good, talk to you then!
Best,
Bobby
On Mon, Sep 26, 2022 at 03:17:51PM +0200, Stefano Garzarella wrote:
> On Mon, Aug 15, 2022 at 10:56:07AM -0700, Bobby Eshleman wrote:
> > This commit adds a feature bit for virtio vsock to support datagrams.
> >
> > Signed-off-by: Jiang Wang <[email protected]>
> > Signed-off-by: Bobby Eshleman <[email protected]>
> > ---
> > drivers/vhost/vsock.c | 3 ++-
> > include/uapi/linux/virtio_vsock.h | 1 +
> > net/vmw_vsock/virtio_transport.c | 8 ++++++--
> > 3 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index b20ddec2664b..a5d1bdb786fe 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -32,7 +32,8 @@
> > enum {
> > VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> > (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> > - (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> > + (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> > + (1ULL << VIRTIO_VSOCK_F_DGRAM)
> > };
> >
> > enum {
> > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> > index 64738838bee5..857df3a3a70d 100644
> > --- a/include/uapi/linux/virtio_vsock.h
> > +++ b/include/uapi/linux/virtio_vsock.h
> > @@ -40,6 +40,7 @@
> >
> > /* The feature bitmap for virtio vsock */
> > #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
> > +#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
>
> We already allocated bit 2 for F_NO_IMPLIED_STREAM , so we should use 3:
> https://github.com/oasis-tcs/virtio-spec/blob/26ed30ccb049fd51d6e20aad3de2807d678edb3a/virtio-vsock.tex#L22
> (I'll send patches to implement F_STREAM and F_NO_IMPLIED_STREAM negotiation
> soon).
>
> As long as it's RFC it's fine to introduce F_DGRAM, but we should first
> change virtio-spec before merging this series.
>
> About the patch, we should only negotiate the new feature when we really
> have DGRAM support. So, it's better to move this patch after adding support
> for datagram.
Roger that, I'll reorder that for v2 and also clarify the series by
prefixing it with RFC.
Before removing "RFC" from the series, I'll be sure to send out
virtio-spec patches first.
Thanks,
Bobby
>
> Thanks,
> Stefano
>
> >
> > struct virtio_vsock_config {
> > __le64 guest_cid;
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index c6212eb38d3c..073314312683 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /*
> > forward declaration */
> > struct virtio_vsock {
> > struct virtio_device *vdev;
> > struct virtqueue *vqs[VSOCK_VQ_MAX];
> > + bool has_dgram;
> >
> > /* Virtqueue processing is deferred to a workqueue */
> > struct work_struct tx_work;
> > @@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> > }
> >
> > vsock->vdev = vdev;
> > -
> > vsock->rx_buf_nr = 0;
> > vsock->rx_buf_max_nr = 0;
> > atomic_set(&vsock->queued_replies, 0);
> > @@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> > if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> > vsock->seqpacket_allow = true;
> >
> > + if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> > + vsock->has_dgram = true;
> > +
> > vdev->priv = vsock;
> >
> > ret = virtio_vsock_vqs_init(vsock);
> > @@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
> > };
> >
> > static unsigned int features[] = {
> > - VIRTIO_VSOCK_F_SEQPACKET
> > + VIRTIO_VSOCK_F_SEQPACKET,
> > + VIRTIO_VSOCK_F_DGRAM
> > };
> >
> > static struct virtio_driver virtio_vsock_driver = {
> > --
> > 2.35.1
> >
>
> _______________________________________________
> Virtualization mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
On Mon, Sep 26, 2022 at 03:42:19PM +0200, Stefano Garzarella wrote:
>Hi,
>
>On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>>Hey everybody,
>>
>>This series introduces datagrams, packet scheduling, and sk_buff usage
>>to virtio vsock.
>
>Just a reminder for those who are interested, tomorrow Sep 27 @ 16:00
>UTC we will discuss more about the next steps for this series in this
>room: https://meet.google.com/fxi-vuzr-jjb
>(I'll try to record it and take notes that we will share)
>
Thank you all for participating in the call!
I'm attaching the video/audio recording and notes (feel free to update them).
Notes:
https://docs.google.com/document/d/14UHH0tEaBKfElLZjNkyKUs_HnOgHhZZBqIS86VEIqR0/edit?usp=sharing
Video recording:
https://drive.google.com/file/d/1vUvTc_aiE1mB30tLPeJjANnb915-CIKa/view?usp=sharing
Thanks,
Stefano