# Motivition
Vsock is a lightweight and widely used data exchange mechanism between host
and guest. Kata Containers, a secure container runtime, leverages the
capability to exchange control data between the shim and the kata-agent.
The Linux kernel only supports one vsock device for virtio-vsock transport,
resulting in the following limitations:
* Poor performance isolation: All vsock connections share the same
virtqueue.
* Cannot enable more than one backend: Virtio-vsock, vhost-vsock, and
vhost-user-vsock cannot be enabled simultaneously on the transport.
We’d like to transfer networking data, such as TSI (Transparent Socket
Impersonation), over vsock via the vhost-user protocol to reduce overhead.
However, by default, the vsock device is occupied by the kata-agent.
# Usages
Principle: **Supporting virtio-vsock multi-devices while also being
compatible with existing ones.**
## Connection from Guest to Host
There are two valuable questions to take about:
1. How to be compatible with the existing usages?
2. How do we specify a virtio-vsock device?
### Question 1
Before we delve into question 1, I'd like to provide a piece of pseudocode
as an example of one of the existing use cases from the guest's
perspective.
Assuming there is one virtio-vsock device with CID 4. One of existing
usages to connect to host is shown as following.
```
fd = socket(AF_VSOCK);
connect(fd, 2, 1234);
n = write(fd, buffer);
```
The result is that a connection is established from the guest (4, ?) to the
host (2, 1234), where "?" denotes a random port.
In the context of multi-devices, there are more than two devices. If the
users don’t specify one CID explicitly, the kernel becomes confused about
which device to use. The new implementation should be compatible with the
old one.
We expanded the virtio-vsock specification to address this issue. The
specification now includes a new field called "order".
```
struct virtio_vsock_config {
__le64 guest_cid;
__le64 order;
} _attribute_((packed));
```
In the phase of virtio-vsock driver probing, the guest kernel reads from
VMM to get the order of each device. **We stipulate that the device with the
smallest order is regarded as the default device**(this mechanism functions
as a 'default gateway' in networking).
Assuming there are three virtio-vsock devices: device1 (CID=3), device2
(CID=4), and device3 (CID=5). The arrangement of the list is as follows
from the perspective of the guest kernel:
```
virtio_vsock_list =
virtio_vsock { cid: 4, order: 0 } -> virtio_vsock { cid: 3, order: 1 } -> virtio_vsock { cid: 5, order: 10 }
```
At this time, the guest kernel realizes that the device2 (CID=4) is the
default device. Execute the same code as before.
```
fd = socket(AF_VSOCK);
connect(fd, 2, 1234);
n = write(fd, buffer);
```
A connection will be established from the guest (4, ?) to the host (2, 1234).
### Question 2
Now, the user wants to specify a device instead of the default one. An
explicit binding operation is required to be performed.
Use the device (CID=3), where “-1” represents any port, the kernel will
search an available port automatically.
```
fd = socket(AF_VSOCK);
bind(fd, 3, -1);
connect(fd, 2, 1234);
n = write(fd, buffer);
```
Use the device (CID=4).
```
fd = socket(AF_VSOCK);
bind(fd, 4, -1);
connect(fd, 2, 1234);
n = write(fd, buffer);
```
## Connection from Host to Guest
Connection from host to guest is quite similar to the existing usages. The
device’s CID is specified by the bind operation.
Listen at the device (CID=3)’s port 10000.
```
fd = socket(AF_VSOCK);
bind(fd, 3, 10000);
listen(fd);
new_fd = accept(fd, &host_cid, &host_port);
n = write(fd, buffer);
```
Listen at the device (CID=4)’s port 10000.
```
fd = socket(AF_VSOCK);
bind(fd, 4, 10000);
listen(fd);
new_fd = accept(fd, &host_cid, &host_port);
n = write(fd, buffer);
```
# Use Cases
We've completed a POC with Kata Containers, Ztunnel, which is a
purpose-built per-node proxy for Istio ambient mesh, and TSI. Please refer
to the following link for more details.
Link: https://bit.ly/4bdPJbU
Xuewei Niu (5):
vsock/virtio: Extend virtio-vsock spec with an "order" field
vsock/virtio: Add support for multi-devices
vsock/virtio: can_msgzerocopy adapts to multi-devices
vsock: seqpacket_allow adapts to multi-devices
vsock: Add an ioctl request to get all CIDs
include/linux/virtio_vsock.h | 2 +-
include/net/af_vsock.h | 25 ++-
include/uapi/linux/virtio_vsock.h | 1 +
include/uapi/linux/vm_sockets.h | 14 ++
net/vmw_vsock/af_vsock.c | 116 +++++++++--
net/vmw_vsock/virtio_transport.c | 255 ++++++++++++++++++------
net/vmw_vsock/virtio_transport_common.c | 16 +-
net/vmw_vsock/vsock_loopback.c | 4 +-
8 files changed, 352 insertions(+), 81 deletions(-)
--
2.34.1
Adds a new argument, named "cid", to let them know which `virtio_vsock` to
be selected.
Signed-off-by: Xuewei Niu <[email protected]>
---
include/linux/virtio_vsock.h | 2 +-
net/vmw_vsock/virtio_transport.c | 5 ++---
net/vmw_vsock/virtio_transport_common.c | 6 +++---
3 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..21bfd5e0c2e7 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -168,7 +168,7 @@ struct virtio_transport {
* extra checks and can perform zerocopy transmission by
* default.
*/
- bool (*can_msgzerocopy)(int bufs_num);
+ bool (*can_msgzerocopy)(u32 cid, int bufs_num);
};
ssize_t
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 93d25aeafb83..998b22e5ce36 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -521,14 +521,13 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}
-static bool virtio_transport_can_msgzerocopy(int bufs_num)
+static bool virtio_transport_can_msgzerocopy(u32 cid, int bufs_num)
{
struct virtio_vsock *vsock;
bool res = false;
rcu_read_lock();
-
- vsock = rcu_dereference(the_virtio_vsock);
+ vsock = virtio_transport_get_virtio_vsock(cid);
if (vsock) {
struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index bed75a41419e..e7315d7b9af1 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -39,7 +39,7 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
struct virtio_vsock_pkt_info *info,
- size_t pkt_len)
+ size_t pkt_len, unsigned int cid)
{
struct iov_iter *iov_iter;
@@ -62,7 +62,7 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
/* +1 is for packet header. */
- return t_ops->can_msgzerocopy(pages_to_send + 1);
+ return t_ops->can_msgzerocopy(cid, pages_to_send + 1);
}
return true;
@@ -375,7 +375,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
info->msg->msg_flags &= ~MSG_ZEROCOPY;
if (info->msg->msg_flags & MSG_ZEROCOPY)
- can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+ can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len, src_cid);
if (can_zcopy)
max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
--
2.34.1
On Fri, May 17, 2024 at 10:46:05PM GMT, Xuewei Niu wrote:
>Adds a new argument, named "cid", to let them know which `virtio_vsock` to
>be selected.
>
>Signed-off-by: Xuewei Niu <[email protected]>
>---
> include/linux/virtio_vsock.h | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 ++---
> net/vmw_vsock/virtio_transport_common.c | 6 +++---
> 3 files changed, 6 insertions(+), 7 deletions(-)
Every commit in linux must be working to support bisection. So these
changes should be made before adding multi-device support.
>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index c82089dee0c8..21bfd5e0c2e7 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -168,7 +168,7 @@ struct virtio_transport {
> * extra checks and can perform zerocopy transmission by
> * default.
> */
>- bool (*can_msgzerocopy)(int bufs_num);
>+ bool (*can_msgzerocopy)(u32 cid, int bufs_num);
> };
>
> ssize_t
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index 93d25aeafb83..998b22e5ce36 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -521,14 +521,13 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> }
>
>-static bool virtio_transport_can_msgzerocopy(int bufs_num)
>+static bool virtio_transport_can_msgzerocopy(u32 cid, int bufs_num)
> {
> struct virtio_vsock *vsock;
> bool res = false;
>
> rcu_read_lock();
>-
>- vsock = rcu_dereference(the_virtio_vsock);
>+ vsock = virtio_transport_get_virtio_vsock(cid);
> if (vsock) {
> struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index bed75a41419e..e7315d7b9af1 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -39,7 +39,7 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>
> static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
> struct virtio_vsock_pkt_info *info,
>- size_t pkt_len)
>+ size_t pkt_len, unsigned int cid)
> {
> struct iov_iter *iov_iter;
>
>@@ -62,7 +62,7 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
> int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
>
> /* +1 is for packet header. */
>- return t_ops->can_msgzerocopy(pages_to_send + 1);
>+ return t_ops->can_msgzerocopy(cid, pages_to_send + 1);
> }
>
> return true;
>@@ -375,7 +375,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> info->msg->msg_flags &= ~MSG_ZEROCOPY;
>
> if (info->msg->msg_flags & MSG_ZEROCOPY)
>- can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>+ can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len, src_cid);
>
> if (can_zcopy)
> max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>--
>2.34.1
>
Hi,
thanks for this RFC!
On Fri, May 17, 2024 at 10:46:02PM GMT, Xuewei Niu wrote:
># Motivition
>
>Vsock is a lightweight and widely used data exchange mechanism between host
>and guest. Kata Containers, a secure container runtime, leverages the
>capability to exchange control data between the shim and the kata-agent.
>
>The Linux kernel only supports one vsock device for virtio-vsock transport,
>resulting in the following limitations:
>
>* Poor performance isolation: All vsock connections share the same
>virtqueue.
This might be fixed if we implement multi-queue in virtio-vsock.
>* Cannot enable more than one backend: Virtio-vsock, vhost-vsock, and
>vhost-user-vsock cannot be enabled simultaneously on the transport.
>
>We’d like to transfer networking data, such as TSI (Transparent Socket
>Impersonation), over vsock via the vhost-user protocol to reduce overhead.
>However, by default, the vsock device is occupied by the kata-agent.
>
># Usages
>
>Principle: **Supporting virtio-vsock multi-devices while also being
>compatible with existing ones.**
>
>## Connection from Guest to Host
>
>There are two valuable questions to take about:
>
>1. How to be compatible with the existing usages?
>2. How do we specify a virtio-vsock device?
>
>### Question 1
>
>Before we delve into question 1, I'd like to provide a piece of pseudocode
>as an example of one of the existing use cases from the guest's
>perspective.
>
>Assuming there is one virtio-vsock device with CID 4. One of existing
>usages to connect to host is shown as following.
>
>```
>fd = socket(AF_VSOCK);
>connect(fd, 2, 1234);
>n = write(fd, buffer);
>```
>
>The result is that a connection is established from the guest (4, ?) to the
>host (2, 1234), where "?" denotes a random port.
>
>In the context of multi-devices, there are more than two devices. If the
>users don’t specify one CID explicitly, the kernel becomes confused about
>which device to use. The new implementation should be compatible with the
>old one.
>
>We expanded the virtio-vsock specification to address this issue. The
>specification now includes a new field called "order".
>
>```
>struct virtio_vsock_config {
> __le64 guest_cid;
> __le64 order;
>} _attribute_((packed));
>```
>
>In the phase of virtio-vsock driver probing, the guest kernel reads
>from
>VMM to get the order of each device. **We stipulate that the device with the
>smallest order is regarded as the default device**(this mechanism functions
>as a 'default gateway' in networking).
>
>Assuming there are three virtio-vsock devices: device1 (CID=3), device2
>(CID=4), and device3 (CID=5). The arrangement of the list is as follows
>from the perspective of the guest kernel:
>
>```
>virtio_vsock_list =
>virtio_vsock { cid: 4, order: 0 } -> virtio_vsock { cid: 3, order: 1 } -> virtio_vsock { cid: 5, order: 10 }
>```
>
>At this time, the guest kernel realizes that the device2 (CID=4) is the
>default device. Execute the same code as before.
>
>```
>fd = socket(AF_VSOCK);
>connect(fd, 2, 1234);
>n = write(fd, buffer);
>```
>
>A connection will be established from the guest (4, ?) to the host (2, 1234).
It seems that only the one with order 0 is used here though, so what is
the ordering for?
Wouldn't it suffice to simply indicate the default device (e.g., like
the default gateway for networking)?
>
>### Question 2
>
>Now, the user wants to specify a device instead of the default one. An
>explicit binding operation is required to be performed.
>
>Use the device (CID=3), where “-1” represents any port, the kernel will
We have a macro: VMADDR_PORT_ANY (which is -1)
>search an available port automatically.
>
>```
>fd = socket(AF_VSOCK);
>bind(fd, 3, -1);
>connect(fd, 2, 1234);)
>n = write(fd, buffer);
>```
>
>Use the device (CID=4).
>
>```
>fd = socket(AF_VSOCK);
>bind(fd, 4, -1);
>connect(fd, 2, 1234);
>n = write(fd, buffer);
>```
>
>## Connection from Host to Guest
>
>Connection from host to guest is quite similar to the existing usages. The
>device’s CID is specified by the bind operation.
>
>Listen at the device (CID=3)’s port 10000.
>
>```
>fd = socket(AF_VSOCK);
>bind(fd, 3, 10000);
>listen(fd);
>new_fd = accept(fd, &host_cid, &host_port);
>n = write(fd, buffer);
>```
>
>Listen at the device (CID=4)’s port 10000.
>
>```
>fd = socket(AF_VSOCK);
>bind(fd, 4, 10000);
>listen(fd);
>new_fd = accept(fd, &host_cid, &host_port);
>n = write(fd, buffer);
>```
>
># Use Cases
>
>We've completed a POC with Kata Containers, Ztunnel, which is a
>purpose-built per-node proxy for Istio ambient mesh, and TSI. Please refer
>to the following link for more details.
>
>Link: https://bit.ly/4bdPJbU
Thank you for this RFC, I left several comments in the patches, we still
have some work to do, but I think it is something we can support :-)
Here I summarize the things that I think we need to fix:
1. Avoid adding transport-specific things in af_vsock.c
We need to have a generic API to allow other transports to implement
the same functionality.
2. We need to add negotiation of a new feature in virtio/vhost transports
We need to enable or disable support depending on whether the
feature is negotiated, since guest and host may not support it.
3. Re-work the patch order for bisectability (more detail on patches 3/4)
4. Do we really need the order or just a default device?
5. Check if we can add some tests in tools/testing/vsock
6. When we agree on the RFC, we should discuss the spec changes in the
virtio ML before sending a non-RFC series on Linux
These are the main things, but I left other comments in the patches.
Thanks,
Stefano
On Fri, May 17, 2024 at 10:46:02PM +0800, Xuewei Niu wrote:
> include/linux/virtio_vsock.h | 2 +-
> include/net/af_vsock.h | 25 ++-
> include/uapi/linux/virtio_vsock.h | 1 +
> include/uapi/linux/vm_sockets.h | 14 ++
> net/vmw_vsock/af_vsock.c | 116 +++++++++--
> net/vmw_vsock/virtio_transport.c | 255 ++++++++++++++++++------
> net/vmw_vsock/virtio_transport_common.c | 16 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 8 files changed, 352 insertions(+), 81 deletions(-)
As any change to virtio device/driver interface, this has to
go through the virtio TC. Please subscribe at
[email protected] and then
contact the TC at [email protected]
You will likely eventually need to write a spec draft document, too.
--
MST
Hi, MST!
> > include/linux/virtio_vsock.h | 2 +-
> > include/net/af_vsock.h | 25 ++-
> > include/uapi/linux/virtio_vsock.h | 1 +
> > include/uapi/linux/vm_sockets.h | 14 ++
> > net/vmw_vsock/af_vsock.c | 116 +++++++++--
> > net/vmw_vsock/virtio_transport.c | 255 ++++++++++++++++++------
> > net/vmw_vsock/virtio_transport_common.c | 16 +-
> > net/vmw_vsock/vsock_loopback.c | 4 +-
> > 8 files changed, 352 insertions(+), 81 deletions(-)
>
> As any change to virtio device/driver interface, this has to
> go through the virtio TC. Please subscribe at
> [email protected] and then
> contact the TC at [email protected]
>
> You will likely eventually need to write a spec draft document, too.
Thanks for your reply. I've sent a series of RFC documents for the spec
changes to virtio TC [1].
[1]: https://lore.kernel.org/virtio-comment/[email protected]/
Thanks
Xuewei