Hey all!
This series introduces support for datagrams to virtio/vsock.
It is a spin-off (and smaller version) of this series from the summer:
https://lore.kernel.org/all/[email protected]/
Please note that this is an RFC and should not be merged until
associated changes are made to the virtio specification, which will
follow after discussion from this series.
This series first supports datagrams in a basic form for virtio, and
then optimizes the sendpath for all transports.
The result is a very fast datagram communication protocol that
outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
of multi-threaded workload samples.
For those that are curious, some summary data comparing UDP and VSOCK
DGRAM (N=5):
vCPUS: 16
virtio-net queues: 16
payload size: 4KB
Setup: bare metal + vm (non-nested)
UDP: 287.59 MB/s
VSOCK DGRAM: 509.2 MB/s
Some notes about the implementation...
This datagram implementation forces datagrams to self-throttle according
to the threshold set by sk_sndbuf. It behaves similar to the credits
used by streams in its effect on throughput and memory consumption, but
it is not influenced by the receiving socket as credits are.
The device drops packets silently. There is room for improvement by
building into the device and driver some intelligence around how to
reduce frequency of kicking the virtqueue when packet loss is high. I
think there is a good discussion to be had on this.
In this series I am also proposing that fairness be reexamined as an
issue separate from datagrams, which differs from my previous series
that coupled these issues. After further testing and reflection on the
design, I do not believe that these need to be coupled and I do not
believe this implementation introduces additional unfairness or
exacerbates pre-existing unfairness.
I attempted to characterize vsock fairness by using a pool of processes
to stress test the shared resources while measuring the performance of a
lone stream socket. Given unfair preference for datagrams, we would
assume that a lone stream socket would degrade much more when a pool of
datagram sockets was stressing the system than when a pool of stream
sockets are stressing the system. The result, however, showed no
significant difference between the degradation of throughput of the lone
stream socket when using a pool of datagrams to stress the queue over
using a pool of streams. The absolute difference in throughput actually
favored datagrams as interfering least as the mean difference was +16%
compared to using streams to stress test (N=7), but it was not
statistically significant. Workloads were matched for payload size and
buffer size (to approximate memory consumption) and process count, and
stress workloads were configured to start before and last long after the
lifetime of the "lone" stream socket flow to ensure that competing flows
were continuously hot.
Given the above data, I propose that vsock fairness be addressed
independent of datagrams and to defer its implementation to a future
series.
Signed-off-by: Bobby Eshleman <[email protected]>
---
Bobby Eshleman (3):
virtio/vsock: support dgram
virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
vsock: Add lockless sendmsg() support
Jiang Wang (1):
tests: add vsock dgram tests
drivers/vhost/vsock.c | 17 +-
include/net/af_vsock.h | 20 ++-
include/uapi/linux/virtio_vsock.h | 2 +
net/vmw_vsock/af_vsock.c | 287 ++++++++++++++++++++++++++++----
net/vmw_vsock/diag.c | 10 +-
net/vmw_vsock/hyperv_transport.c | 15 +-
net/vmw_vsock/virtio_transport.c | 10 +-
net/vmw_vsock/virtio_transport_common.c | 221 ++++++++++++++++++++----
net/vmw_vsock/vmci_transport.c | 70 ++++++--
tools/testing/vsock/util.c | 105 ++++++++++++
tools/testing/vsock/util.h | 4 +
tools/testing/vsock/vsock_test.c | 193 +++++++++++++++++++++
12 files changed, 859 insertions(+), 95 deletions(-)
---
base-commit: ed72bd5a6790a0c3747cb32b0427f921bd03bb71
change-id: 20230413-b4-vsock-dgram-3b6eba6a64e5
Best regards,
--
Bobby Eshleman <[email protected]>
This commit adds a feature bit for virtio vsock to support datagrams.
This commit should not be applied without first applying the commit
that implements datagrams for virtio.
Signed-off-by: Jiang Wang <[email protected]>
Signed-off-by: Bobby Eshleman <[email protected]>
---
drivers/vhost/vsock.c | 3 ++-
include/uapi/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport.c | 8 ++++++--
3 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index dff6ee1c479b..028cf079225e 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -32,7 +32,8 @@
enum {
VHOST_VSOCK_FEATURES = VHOST_FEATURES |
(1ULL << VIRTIO_F_ACCESS_PLATFORM) |
- (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
+ (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
+ (1ULL << VIRTIO_VSOCK_F_DGRAM)
};
enum {
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 331be28b1d30..0975b9c88292 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -40,6 +40,7 @@
/* The feature bitmap for virtio vsock */
#define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
+#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
struct virtio_vsock_config {
__le64 guest_cid;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 582c6c0f788f..bb43eea9a6f9 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -29,6 +29,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
struct virtio_vsock {
struct virtio_device *vdev;
struct virtqueue *vqs[VSOCK_VQ_MAX];
+ bool has_dgram;
/* Virtqueue processing is deferred to a workqueue */
struct work_struct tx_work;
@@ -640,7 +641,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
}
vsock->vdev = vdev;
-
vsock->rx_buf_nr = 0;
vsock->rx_buf_max_nr = 0;
atomic_set(&vsock->queued_replies, 0);
@@ -657,6 +657,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
vsock->seqpacket_allow = true;
+ if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
+ vsock->has_dgram = true;
+
vdev->priv = vsock;
ret = virtio_vsock_vqs_init(vsock);
@@ -749,7 +752,8 @@ static struct virtio_device_id id_table[] = {
};
static unsigned int features[] = {
- VIRTIO_VSOCK_F_SEQPACKET
+ VIRTIO_VSOCK_F_SEQPACKET,
+ VIRTIO_VSOCK_F_DGRAM
};
static struct virtio_driver virtio_vsock_driver = {
--
2.30.2
Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
it does not scale when many senders share a socket.
Prior to this patch the socket lock is used to protect the local_addr,
remote_addr, transport, and buffer size variables. What follows are the
new protection schemes for the various protected fields that ensure a
race-free multi-sender sendmsg() path for vsock dgrams.
- local_addr
local_addr changes as a result of binding a socket. The write path
for local_addr is bind() and various vsock_auto_bind() call sites.
After a socket has been bound via vsock_auto_bind() or bind(), subsequent
calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
rejects the user request and vsock_auto_bind() early exits.
Therefore, the local addr can not change while a parallel thread is
in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
Change: only acquire lock for auto-binding as-needed in sendmsg().
- vsk->transport
Updated upon socket creation and it doesn't change again until the
socket is destroyed, which only happens after the socket refcnt reaches
zero. This prevents any sendmsg() call from being entered because the
sockfd lookup fails beforehand. That is, sendmsg() and vsk->transport
writes cannot execute in parallel. Additionally, connect() doesn't
update vsk->transport for dgrams as it does for streams. Therefore
vsk->transport is also safe to access lock-free in the sendmsg() path.
No change.
- buffer size variables
Not used by dgram, so they do not need protection. No change.
- remote_addr
Needs additional protection because before this patch the
remote_addr (consisting of several fields such as cid, port, and flags)
only changed atomically under socket lock context. By acquiring the
socket lock to read the structure, the changes made by connect() were
always made visible to sendmsg() atomically. Consequently, to retain
atomicity of updates but offer lock-free access, this patch
redesigns this field as an RCU-protected pointer.
Writers are still synchronized using the socket lock, but readers
only read inside RCU read-side critical sections.
Helpers are introduced for accessing and updating the new pointer.
The remote_addr structure is wrapped together with an rcu_head into a
sockaddr_vm_rcu structure so that kfree_rcu() can be used. This removes
the need of writers to use synchronize_rcu() after freeing old structures
which is simply more efficient and reduces code churn where remote_addr
is already being updated inside read-side sections.
Only virtio has been tested, but updates were necessary to the VMCI and
hyperv code. Unfortunately the author does not have access to
VMCI/hyperv systems so those changes are untested.
Perf Tests
vCPUS: 16
Threads: 16
Payload: 4KB
Test Runs: 5
Type: SOCK_DGRAM
Before: 245.2 MB/s
After: 509.2 MB/s (+107%)
Notably, on the same test system, vsock dgram even outperforms
multi-threaded UDP over virtio-net with vhost and MQ support enabled.
Throughput metrics for single-threaded SOCK_DGRAM and
single/multi-threaded SOCK_STREAM showed no statistically signficant
throughput changes (lowest p-value reaching 0.27), with the range of the
mean difference ranging between -5% to +1%.
Signed-off-by: Bobby Eshleman <[email protected]>
---
drivers/vhost/vsock.c | 12 +-
include/net/af_vsock.h | 19 ++-
net/vmw_vsock/af_vsock.c | 261 ++++++++++++++++++++++++++++----
net/vmw_vsock/diag.c | 10 +-
net/vmw_vsock/hyperv_transport.c | 15 +-
net/vmw_vsock/virtio_transport_common.c | 22 ++-
net/vmw_vsock/vmci_transport.c | 70 ++++++---
7 files changed, 344 insertions(+), 65 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 028cf079225e..da105cb856ac 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -296,13 +296,17 @@ static int
vhost_transport_cancel_pkt(struct vsock_sock *vsk)
{
struct vhost_vsock *vsock;
+ unsigned int cid;
int cnt = 0;
int ret = -ENODEV;
rcu_read_lock();
+ ret = vsock_remote_addr_cid(vsk, &cid);
+ if (ret < 0)
+ goto out;
/* Find the vhost_vsock according to guest context id */
- vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+ vsock = vhost_vsock_get(cid);
if (!vsock)
goto out;
@@ -686,6 +690,10 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
static void vhost_vsock_reset_orphans(struct sock *sk)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ unsigned int cid;
+
+ if (vsock_remote_addr_cid(vsk, &cid) < 0)
+ return;
/* vmci_transport.c doesn't take sk_lock here either. At least we're
* under vsock_table_lock so the sock cannot disappear while we're
@@ -693,7 +701,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
*/
/* If the peer is still valid, no need to reset connection */
- if (vhost_vsock_get(vsk->remote_addr.svm_cid))
+ if (vhost_vsock_get(cid))
return;
/* If the close timeout is pending, let it expire. This avoids races
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 57af28fede19..c02fd6ad0047 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -25,12 +25,17 @@ extern spinlock_t vsock_table_lock;
#define vsock_sk(__sk) ((struct vsock_sock *)__sk)
#define sk_vsock(__vsk) (&(__vsk)->sk)
+struct sockaddr_vm_rcu {
+ struct sockaddr_vm addr;
+ struct rcu_head rcu;
+};
+
struct vsock_sock {
/* sk must be the first member. */
struct sock sk;
const struct vsock_transport *transport;
struct sockaddr_vm local_addr;
- struct sockaddr_vm remote_addr;
+ struct sockaddr_vm_rcu * __rcu remote_addr;
/* Links for the global tables of bound and connected sockets. */
struct list_head bound_table;
struct list_head connected_table;
@@ -206,7 +211,7 @@ void vsock_release_pending(struct sock *pending);
void vsock_add_pending(struct sock *listener, struct sock *pending);
void vsock_remove_pending(struct sock *listener, struct sock *pending);
void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
-void vsock_insert_connected(struct vsock_sock *vsk);
+int vsock_insert_connected(struct vsock_sock *vsk);
void vsock_remove_bound(struct vsock_sock *vsk);
void vsock_remove_connected(struct vsock_sock *vsk);
struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
@@ -244,4 +249,14 @@ static inline void __init vsock_bpf_build_proto(void)
{}
#endif
+/* RCU-protected remote addr helpers */
+int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid);
+int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port);
+int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
+ unsigned int *port);
+int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest);
+bool vsock_remote_addr_bound(struct vsock_sock *vsk);
+bool vsock_remote_addr_equals(struct vsock_sock *vsk, struct sockaddr_vm *other);
+int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port);
+
#endif /* __AF_VSOCK_H__ */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 46b3f35e3adc..93b4abbf20b4 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -145,6 +145,139 @@ static const struct vsock_transport *transport_local;
static DEFINE_MUTEX(vsock_register_mutex);
/**** UTILS ****/
+bool vsock_remote_addr_bound(struct vsock_sock *vsk)
+{
+ struct sockaddr_vm_rcu *remote_addr;
+ bool ret;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (!remote_addr) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ ret = vsock_addr_bound(&remote_addr->addr);
+ rcu_read_unlock();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_bound);
+
+int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest)
+{
+ struct sockaddr_vm_rcu *remote_addr;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (!remote_addr) {
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+ memcpy(dest, &remote_addr->addr, sizeof(*dest));
+ rcu_read_unlock();
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_copy);
+
+int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid)
+{
+ return vsock_remote_addr_cid_port(vsk, cid, NULL);
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_cid);
+
+int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port)
+{
+ return vsock_remote_addr_cid_port(vsk, NULL, port);
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_port);
+
+int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
+ unsigned int *port)
+{
+ struct sockaddr_vm_rcu *remote_addr;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (!remote_addr) {
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+
+ if (cid)
+ *cid = remote_addr->addr.svm_cid;
+ if (port)
+ *port = remote_addr->addr.svm_port;
+
+ rcu_read_unlock();
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_cid_port);
+
+/* The socket lock must be held by the caller */
+int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port)
+{
+ struct sockaddr_vm_rcu *old, *new;
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ rcu_read_lock();
+ old = rcu_dereference(vsk->remote_addr);
+ if (!old) {
+ kfree(new);
+ return -EINVAL;
+ }
+ memcpy(&new->addr, &old->addr, sizeof(new->addr));
+ rcu_read_unlock();
+
+ new->addr.svm_cid = cid;
+ new->addr.svm_port = port;
+
+ old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
+ kfree_rcu(old, rcu);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_update_cid_port);
+
+/* The socket lock must be held by the caller */
+int vsock_remote_addr_update(struct vsock_sock *vsk, struct sockaddr_vm *src)
+{
+ struct sockaddr_vm_rcu *old, *new;
+
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ memcpy(&new->addr, src, sizeof(new->addr));
+ old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
+ kfree_rcu(old, rcu);
+
+ return 0;
+}
+
+bool vsock_remote_addr_equals(struct vsock_sock *vsk,
+ struct sockaddr_vm *other)
+{
+ struct sockaddr_vm_rcu *remote_addr;
+ bool equals;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (!remote_addr) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ equals = vsock_addr_equals_addr(&remote_addr->addr, other);
+ rcu_read_unlock();
+
+ return equals;
+}
+EXPORT_SYMBOL_GPL(vsock_remote_addr_equals);
/* Each bound VSocket is stored in the bind hash table and each connected
* VSocket is stored in the connected hash table.
@@ -254,10 +387,16 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
connected_table) {
- if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
+ struct sockaddr_vm_rcu *remote_addr;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (vsock_addr_equals_addr(src, &remote_addr->addr) &&
dst->svm_port == vsk->local_addr.svm_port) {
+ rcu_read_unlock();
return sk_vsock(vsk);
}
+ rcu_read_unlock();
}
return NULL;
@@ -270,14 +409,25 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)
spin_unlock_bh(&vsock_table_lock);
}
-void vsock_insert_connected(struct vsock_sock *vsk)
+int vsock_insert_connected(struct vsock_sock *vsk)
{
- struct list_head *list = vsock_connected_sockets(
- &vsk->remote_addr, &vsk->local_addr);
+ struct list_head *list;
+ struct sockaddr_vm_rcu *remote_addr;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (!remote_addr) {
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+ list = vsock_connected_sockets(&remote_addr->addr, &vsk->local_addr);
+ rcu_read_unlock();
spin_lock_bh(&vsock_table_lock);
__vsock_insert_connected(list, vsk);
spin_unlock_bh(&vsock_table_lock);
+
+ return 0;
}
EXPORT_SYMBOL_GPL(vsock_insert_connected);
@@ -438,10 +588,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
{
const struct vsock_transport *new_transport;
struct sock *sk = sk_vsock(vsk);
- unsigned int remote_cid = vsk->remote_addr.svm_cid;
+ struct sockaddr_vm remote_addr;
+ unsigned int remote_cid;
__u8 remote_flags;
int ret;
+ ret = vsock_remote_addr_copy(vsk, &remote_addr);
+ if (ret < 0)
+ return ret;
+
+ remote_cid = remote_addr.svm_cid;
+
/* If the packet is coming with the source and destination CIDs higher
* than VMADDR_CID_HOST, then a vsock channel where all the packets are
* forwarded to the host should be established. Then the host will
@@ -451,10 +608,15 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
* the connect path the flag can be set by the user space application.
*/
if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST &&
- vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
- vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST;
+ remote_addr.svm_cid > VMADDR_CID_HOST) {
+ remote_addr.svm_flags |= VMADDR_CID_HOST;
+
+ ret = vsock_remote_addr_update(vsk, &remote_addr);
+ if (ret < 0)
+ return ret;
+ }
- remote_flags = vsk->remote_addr.svm_flags;
+ remote_flags = remote_addr.svm_flags;
switch (sk->sk_type) {
case SOCK_DGRAM:
@@ -742,6 +904,7 @@ static struct sock *__vsock_create(struct net *net,
unsigned short type,
int kern)
{
+ struct sockaddr_vm *remote_addr;
struct sock *sk;
struct vsock_sock *psk;
struct vsock_sock *vsk;
@@ -761,7 +924,14 @@ static struct sock *__vsock_create(struct net *net,
vsk = vsock_sk(sk);
vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
- vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+
+ remote_addr = kmalloc(sizeof(*remote_addr), GFP_KERNEL);
+ if (!remote_addr) {
+ sk_free(sk);
+ return NULL;
+ }
+ vsock_addr_init(remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
+ rcu_assign_pointer(vsk->remote_addr, remote_addr);
sk->sk_destruct = vsock_sk_destruct;
sk->sk_backlog_rcv = vsock_queue_rcv_skb;
@@ -845,6 +1015,7 @@ static void __vsock_release(struct sock *sk, int level)
static void vsock_sk_destruct(struct sock *sk)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct sockaddr_vm_rcu *remote_addr;
vsock_deassign_transport(vsk);
@@ -852,8 +1023,8 @@ static void vsock_sk_destruct(struct sock *sk)
* possibly register the address family with the kernel.
*/
vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
- vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
-
+ remote_addr = rcu_replace_pointer(vsk->remote_addr, NULL, 1);
+ kfree_rcu(remote_addr);
put_cred(vsk->owner);
}
@@ -943,6 +1114,7 @@ static int vsock_getname(struct socket *sock,
struct sock *sk;
struct vsock_sock *vsk;
struct sockaddr_vm *vm_addr;
+ struct sockaddr_vm_rcu *rcu_ptr;
sk = sock->sk;
vsk = vsock_sk(sk);
@@ -951,11 +1123,17 @@ static int vsock_getname(struct socket *sock,
lock_sock(sk);
if (peer) {
+ rcu_read_lock();
if (sock->state != SS_CONNECTED) {
err = -ENOTCONN;
goto out;
}
- vm_addr = &vsk->remote_addr;
+ rcu_ptr = rcu_dereference(vsk->remote_addr);
+ if (!rcu_ptr) {
+ err = -EINVAL;
+ goto out;
+ }
+ vm_addr = &rcu_ptr->addr;
} else {
vm_addr = &vsk->local_addr;
}
@@ -975,6 +1153,8 @@ static int vsock_getname(struct socket *sock,
err = sizeof(*vm_addr);
out:
+ if (peer)
+ rcu_read_unlock();
release_sock(sk);
return err;
}
@@ -1161,7 +1341,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
int err;
struct sock *sk;
struct vsock_sock *vsk;
- struct sockaddr_vm *remote_addr;
+ struct sockaddr_vm stack_addr, *remote_addr;
const struct vsock_transport *transport;
if (msg->msg_flags & MSG_OOB)
@@ -1172,15 +1352,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
sk = sock->sk;
vsk = vsock_sk(sk);
- lock_sock(sk);
+ /* If auto-binding is required, acquire the slock to avoid potential
+ * race conditions. Otherwise, do not acquire the lock.
+ *
+ * We know that the first check of local_addr is racy (indicated by
+ * data_race()). By acquiring the lock and then subsequently checking
+ * again if local_addr is bound (inside vsock_auto_bind()), we can
+ * ensure there are no real data races.
+ *
+ * This technique is borrowed by inet_send_prepare().
+ */
+ if (data_race(!vsock_addr_bound(&vsk->local_addr))) {
+ lock_sock(sk);
+ err = vsock_auto_bind(vsk);
+ release_sock(sk);
+ if (err)
+ return err;
+ }
transport = vsk->transport;
- err = vsock_auto_bind(vsk);
- if (err)
- goto out;
-
-
/* If the provided message contains an address, use that. Otherwise
* fall back on the socket's remote handle (if it has been connected).
*/
@@ -1199,18 +1390,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}
} else if (sock->state == SS_CONNECTED) {
- remote_addr = &vsk->remote_addr;
+ err = vsock_remote_addr_copy(vsk, &stack_addr);
+ if (err < 0)
+ goto out;
- if (remote_addr->svm_cid == VMADDR_CID_ANY)
- remote_addr->svm_cid = transport->get_local_cid();
+ if (stack_addr.svm_cid == VMADDR_CID_ANY) {
+ stack_addr.svm_cid = transport->get_local_cid();
+ lock_sock(sk_vsock(vsk));
+ vsock_remote_addr_update(vsk, &stack_addr);
+ release_sock(sk_vsock(vsk));
+ }
/* XXX Should connect() or this function ensure remote_addr is
* bound?
*/
- if (!vsock_addr_bound(&vsk->remote_addr)) {
+ if (!vsock_addr_bound(&stack_addr)) {
err = -EINVAL;
goto out;
}
+
+ remote_addr = &stack_addr;
} else {
err = -EINVAL;
goto out;
@@ -1225,7 +1424,6 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
out:
- release_sock(sk);
return err;
}
@@ -1243,8 +1441,7 @@ static int vsock_dgram_connect(struct socket *sock,
err = vsock_addr_cast(addr, addr_len, &remote_addr);
if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
lock_sock(sk);
- vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
- VMADDR_PORT_ANY);
+ vsock_remote_addr_update_cid_port(vsk, VMADDR_CID_ANY, VMADDR_PORT_ANY);
sock->state = SS_UNCONNECTED;
release_sock(sk);
return 0;
@@ -1263,7 +1460,10 @@ static int vsock_dgram_connect(struct socket *sock,
goto out;
}
- memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
+ err = vsock_remote_addr_update(vsk, remote_addr);
+ if (err < 0)
+ goto out;
+
sock->state = SS_CONNECTED;
/* sock map disallows redirection of non-TCP sockets with sk_state !=
@@ -1399,8 +1599,9 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
}
/* Set the remote address that we are connecting to. */
- memcpy(&vsk->remote_addr, remote_addr,
- sizeof(vsk->remote_addr));
+ err = vsock_remote_addr_update(vsk, remote_addr);
+ if (err)
+ goto out;
err = vsock_assign_transport(vsk, NULL);
if (err)
@@ -1831,7 +2032,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}
- if (!vsock_addr_bound(&vsk->remote_addr)) {
+ if (!vsock_remote_addr_bound(vsk)) {
err = -EDESTADDRREQ;
goto out;
}
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
index a2823b1c5e28..f843bae86b32 100644
--- a/net/vmw_vsock/diag.c
+++ b/net/vmw_vsock/diag.c
@@ -15,8 +15,14 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
u32 portid, u32 seq, u32 flags)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct sockaddr_vm remote_addr;
struct vsock_diag_msg *rep;
struct nlmsghdr *nlh;
+ int err;
+
+ err = vsock_remote_addr_copy(vsk, &remote_addr);
+ if (err < 0)
+ return err;
nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
flags);
@@ -36,8 +42,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
rep->vdiag_shutdown = sk->sk_shutdown;
rep->vdiag_src_cid = vsk->local_addr.svm_cid;
rep->vdiag_src_port = vsk->local_addr.svm_port;
- rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
- rep->vdiag_dst_port = vsk->remote_addr.svm_port;
+ rep->vdiag_dst_cid = remote_addr.svm_cid;
+ rep->vdiag_dst_port = remote_addr.svm_port;
rep->vdiag_ino = sock_i_ino(sk);
sock_diag_save_cookie(sk, rep->vdiag_cookie);
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 7cb1a9d2cdb4..462b2ec3e6e9 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -336,9 +336,11 @@ static void hvs_open_connection(struct vmbus_channel *chan)
hvs_addr_init(&vnew->local_addr, if_type);
/* Remote peer is always the host */
- vsock_addr_init(&vnew->remote_addr,
- VMADDR_CID_HOST, VMADDR_PORT_ANY);
- vnew->remote_addr.svm_port = get_port_by_srv_id(if_instance);
+ ret = vsock_remote_addr_update_cid_port(vnew, VMADDR_CID_HOST,
+ get_port_by_srv_id(if_instance));
+ if (ret < 0)
+ goto out;
+
ret = vsock_assign_transport(vnew, vsock_sk(sk));
/* Transport assigned (looking at remote_addr) must be the
* same where we received the request.
@@ -459,13 +461,18 @@ static int hvs_connect(struct vsock_sock *vsk)
{
union hvs_service_id vm, host;
struct hvsock *h = vsk->trans;
+ int err;
vm.srv_id = srv_id_template;
vm.svm_port = vsk->local_addr.svm_port;
h->vm_srv_id = vm.srv_id;
host.srv_id = srv_id_template;
- host.svm_port = vsk->remote_addr.svm_port;
+
+ err = vsock_remote_addr_port(vsk, &host.svm_port);
+ if (err < 0)
+ return err;
+
h->host_srv_id = host.srv_id;
return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 925acface893..1b87704e516a 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -258,8 +258,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
src_cid = t_ops->transport.get_local_cid();
src_port = vsk->local_addr.svm_port;
if (!info->remote_cid) {
- dst_cid = vsk->remote_addr.svm_cid;
- dst_port = vsk->remote_addr.svm_port;
+ ret = vsock_remote_addr_cid_port(vsk, &dst_cid, &dst_port);
+ if (ret < 0)
+ return ret;
} else {
dst_cid = info->remote_cid;
dst_port = info->remote_port;
@@ -1169,7 +1170,9 @@ virtio_transport_recv_connecting(struct sock *sk,
case VIRTIO_VSOCK_OP_RESPONSE:
sk->sk_state = TCP_ESTABLISHED;
sk->sk_socket->state = SS_CONNECTED;
- vsock_insert_connected(vsk);
+ err = vsock_insert_connected(vsk);
+ if (err)
+ goto destroy;
sk->sk_state_change(sk);
break;
case VIRTIO_VSOCK_OP_INVALID:
@@ -1403,9 +1406,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
vchild = vsock_sk(child);
vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
le32_to_cpu(hdr->dst_port));
- vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
- le32_to_cpu(hdr->src_port));
-
+ vsock_remote_addr_update_cid_port(vchild, le64_to_cpu(hdr->src_cid),
+ le32_to_cpu(hdr->src_port));
ret = vsock_assign_transport(vchild, vsk);
/* Transport assigned (looking at remote_addr) must be the same
* where we received the request.
@@ -1420,7 +1422,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
if (virtio_transport_space_update(child, skb))
child->sk_write_space(child);
- vsock_insert_connected(vchild);
+ ret = vsock_insert_connected(vchild);
+ if (ret) {
+ release_sock(child);
+ virtio_transport_reset_no_sock(t, skb);
+ sock_put(child);
+ return ret;
+ }
vsock_enqueue_accept(sk, child);
virtio_transport_send_response(vchild, skb);
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index b370070194fa..c0c445e7d925 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -283,18 +283,25 @@ vmci_transport_send_control_pkt(struct sock *sk,
u16 proto,
struct vmci_handle handle)
{
+ struct sockaddr_vm addr_stack;
+ struct sockaddr_vm *remote_addr = &addr_stack;
struct vsock_sock *vsk;
+ int err;
vsk = vsock_sk(sk);
if (!vsock_addr_bound(&vsk->local_addr))
return -EINVAL;
- if (!vsock_addr_bound(&vsk->remote_addr))
+ if (!vsock_remote_addr_bound(vsk))
return -EINVAL;
+ err = vsock_remote_addr_copy(vsk, &addr_stack);
+ if (err < 0)
+ return err;
+
return vmci_transport_alloc_send_control_pkt(&vsk->local_addr,
- &vsk->remote_addr,
+ remote_addr,
type, size, mode,
wait, proto, handle);
}
@@ -317,6 +324,7 @@ static int vmci_transport_send_reset(struct sock *sk,
struct sockaddr_vm *dst_ptr;
struct sockaddr_vm dst;
struct vsock_sock *vsk;
+ int err;
if (pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST)
return 0;
@@ -326,13 +334,16 @@ static int vmci_transport_send_reset(struct sock *sk,
if (!vsock_addr_bound(&vsk->local_addr))
return -EINVAL;
- if (vsock_addr_bound(&vsk->remote_addr)) {
- dst_ptr = &vsk->remote_addr;
+ if (vsock_remote_addr_bound(vsk)) {
+ err = vsock_remote_addr_copy(vsk, &dst);
+ if (err < 0)
+ return err;
} else {
vsock_addr_init(&dst, pkt->dg.src.context,
pkt->src_port);
- dst_ptr = &dst;
}
+ dst_ptr = &dst;
+
return vmci_transport_alloc_send_control_pkt(&vsk->local_addr, dst_ptr,
VMCI_TRANSPORT_PACKET_TYPE_RST,
0, 0, NULL, VSOCK_PROTO_INVALID,
@@ -490,7 +501,7 @@ static struct sock *vmci_transport_get_pending(
list_for_each_entry(vpending, &vlistener->pending_links,
pending_links) {
- if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
+ if (vsock_remote_addr_equals(vpending, &src) &&
pkt->dst_port == vpending->local_addr.svm_port) {
pending = sk_vsock(vpending);
sock_hold(pending);
@@ -1015,8 +1026,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
vsock_addr_init(&vpending->local_addr, pkt->dg.dst.context,
pkt->dst_port);
- vsock_addr_init(&vpending->remote_addr, pkt->dg.src.context,
- pkt->src_port);
+ vsock_remote_addr_update_cid_port(vpending, pkt->dg.src.context,
+ pkt->src_port);
err = vsock_assign_transport(vpending, vsock_sk(sk));
/* Transport assigned (looking at remote_addr) must be the same
@@ -1133,6 +1144,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
{
struct vsock_sock *vpending;
struct vmci_handle handle;
+ unsigned int vpending_remote_cid;
struct vmci_qp *qpair;
bool is_local;
u32 flags;
@@ -1189,8 +1201,13 @@ vmci_transport_recv_connecting_server(struct sock *listener,
/* vpending->local_addr always has a context id so we do not need to
* worry about VMADDR_CID_ANY in this case.
*/
- is_local =
- vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
+ err = vsock_remote_addr_cid(vpending, &vpending_remote_cid);
+ if (err < 0) {
+ skerr = EPROTO;
+ goto destroy;
+ }
+
+ is_local = vpending_remote_cid == vpending->local_addr.svm_cid;
flags = VMCI_QPFLAG_ATTACH_ONLY;
flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
@@ -1203,7 +1220,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
flags,
vmci_transport_is_trusted(
vpending,
- vpending->remote_addr.svm_cid));
+ vpending_remote_cid));
if (err < 0) {
vmci_transport_send_reset(pending, pkt);
skerr = -err;
@@ -1306,9 +1323,20 @@ vmci_transport_recv_connecting_client(struct sock *sk,
break;
case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE:
case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE2:
+ struct sockaddr_vm_rcu *remote_addr;
+
+ rcu_read_lock();
+ remote_addr = rcu_dereference(vsk->remote_addr);
+ if (!remote_addr) {
+ skerr = EPROTO;
+ err = -EINVAL;
+ rcu_read_unlock();
+ goto destroy;
+ }
+
if (pkt->u.size == 0
- || pkt->dg.src.context != vsk->remote_addr.svm_cid
- || pkt->src_port != vsk->remote_addr.svm_port
+ || pkt->dg.src.context != remote_addr->addr.svm_cid
+ || pkt->src_port != remote_addr->addr.svm_port
|| !vmci_handle_is_invalid(vmci_trans(vsk)->qp_handle)
|| vmci_trans(vsk)->qpair
|| vmci_trans(vsk)->produce_size != 0
@@ -1316,9 +1344,10 @@ vmci_transport_recv_connecting_client(struct sock *sk,
|| vmci_trans(vsk)->detach_sub_id != VMCI_INVALID_ID) {
skerr = EPROTO;
err = -EINVAL;
-
+ rcu_read_unlock();
goto destroy;
}
+ rcu_read_unlock();
err = vmci_transport_recv_connecting_client_negotiate(sk, pkt);
if (err) {
@@ -1379,6 +1408,7 @@ static int vmci_transport_recv_connecting_client_negotiate(
int err;
struct vsock_sock *vsk;
struct vmci_handle handle;
+ unsigned int remote_cid;
struct vmci_qp *qpair;
u32 detach_sub_id;
bool is_local;
@@ -1449,19 +1479,23 @@ static int vmci_transport_recv_connecting_client_negotiate(
/* Make VMCI select the handle for us. */
handle = VMCI_INVALID_HANDLE;
- is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
+
+ err = vsock_remote_addr_cid(vsk, &remote_cid);
+ if (err < 0)
+ goto destroy;
+
+ is_local = remote_cid == vsk->local_addr.svm_cid;
flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
err = vmci_transport_queue_pair_alloc(&qpair,
&handle,
pkt->u.size,
pkt->u.size,
- vsk->remote_addr.svm_cid,
+ remote_cid,
flags,
vmci_transport_is_trusted(
vsk,
- vsk->
- remote_addr.svm_cid));
+ remote_cid));
if (err < 0)
goto destroy;
--
2.30.2
From: Jiang Wang <[email protected]>
This patch adds tests for vsock datagram.
Signed-off-by: Bobby Eshleman <[email protected]>
Signed-off-by: Jiang Wang <[email protected]>
---
tools/testing/vsock/util.c | 105 +++++++++++++++++++++
tools/testing/vsock/util.h | 4 +
tools/testing/vsock/vsock_test.c | 193 +++++++++++++++++++++++++++++++++++++++
3 files changed, 302 insertions(+)
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 01b636d3039a..45e35da48b40 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -260,6 +260,57 @@ void send_byte(int fd, int expected_ret, int flags)
}
}
+/* Transmit one byte and check the return value.
+ *
+ * expected_ret:
+ * <0 Negative errno (for testing errors)
+ * 0 End-of-file
+ * 1 Success
+ */
+void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
+ int flags)
+{
+ const uint8_t byte = 'A';
+ ssize_t nwritten;
+
+ timeout_begin(TIMEOUT);
+ do {
+ nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
+ len);
+ timeout_check("write");
+ } while (nwritten < 0 && errno == EINTR);
+ timeout_end();
+
+ if (expected_ret < 0) {
+ if (nwritten != -1) {
+ fprintf(stderr, "bogus sendto(2) return value %zd\n",
+ nwritten);
+ exit(EXIT_FAILURE);
+ }
+ if (errno != -expected_ret) {
+ perror("write");
+ exit(EXIT_FAILURE);
+ }
+ return;
+ }
+
+ if (nwritten < 0) {
+ perror("write");
+ exit(EXIT_FAILURE);
+ }
+ if (nwritten == 0) {
+ if (expected_ret == 0)
+ return;
+
+ fprintf(stderr, "unexpected EOF while sending byte\n");
+ exit(EXIT_FAILURE);
+ }
+ if (nwritten != sizeof(byte)) {
+ fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
+ exit(EXIT_FAILURE);
+ }
+}
+
/* Receive one byte and check the return value.
*
* expected_ret:
@@ -313,6 +364,60 @@ void recv_byte(int fd, int expected_ret, int flags)
}
}
+/* Receive one byte and check the return value.
+ *
+ * expected_ret:
+ * <0 Negative errno (for testing errors)
+ * 0 End-of-file
+ * 1 Success
+ */
+void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
+ int expected_ret, int flags)
+{
+ uint8_t byte;
+ ssize_t nread;
+
+ timeout_begin(TIMEOUT);
+ do {
+ nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
+ timeout_check("read");
+ } while (nread < 0 && errno == EINTR);
+ timeout_end();
+
+ if (expected_ret < 0) {
+ if (nread != -1) {
+ fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
+ nread);
+ exit(EXIT_FAILURE);
+ }
+ if (errno != -expected_ret) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ return;
+ }
+
+ if (nread < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if (nread == 0) {
+ if (expected_ret == 0)
+ return;
+
+ fprintf(stderr, "unexpected EOF while receiving byte\n");
+ exit(EXIT_FAILURE);
+ }
+ if (nread != sizeof(byte)) {
+ fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
+ exit(EXIT_FAILURE);
+ }
+ if (byte != 'A') {
+ fprintf(stderr, "unexpected byte read %c\n", byte);
+ exit(EXIT_FAILURE);
+ }
+}
+
/* Run test cases. The program terminates if a failure occurs. */
void run_tests(const struct test_case *test_cases,
const struct test_opts *opts)
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index fb99208a95ea..6e5cd610bf05 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -43,7 +43,11 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
struct sockaddr_vm *clientaddrp);
void vsock_wait_remote_close(int fd);
void send_byte(int fd, int expected_ret, int flags);
+void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
+ int flags);
void recv_byte(int fd, int expected_ret, int flags);
+void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
+ int expected_ret, int flags);
void run_tests(const struct test_case *test_cases,
const struct test_opts *opts);
void list_tests(const struct test_case *test_cases);
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index ac1bd3ac1533..851c3d65178d 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -202,6 +202,113 @@ static void test_stream_server_close_server(const struct test_opts *opts)
close(fd);
}
+static void test_dgram_sendto_client(const struct test_opts *opts)
+{
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = 1234,
+ .svm_cid = opts->peer_cid,
+ },
+ };
+ int fd;
+
+ /* Wait for the server to be ready */
+ control_expectln("BIND");
+
+ fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+ if (fd < 0) {
+ perror("socket");
+ exit(EXIT_FAILURE);
+ }
+
+ sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
+
+ /* Notify the server that the client has finished */
+ control_writeln("DONE");
+
+ close(fd);
+}
+
+static void test_dgram_sendto_server(const struct test_opts *opts)
+{
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = 1234,
+ .svm_cid = VMADDR_CID_ANY,
+ },
+ };
+ int fd;
+ int len = sizeof(addr.sa);
+
+ fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+
+ if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+ perror("bind");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Notify the client that the server is ready */
+ control_writeln("BIND");
+
+ recvfrom_byte(fd, &addr.sa, &len, 1, 0);
+
+ /* Wait for the client to finish */
+ control_expectln("DONE");
+
+ close(fd);
+}
+
+static void test_dgram_connect_client(const struct test_opts *opts)
+{
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = 1234,
+ .svm_cid = opts->peer_cid,
+ },
+ };
+ int fd;
+ int ret;
+
+ /* Wait for the server to be ready */
+ control_expectln("BIND");
+
+ fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+ if (fd < 0) {
+ perror("bind");
+ exit(EXIT_FAILURE);
+ }
+
+ ret = connect(fd, &addr.sa, sizeof(addr.svm));
+ if (ret < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ send_byte(fd, 1, 0);
+
+ /* Notify the server that the client has finished */
+ control_writeln("DONE");
+
+ close(fd);
+}
+
+static void test_dgram_connect_server(const struct test_opts *opts)
+{
+ test_dgram_sendto_server(opts);
+}
+
/* With the standard socket sizes, VMCI is able to support about 100
* concurrent stream connections.
*/
@@ -255,6 +362,77 @@ static void test_stream_multiconn_server(const struct test_opts *opts)
close(fds[i]);
}
+static void test_dgram_multiconn_client(const struct test_opts *opts)
+{
+ int fds[MULTICONN_NFDS];
+ int i;
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = 1234,
+ .svm_cid = opts->peer_cid,
+ },
+ };
+
+ /* Wait for the server to be ready */
+ control_expectln("BIND");
+
+ for (i = 0; i < MULTICONN_NFDS; i++) {
+ fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
+ if (fds[i] < 0) {
+ perror("socket");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < MULTICONN_NFDS; i++)
+ sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
+
+ /* Notify the server that the client has finished */
+ control_writeln("DONE");
+
+ for (i = 0; i < MULTICONN_NFDS; i++)
+ close(fds[i]);
+}
+
+static void test_dgram_multiconn_server(const struct test_opts *opts)
+{
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = 1234,
+ .svm_cid = VMADDR_CID_ANY,
+ },
+ };
+ int fd;
+ int len = sizeof(addr.sa);
+ int i;
+
+ fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+
+ if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+ perror("bind");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Notify the client that the server is ready */
+ control_writeln("BIND");
+
+ for (i = 0; i < MULTICONN_NFDS; i++)
+ recvfrom_byte(fd, &addr.sa, &len, 1, 0);
+
+ /* Wait for the client to finish */
+ control_expectln("DONE");
+
+ close(fd);
+}
+
static void test_stream_msg_peek_client(const struct test_opts *opts)
{
int fd;
@@ -1128,6 +1306,21 @@ static struct test_case test_cases[] = {
.run_client = test_stream_virtio_skb_merge_client,
.run_server = test_stream_virtio_skb_merge_server,
},
+ {
+ .name = "SOCK_DGRAM client close",
+ .run_client = test_dgram_sendto_client,
+ .run_server = test_dgram_sendto_server,
+ },
+ {
+ .name = "SOCK_DGRAM client connect",
+ .run_client = test_dgram_connect_client,
+ .run_server = test_dgram_connect_server,
+ },
+ {
+ .name = "SOCK_DGRAM multiple connections",
+ .run_client = test_dgram_multiconn_client,
+ .run_server = test_dgram_multiconn_server,
+ },
{},
};
--
2.30.2
Hi Bobby,
> /* The feature bitmap for virtio vsock */
> #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
> +#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
Seems that bit 2 is already taken by VIRTIO_VSOCK_F_NO_IMPLIED_STREAM.
https://github.com/oasis-tcs/virtio-spec/commit/26ed30ccb049fd51d6e20aad3de2807d678edb3a
On Fri, Apr 14, 2023 at 08:47:52AM +0000, Alvaro Karsz wrote:
> Hi Bobby,
>
> > /* The feature bitmap for virtio vsock */
> > #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
> > +#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
>
> Seems that bit 2 is already taken by VIRTIO_VSOCK_F_NO_IMPLIED_STREAM.
>
> https://github.com/oasis-tcs/virtio-spec/commit/26ed30ccb049fd51d6e20aad3de2807d678edb3a
Right! I'll bump that in the next rev.
Thanks,
Bobby
CC'ing Cong.
On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
> Hey all!
>
> This series introduces support for datagrams to virtio/vsock.
>
> It is a spin-off (and smaller version) of this series from the summer:
> https://lore.kernel.org/all/[email protected]/
>
> Please note that this is an RFC and should not be merged until
> associated changes are made to the virtio specification, which will
> follow after discussion from this series.
>
> This series first supports datagrams in a basic form for virtio, and
> then optimizes the sendpath for all transports.
>
> The result is a very fast datagram communication protocol that
> outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
> of multi-threaded workload samples.
>
> For those that are curious, some summary data comparing UDP and VSOCK
> DGRAM (N=5):
>
> vCPUS: 16
> virtio-net queues: 16
> payload size: 4KB
> Setup: bare metal + vm (non-nested)
>
> UDP: 287.59 MB/s
> VSOCK DGRAM: 509.2 MB/s
>
> Some notes about the implementation...
>
> This datagram implementation forces datagrams to self-throttle according
> to the threshold set by sk_sndbuf. It behaves similar to the credits
> used by streams in its effect on throughput and memory consumption, but
> it is not influenced by the receiving socket as credits are.
>
> The device drops packets silently. There is room for improvement by
> building into the device and driver some intelligence around how to
> reduce frequency of kicking the virtqueue when packet loss is high. I
> think there is a good discussion to be had on this.
>
> In this series I am also proposing that fairness be reexamined as an
> issue separate from datagrams, which differs from my previous series
> that coupled these issues. After further testing and reflection on the
> design, I do not believe that these need to be coupled and I do not
> believe this implementation introduces additional unfairness or
> exacerbates pre-existing unfairness.
>
> I attempted to characterize vsock fairness by using a pool of processes
> to stress test the shared resources while measuring the performance of a
> lone stream socket. Given unfair preference for datagrams, we would
> assume that a lone stream socket would degrade much more when a pool of
> datagram sockets was stressing the system than when a pool of stream
> sockets are stressing the system. The result, however, showed no
> significant difference between the degradation of throughput of the lone
> stream socket when using a pool of datagrams to stress the queue over
> using a pool of streams. The absolute difference in throughput actually
> favored datagrams as interfering least as the mean difference was +16%
> compared to using streams to stress test (N=7), but it was not
> statistically significant. Workloads were matched for payload size and
> buffer size (to approximate memory consumption) and process count, and
> stress workloads were configured to start before and last long after the
> lifetime of the "lone" stream socket flow to ensure that competing flows
> were continuously hot.
>
> Given the above data, I propose that vsock fairness be addressed
> independent of datagrams and to defer its implementation to a future
> series.
>
> Signed-off-by: Bobby Eshleman <[email protected]>
> ---
> Bobby Eshleman (3):
> virtio/vsock: support dgram
> virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> vsock: Add lockless sendmsg() support
>
> Jiang Wang (1):
> tests: add vsock dgram tests
>
> drivers/vhost/vsock.c | 17 +-
> include/net/af_vsock.h | 20 ++-
> include/uapi/linux/virtio_vsock.h | 2 +
> net/vmw_vsock/af_vsock.c | 287 ++++++++++++++++++++++++++++----
> net/vmw_vsock/diag.c | 10 +-
> net/vmw_vsock/hyperv_transport.c | 15 +-
> net/vmw_vsock/virtio_transport.c | 10 +-
> net/vmw_vsock/virtio_transport_common.c | 221 ++++++++++++++++++++----
> net/vmw_vsock/vmci_transport.c | 70 ++++++--
> tools/testing/vsock/util.c | 105 ++++++++++++
> tools/testing/vsock/util.h | 4 +
> tools/testing/vsock/vsock_test.c | 193 +++++++++++++++++++++
> 12 files changed, 859 insertions(+), 95 deletions(-)
> ---
> base-commit: ed72bd5a6790a0c3747cb32b0427f921bd03bb71
> change-id: 20230413-b4-vsock-dgram-3b6eba6a64e5
>
> Best regards,
> --
> Bobby Eshleman <[email protected]>
>
On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
>Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
>it does not scale when many senders share a socket.
>
>Prior to this patch the socket lock is used to protect the local_addr,
>remote_addr, transport, and buffer size variables. What follows are the
>new protection schemes for the various protected fields that ensure a
>race-free multi-sender sendmsg() path for vsock dgrams.
>
>- local_addr
> local_addr changes as a result of binding a socket. The write path
> for local_addr is bind() and various vsock_auto_bind() call sites.
> After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> rejects the user request and vsock_auto_bind() early exits.
> Therefore, the local addr can not change while a parallel thread is
> in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> Change: only acquire lock for auto-binding as-needed in sendmsg().
>
>- vsk->transport
> Updated upon socket creation and it doesn't change again until the
This is true only for dgram, right?
How do we decide which transport to assign for dgram?
> socket is destroyed, which only happens after the socket refcnt reaches
> zero. This prevents any sendmsg() call from being entered because the
> sockfd lookup fails beforehand. That is, sendmsg() and vsk->transport
> writes cannot execute in parallel. Additionally, connect() doesn't
> update vsk->transport for dgrams as it does for streams. Therefore
> vsk->transport is also safe to access lock-free in the sendmsg() path.
> No change.
>
>- buffer size variables
> Not used by dgram, so they do not need protection. No change.
Is this true because for dgram we use the socket buffer?
Is it the same for VMCI?
>
>- remote_addr
> Needs additional protection because before this patch the
> remote_addr (consisting of several fields such as cid, port, and flags)
> only changed atomically under socket lock context. By acquiring the
> socket lock to read the structure, the changes made by connect() were
> always made visible to sendmsg() atomically. Consequently, to retain
> atomicity of updates but offer lock-free access, this patch
> redesigns this field as an RCU-protected pointer.
>
> Writers are still synchronized using the socket lock, but readers
> only read inside RCU read-side critical sections.
>
>Helpers are introduced for accessing and updating the new pointer.
>
>The remote_addr structure is wrapped together with an rcu_head into a
>sockaddr_vm_rcu structure so that kfree_rcu() can be used. This removes
>the need of writers to use synchronize_rcu() after freeing old structures
>which is simply more efficient and reduces code churn where remote_addr
>is already being updated inside read-side sections.
>
>Only virtio has been tested, but updates were necessary to the VMCI and
>hyperv code. Unfortunately the author does not have access to
>VMCI/hyperv systems so those changes are untested.
>
>Perf Tests
>vCPUS: 16
>Threads: 16
>Payload: 4KB
>Test Runs: 5
>Type: SOCK_DGRAM
>
>Before: 245.2 MB/s
>After: 509.2 MB/s (+107%)
>
>Notably, on the same test system, vsock dgram even outperforms
>multi-threaded UDP over virtio-net with vhost and MQ support enabled.
Cool!
This patch is quite large, so I need to review it carefully in future
versions, but in general it makes sense to me.
Thanks,
Stefano
>
>Throughput metrics for single-threaded SOCK_DGRAM and
>single/multi-threaded SOCK_STREAM showed no statistically signficant
>throughput changes (lowest p-value reaching 0.27), with the range of the
>mean difference ranging between -5% to +1%.
>
>Signed-off-by: Bobby Eshleman <[email protected]>
>---
> drivers/vhost/vsock.c | 12 +-
> include/net/af_vsock.h | 19 ++-
> net/vmw_vsock/af_vsock.c | 261 ++++++++++++++++++++++++++++----
> net/vmw_vsock/diag.c | 10 +-
> net/vmw_vsock/hyperv_transport.c | 15 +-
> net/vmw_vsock/virtio_transport_common.c | 22 ++-
> net/vmw_vsock/vmci_transport.c | 70 ++++++---
> 7 files changed, 344 insertions(+), 65 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 028cf079225e..da105cb856ac 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -296,13 +296,17 @@ static int
> vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> {
> struct vhost_vsock *vsock;
>+ unsigned int cid;
> int cnt = 0;
> int ret = -ENODEV;
>
> rcu_read_lock();
>+ ret = vsock_remote_addr_cid(vsk, &cid);
>+ if (ret < 0)
>+ goto out;
>
> /* Find the vhost_vsock according to guest context id */
>- vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
>+ vsock = vhost_vsock_get(cid);
> if (!vsock)
> goto out;
>
>@@ -686,6 +690,10 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
> static void vhost_vsock_reset_orphans(struct sock *sk)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
>+ unsigned int cid;
>+
>+ if (vsock_remote_addr_cid(vsk, &cid) < 0)
>+ return;
>
> /* vmci_transport.c doesn't take sk_lock here either. At least we're
> * under vsock_table_lock so the sock cannot disappear while we're
>@@ -693,7 +701,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
> */
>
> /* If the peer is still valid, no need to reset connection */
>- if (vhost_vsock_get(vsk->remote_addr.svm_cid))
>+ if (vhost_vsock_get(cid))
> return;
>
> /* If the close timeout is pending, let it expire. This avoids races
>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>index 57af28fede19..c02fd6ad0047 100644
>--- a/include/net/af_vsock.h
>+++ b/include/net/af_vsock.h
>@@ -25,12 +25,17 @@ extern spinlock_t vsock_table_lock;
> #define vsock_sk(__sk) ((struct vsock_sock *)__sk)
> #define sk_vsock(__vsk) (&(__vsk)->sk)
>
>+struct sockaddr_vm_rcu {
>+ struct sockaddr_vm addr;
>+ struct rcu_head rcu;
>+};
>+
> struct vsock_sock {
> /* sk must be the first member. */
> struct sock sk;
> const struct vsock_transport *transport;
> struct sockaddr_vm local_addr;
>- struct sockaddr_vm remote_addr;
>+ struct sockaddr_vm_rcu * __rcu remote_addr;
> /* Links for the global tables of bound and connected sockets. */
> struct list_head bound_table;
> struct list_head connected_table;
>@@ -206,7 +211,7 @@ void vsock_release_pending(struct sock *pending);
> void vsock_add_pending(struct sock *listener, struct sock *pending);
> void vsock_remove_pending(struct sock *listener, struct sock *pending);
> void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
>-void vsock_insert_connected(struct vsock_sock *vsk);
>+int vsock_insert_connected(struct vsock_sock *vsk);
> void vsock_remove_bound(struct vsock_sock *vsk);
> void vsock_remove_connected(struct vsock_sock *vsk);
> struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
>@@ -244,4 +249,14 @@ static inline void __init vsock_bpf_build_proto(void)
> {}
> #endif
>
>+/* RCU-protected remote addr helpers */
>+int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid);
>+int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port);
>+int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
>+ unsigned int *port);
>+int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest);
>+bool vsock_remote_addr_bound(struct vsock_sock *vsk);
>+bool vsock_remote_addr_equals(struct vsock_sock *vsk, struct sockaddr_vm *other);
>+int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port);
>+
> #endif /* __AF_VSOCK_H__ */
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index 46b3f35e3adc..93b4abbf20b4 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -145,6 +145,139 @@ static const struct vsock_transport *transport_local;
> static DEFINE_MUTEX(vsock_register_mutex);
>
> /**** UTILS ****/
>+bool vsock_remote_addr_bound(struct vsock_sock *vsk)
>+{
>+ struct sockaddr_vm_rcu *remote_addr;
>+ bool ret;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (!remote_addr) {
>+ rcu_read_unlock();
>+ return false;
>+ }
>+
>+ ret = vsock_addr_bound(&remote_addr->addr);
>+ rcu_read_unlock();
>+
>+ return ret;
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_bound);
>+
>+int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest)
>+{
>+ struct sockaddr_vm_rcu *remote_addr;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (!remote_addr) {
>+ rcu_read_unlock();
>+ return -EINVAL;
>+ }
>+ memcpy(dest, &remote_addr->addr, sizeof(*dest));
>+ rcu_read_unlock();
>+
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_copy);
>+
>+int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid)
>+{
>+ return vsock_remote_addr_cid_port(vsk, cid, NULL);
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_cid);
>+
>+int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port)
>+{
>+ return vsock_remote_addr_cid_port(vsk, NULL, port);
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_port);
>+
>+int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
>+ unsigned int *port)
>+{
>+ struct sockaddr_vm_rcu *remote_addr;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (!remote_addr) {
>+ rcu_read_unlock();
>+ return -EINVAL;
>+ }
>+
>+ if (cid)
>+ *cid = remote_addr->addr.svm_cid;
>+ if (port)
>+ *port = remote_addr->addr.svm_port;
>+
>+ rcu_read_unlock();
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_cid_port);
>+
>+/* The socket lock must be held by the caller */
>+int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port)
>+{
>+ struct sockaddr_vm_rcu *old, *new;
>+
>+ new = kmalloc(sizeof(*new), GFP_KERNEL);
>+ if (!new)
>+ return -ENOMEM;
>+
>+ rcu_read_lock();
>+ old = rcu_dereference(vsk->remote_addr);
>+ if (!old) {
>+ kfree(new);
>+ return -EINVAL;
>+ }
>+ memcpy(&new->addr, &old->addr, sizeof(new->addr));
>+ rcu_read_unlock();
>+
>+ new->addr.svm_cid = cid;
>+ new->addr.svm_port = port;
>+
>+ old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
>+ kfree_rcu(old, rcu);
>+
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_update_cid_port);
>+
>+/* The socket lock must be held by the caller */
>+int vsock_remote_addr_update(struct vsock_sock *vsk, struct sockaddr_vm *src)
>+{
>+ struct sockaddr_vm_rcu *old, *new;
>+
>+ new = kmalloc(sizeof(*new), GFP_KERNEL);
>+ if (!new)
>+ return -ENOMEM;
>+
>+ memcpy(&new->addr, src, sizeof(new->addr));
>+ old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
>+ kfree_rcu(old, rcu);
>+
>+ return 0;
>+}
>+
>+bool vsock_remote_addr_equals(struct vsock_sock *vsk,
>+ struct sockaddr_vm *other)
>+{
>+ struct sockaddr_vm_rcu *remote_addr;
>+ bool equals;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (!remote_addr) {
>+ rcu_read_unlock();
>+ return false;
>+ }
>+
>+ equals = vsock_addr_equals_addr(&remote_addr->addr, other);
>+ rcu_read_unlock();
>+
>+ return equals;
>+}
>+EXPORT_SYMBOL_GPL(vsock_remote_addr_equals);
>
> /* Each bound VSocket is stored in the bind hash table and each connected
> * VSocket is stored in the connected hash table.
>@@ -254,10 +387,16 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
>
> list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
> connected_table) {
>- if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
>+ struct sockaddr_vm_rcu *remote_addr;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (vsock_addr_equals_addr(src, &remote_addr->addr) &&
> dst->svm_port == vsk->local_addr.svm_port) {
>+ rcu_read_unlock();
> return sk_vsock(vsk);
> }
>+ rcu_read_unlock();
> }
>
> return NULL;
>@@ -270,14 +409,25 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)
> spin_unlock_bh(&vsock_table_lock);
> }
>
>-void vsock_insert_connected(struct vsock_sock *vsk)
>+int vsock_insert_connected(struct vsock_sock *vsk)
> {
>- struct list_head *list = vsock_connected_sockets(
>- &vsk->remote_addr, &vsk->local_addr);
>+ struct list_head *list;
>+ struct sockaddr_vm_rcu *remote_addr;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (!remote_addr) {
>+ rcu_read_unlock();
>+ return -EINVAL;
>+ }
>+ list = vsock_connected_sockets(&remote_addr->addr, &vsk->local_addr);
>+ rcu_read_unlock();
>
> spin_lock_bh(&vsock_table_lock);
> __vsock_insert_connected(list, vsk);
> spin_unlock_bh(&vsock_table_lock);
>+
>+ return 0;
> }
> EXPORT_SYMBOL_GPL(vsock_insert_connected);
>
>@@ -438,10 +588,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> {
> const struct vsock_transport *new_transport;
> struct sock *sk = sk_vsock(vsk);
>- unsigned int remote_cid = vsk->remote_addr.svm_cid;
>+ struct sockaddr_vm remote_addr;
>+ unsigned int remote_cid;
> __u8 remote_flags;
> int ret;
>
>+ ret = vsock_remote_addr_copy(vsk, &remote_addr);
>+ if (ret < 0)
>+ return ret;
>+
>+ remote_cid = remote_addr.svm_cid;
>+
> /* If the packet is coming with the source and destination CIDs higher
> * than VMADDR_CID_HOST, then a vsock channel where all the packets are
> * forwarded to the host should be established. Then the host will
>@@ -451,10 +608,15 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> * the connect path the flag can be set by the user space application.
> */
> if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST &&
>- vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
>- vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST;
>+ remote_addr.svm_cid > VMADDR_CID_HOST) {
>+ remote_addr.svm_flags |= VMADDR_CID_HOST;
>+
>+ ret = vsock_remote_addr_update(vsk, &remote_addr);
>+ if (ret < 0)
>+ return ret;
>+ }
>
>- remote_flags = vsk->remote_addr.svm_flags;
>+ remote_flags = remote_addr.svm_flags;
>
> switch (sk->sk_type) {
> case SOCK_DGRAM:
>@@ -742,6 +904,7 @@ static struct sock *__vsock_create(struct net *net,
> unsigned short type,
> int kern)
> {
>+ struct sockaddr_vm *remote_addr;
> struct sock *sk;
> struct vsock_sock *psk;
> struct vsock_sock *vsk;
>@@ -761,7 +924,14 @@ static struct sock *__vsock_create(struct net *net,
>
> vsk = vsock_sk(sk);
> vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>- vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>+
>+ remote_addr = kmalloc(sizeof(*remote_addr), GFP_KERNEL);
>+ if (!remote_addr) {
>+ sk_free(sk);
>+ return NULL;
>+ }
>+ vsock_addr_init(remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>+ rcu_assign_pointer(vsk->remote_addr, remote_addr);
>
> sk->sk_destruct = vsock_sk_destruct;
> sk->sk_backlog_rcv = vsock_queue_rcv_skb;
>@@ -845,6 +1015,7 @@ static void __vsock_release(struct sock *sk, int level)
> static void vsock_sk_destruct(struct sock *sk)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
>+ struct sockaddr_vm_rcu *remote_addr;
>
> vsock_deassign_transport(vsk);
>
>@@ -852,8 +1023,8 @@ static void vsock_sk_destruct(struct sock *sk)
> * possibly register the address family with the kernel.
> */
> vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>- vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>-
>+ remote_addr = rcu_replace_pointer(vsk->remote_addr, NULL, 1);
>+ kfree_rcu(remote_addr);
> put_cred(vsk->owner);
> }
>
>@@ -943,6 +1114,7 @@ static int vsock_getname(struct socket *sock,
> struct sock *sk;
> struct vsock_sock *vsk;
> struct sockaddr_vm *vm_addr;
>+ struct sockaddr_vm_rcu *rcu_ptr;
>
> sk = sock->sk;
> vsk = vsock_sk(sk);
>@@ -951,11 +1123,17 @@ static int vsock_getname(struct socket *sock,
> lock_sock(sk);
>
> if (peer) {
>+ rcu_read_lock();
> if (sock->state != SS_CONNECTED) {
> err = -ENOTCONN;
> goto out;
> }
>- vm_addr = &vsk->remote_addr;
>+ rcu_ptr = rcu_dereference(vsk->remote_addr);
>+ if (!rcu_ptr) {
>+ err = -EINVAL;
>+ goto out;
>+ }
>+ vm_addr = &rcu_ptr->addr;
> } else {
> vm_addr = &vsk->local_addr;
> }
>@@ -975,6 +1153,8 @@ static int vsock_getname(struct socket *sock,
> err = sizeof(*vm_addr);
>
> out:
>+ if (peer)
>+ rcu_read_unlock();
> release_sock(sk);
> return err;
> }
>@@ -1161,7 +1341,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> int err;
> struct sock *sk;
> struct vsock_sock *vsk;
>- struct sockaddr_vm *remote_addr;
>+ struct sockaddr_vm stack_addr, *remote_addr;
> const struct vsock_transport *transport;
>
> if (msg->msg_flags & MSG_OOB)
>@@ -1172,15 +1352,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> sk = sock->sk;
> vsk = vsock_sk(sk);
>
>- lock_sock(sk);
>+ /* If auto-binding is required, acquire the slock to avoid potential
>+ * race conditions. Otherwise, do not acquire the lock.
>+ *
>+ * We know that the first check of local_addr is racy (indicated by
>+ * data_race()). By acquiring the lock and then subsequently checking
>+ * again if local_addr is bound (inside vsock_auto_bind()), we can
>+ * ensure there are no real data races.
>+ *
>+ * This technique is borrowed by inet_send_prepare().
>+ */
>+ if (data_race(!vsock_addr_bound(&vsk->local_addr))) {
>+ lock_sock(sk);
>+ err = vsock_auto_bind(vsk);
>+ release_sock(sk);
>+ if (err)
>+ return err;
>+ }
>
> transport = vsk->transport;
>
>- err = vsock_auto_bind(vsk);
>- if (err)
>- goto out;
>-
>-
> /* If the provided message contains an address, use that. Otherwise
> * fall back on the socket's remote handle (if it has been connected).
> */
>@@ -1199,18 +1390,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> goto out;
> }
> } else if (sock->state == SS_CONNECTED) {
>- remote_addr = &vsk->remote_addr;
>+ err = vsock_remote_addr_copy(vsk, &stack_addr);
>+ if (err < 0)
>+ goto out;
>
>- if (remote_addr->svm_cid == VMADDR_CID_ANY)
>- remote_addr->svm_cid = transport->get_local_cid();
>+ if (stack_addr.svm_cid == VMADDR_CID_ANY) {
>+ stack_addr.svm_cid = transport->get_local_cid();
>+ lock_sock(sk_vsock(vsk));
>+ vsock_remote_addr_update(vsk, &stack_addr);
>+ release_sock(sk_vsock(vsk));
>+ }
>
> /* XXX Should connect() or this function ensure remote_addr is
> * bound?
> */
>- if (!vsock_addr_bound(&vsk->remote_addr)) {
>+ if (!vsock_addr_bound(&stack_addr)) {
> err = -EINVAL;
> goto out;
> }
>+
>+ remote_addr = &stack_addr;
> } else {
> err = -EINVAL;
> goto out;
>@@ -1225,7 +1424,6 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
>
> out:
>- release_sock(sk);
> return err;
> }
>
>@@ -1243,8 +1441,7 @@ static int vsock_dgram_connect(struct socket *sock,
> err = vsock_addr_cast(addr, addr_len, &remote_addr);
> if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
> lock_sock(sk);
>- vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
>- VMADDR_PORT_ANY);
>+ vsock_remote_addr_update_cid_port(vsk, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> sock->state = SS_UNCONNECTED;
> release_sock(sk);
> return 0;
>@@ -1263,7 +1460,10 @@ static int vsock_dgram_connect(struct socket *sock,
> goto out;
> }
>
>- memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
>+ err = vsock_remote_addr_update(vsk, remote_addr);
>+ if (err < 0)
>+ goto out;
>+
> sock->state = SS_CONNECTED;
>
> /* sock map disallows redirection of non-TCP sockets with sk_state !=
>@@ -1399,8 +1599,9 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
> }
>
> /* Set the remote address that we are connecting to. */
>- memcpy(&vsk->remote_addr, remote_addr,
>- sizeof(vsk->remote_addr));
>+ err = vsock_remote_addr_update(vsk, remote_addr);
>+ if (err)
>+ goto out;
>
> err = vsock_assign_transport(vsk, NULL);
> if (err)
>@@ -1831,7 +2032,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> goto out;
> }
>
>- if (!vsock_addr_bound(&vsk->remote_addr)) {
>+ if (!vsock_remote_addr_bound(vsk)) {
> err = -EDESTADDRREQ;
> goto out;
> }
>diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
>index a2823b1c5e28..f843bae86b32 100644
>--- a/net/vmw_vsock/diag.c
>+++ b/net/vmw_vsock/diag.c
>@@ -15,8 +15,14 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> u32 portid, u32 seq, u32 flags)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
>+ struct sockaddr_vm remote_addr;
> struct vsock_diag_msg *rep;
> struct nlmsghdr *nlh;
>+ int err;
>+
>+ err = vsock_remote_addr_copy(vsk, &remote_addr);
>+ if (err < 0)
>+ return err;
>
> nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
> flags);
>@@ -36,8 +42,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> rep->vdiag_shutdown = sk->sk_shutdown;
> rep->vdiag_src_cid = vsk->local_addr.svm_cid;
> rep->vdiag_src_port = vsk->local_addr.svm_port;
>- rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
>- rep->vdiag_dst_port = vsk->remote_addr.svm_port;
>+ rep->vdiag_dst_cid = remote_addr.svm_cid;
>+ rep->vdiag_dst_port = remote_addr.svm_port;
> rep->vdiag_ino = sock_i_ino(sk);
>
> sock_diag_save_cookie(sk, rep->vdiag_cookie);
>diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
>index 7cb1a9d2cdb4..462b2ec3e6e9 100644
>--- a/net/vmw_vsock/hyperv_transport.c
>+++ b/net/vmw_vsock/hyperv_transport.c
>@@ -336,9 +336,11 @@ static void hvs_open_connection(struct vmbus_channel *chan)
> hvs_addr_init(&vnew->local_addr, if_type);
>
> /* Remote peer is always the host */
>- vsock_addr_init(&vnew->remote_addr,
>- VMADDR_CID_HOST, VMADDR_PORT_ANY);
>- vnew->remote_addr.svm_port = get_port_by_srv_id(if_instance);
>+ ret = vsock_remote_addr_update_cid_port(vnew, VMADDR_CID_HOST,
>+ get_port_by_srv_id(if_instance));
>+ if (ret < 0)
>+ goto out;
>+
> ret = vsock_assign_transport(vnew, vsock_sk(sk));
> /* Transport assigned (looking at remote_addr) must be the
> * same where we received the request.
>@@ -459,13 +461,18 @@ static int hvs_connect(struct vsock_sock *vsk)
> {
> union hvs_service_id vm, host;
> struct hvsock *h = vsk->trans;
>+ int err;
>
> vm.srv_id = srv_id_template;
> vm.svm_port = vsk->local_addr.svm_port;
> h->vm_srv_id = vm.srv_id;
>
> host.srv_id = srv_id_template;
>- host.svm_port = vsk->remote_addr.svm_port;
>+
>+ err = vsock_remote_addr_port(vsk, &host.svm_port);
>+ if (err < 0)
>+ return err;
>+
> h->host_srv_id = host.srv_id;
>
> return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 925acface893..1b87704e516a 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -258,8 +258,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> src_cid = t_ops->transport.get_local_cid();
> src_port = vsk->local_addr.svm_port;
> if (!info->remote_cid) {
>- dst_cid = vsk->remote_addr.svm_cid;
>- dst_port = vsk->remote_addr.svm_port;
>+ ret = vsock_remote_addr_cid_port(vsk, &dst_cid, &dst_port);
>+ if (ret < 0)
>+ return ret;
> } else {
> dst_cid = info->remote_cid;
> dst_port = info->remote_port;
>@@ -1169,7 +1170,9 @@ virtio_transport_recv_connecting(struct sock *sk,
> case VIRTIO_VSOCK_OP_RESPONSE:
> sk->sk_state = TCP_ESTABLISHED;
> sk->sk_socket->state = SS_CONNECTED;
>- vsock_insert_connected(vsk);
>+ err = vsock_insert_connected(vsk);
>+ if (err)
>+ goto destroy;
> sk->sk_state_change(sk);
> break;
> case VIRTIO_VSOCK_OP_INVALID:
>@@ -1403,9 +1406,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> vchild = vsock_sk(child);
> vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
> le32_to_cpu(hdr->dst_port));
>- vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
>- le32_to_cpu(hdr->src_port));
>-
>+ vsock_remote_addr_update_cid_port(vchild, le64_to_cpu(hdr->src_cid),
>+ le32_to_cpu(hdr->src_port));
> ret = vsock_assign_transport(vchild, vsk);
> /* Transport assigned (looking at remote_addr) must be the same
> * where we received the request.
>@@ -1420,7 +1422,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> if (virtio_transport_space_update(child, skb))
> child->sk_write_space(child);
>
>- vsock_insert_connected(vchild);
>+ ret = vsock_insert_connected(vchild);
>+ if (ret) {
>+ release_sock(child);
>+ virtio_transport_reset_no_sock(t, skb);
>+ sock_put(child);
>+ return ret;
>+ }
> vsock_enqueue_accept(sk, child);
> virtio_transport_send_response(vchild, skb);
>
>diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>index b370070194fa..c0c445e7d925 100644
>--- a/net/vmw_vsock/vmci_transport.c
>+++ b/net/vmw_vsock/vmci_transport.c
>@@ -283,18 +283,25 @@ vmci_transport_send_control_pkt(struct sock *sk,
> u16 proto,
> struct vmci_handle handle)
> {
>+ struct sockaddr_vm addr_stack;
>+ struct sockaddr_vm *remote_addr = &addr_stack;
> struct vsock_sock *vsk;
>+ int err;
>
> vsk = vsock_sk(sk);
>
> if (!vsock_addr_bound(&vsk->local_addr))
> return -EINVAL;
>
>- if (!vsock_addr_bound(&vsk->remote_addr))
>+ if (!vsock_remote_addr_bound(vsk))
> return -EINVAL;
>
>+ err = vsock_remote_addr_copy(vsk, &addr_stack);
>+ if (err < 0)
>+ return err;
>+
> return vmci_transport_alloc_send_control_pkt(&vsk->local_addr,
>- &vsk->remote_addr,
>+ remote_addr,
> type, size, mode,
> wait, proto, handle);
> }
>@@ -317,6 +324,7 @@ static int vmci_transport_send_reset(struct sock *sk,
> struct sockaddr_vm *dst_ptr;
> struct sockaddr_vm dst;
> struct vsock_sock *vsk;
>+ int err;
>
> if (pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST)
> return 0;
>@@ -326,13 +334,16 @@ static int vmci_transport_send_reset(struct sock *sk,
> if (!vsock_addr_bound(&vsk->local_addr))
> return -EINVAL;
>
>- if (vsock_addr_bound(&vsk->remote_addr)) {
>- dst_ptr = &vsk->remote_addr;
>+ if (vsock_remote_addr_bound(vsk)) {
>+ err = vsock_remote_addr_copy(vsk, &dst);
>+ if (err < 0)
>+ return err;
> } else {
> vsock_addr_init(&dst, pkt->dg.src.context,
> pkt->src_port);
>- dst_ptr = &dst;
> }
>+ dst_ptr = &dst;
>+
> return vmci_transport_alloc_send_control_pkt(&vsk->local_addr, dst_ptr,
> VMCI_TRANSPORT_PACKET_TYPE_RST,
> 0, 0, NULL, VSOCK_PROTO_INVALID,
>@@ -490,7 +501,7 @@ static struct sock *vmci_transport_get_pending(
>
> list_for_each_entry(vpending, &vlistener->pending_links,
> pending_links) {
>- if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
>+ if (vsock_remote_addr_equals(vpending, &src) &&
> pkt->dst_port == vpending->local_addr.svm_port) {
> pending = sk_vsock(vpending);
> sock_hold(pending);
>@@ -1015,8 +1026,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
>
> vsock_addr_init(&vpending->local_addr, pkt->dg.dst.context,
> pkt->dst_port);
>- vsock_addr_init(&vpending->remote_addr, pkt->dg.src.context,
>- pkt->src_port);
>+ vsock_remote_addr_update_cid_port(vpending, pkt->dg.src.context,
>+ pkt->src_port);
>
> err = vsock_assign_transport(vpending, vsock_sk(sk));
> /* Transport assigned (looking at remote_addr) must be the same
>@@ -1133,6 +1144,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> {
> struct vsock_sock *vpending;
> struct vmci_handle handle;
>+ unsigned int vpending_remote_cid;
> struct vmci_qp *qpair;
> bool is_local;
> u32 flags;
>@@ -1189,8 +1201,13 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> /* vpending->local_addr always has a context id so we do not need to
> * worry about VMADDR_CID_ANY in this case.
> */
>- is_local =
>- vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
>+ err = vsock_remote_addr_cid(vpending, &vpending_remote_cid);
>+ if (err < 0) {
>+ skerr = EPROTO;
>+ goto destroy;
>+ }
>+
>+ is_local = vpending_remote_cid == vpending->local_addr.svm_cid;
> flags = VMCI_QPFLAG_ATTACH_ONLY;
> flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
>
>@@ -1203,7 +1220,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> flags,
> vmci_transport_is_trusted(
> vpending,
>- vpending->remote_addr.svm_cid));
>+ vpending_remote_cid));
> if (err < 0) {
> vmci_transport_send_reset(pending, pkt);
> skerr = -err;
>@@ -1306,9 +1323,20 @@ vmci_transport_recv_connecting_client(struct sock *sk,
> break;
> case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE:
> case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE2:
>+ struct sockaddr_vm_rcu *remote_addr;
>+
>+ rcu_read_lock();
>+ remote_addr = rcu_dereference(vsk->remote_addr);
>+ if (!remote_addr) {
>+ skerr = EPROTO;
>+ err = -EINVAL;
>+ rcu_read_unlock();
>+ goto destroy;
>+ }
>+
> if (pkt->u.size == 0
>- || pkt->dg.src.context != vsk->remote_addr.svm_cid
>- || pkt->src_port != vsk->remote_addr.svm_port
>+ || pkt->dg.src.context != remote_addr->addr.svm_cid
>+ || pkt->src_port != remote_addr->addr.svm_port
> || !vmci_handle_is_invalid(vmci_trans(vsk)->qp_handle)
> || vmci_trans(vsk)->qpair
> || vmci_trans(vsk)->produce_size != 0
>@@ -1316,9 +1344,10 @@ vmci_transport_recv_connecting_client(struct sock *sk,
> || vmci_trans(vsk)->detach_sub_id != VMCI_INVALID_ID) {
> skerr = EPROTO;
> err = -EINVAL;
>-
>+ rcu_read_unlock();
> goto destroy;
> }
>+ rcu_read_unlock();
>
> err = vmci_transport_recv_connecting_client_negotiate(sk, pkt);
> if (err) {
>@@ -1379,6 +1408,7 @@ static int vmci_transport_recv_connecting_client_negotiate(
> int err;
> struct vsock_sock *vsk;
> struct vmci_handle handle;
>+ unsigned int remote_cid;
> struct vmci_qp *qpair;
> u32 detach_sub_id;
> bool is_local;
>@@ -1449,19 +1479,23 @@ static int vmci_transport_recv_connecting_client_negotiate(
>
> /* Make VMCI select the handle for us. */
> handle = VMCI_INVALID_HANDLE;
>- is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
>+
>+ err = vsock_remote_addr_cid(vsk, &remote_cid);
>+ if (err < 0)
>+ goto destroy;
>+
>+ is_local = remote_cid == vsk->local_addr.svm_cid;
> flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
>
> err = vmci_transport_queue_pair_alloc(&qpair,
> &handle,
> pkt->u.size,
> pkt->u.size,
>- vsk->remote_addr.svm_cid,
>+ remote_cid,
> flags,
> vmci_transport_is_trusted(
> vsk,
>- vsk->
>- remote_addr.svm_cid));
>+ remote_cid));
> if (err < 0)
> goto destroy;
>
>
>--
>2.30.2
>
On Fri, Apr 14, 2023 at 12:25:58AM +0000, Bobby Eshleman wrote:
>This commit adds a feature bit for virtio vsock to support datagrams.
>This commit should not be applied without first applying the commit
>that implements datagrams for virtio.
>
>Signed-off-by: Jiang Wang <[email protected]>
>Signed-off-by: Bobby Eshleman <[email protected]>
>---
> drivers/vhost/vsock.c | 3 ++-
> include/uapi/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport.c | 8 ++++++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index dff6ee1c479b..028cf079225e 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -32,7 +32,8 @@
> enum {
> VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
>- (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
>+ (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
>+ (1ULL << VIRTIO_VSOCK_F_DGRAM)
> };
>
> enum {
>diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>index 331be28b1d30..0975b9c88292 100644
>--- a/include/uapi/linux/virtio_vsock.h
>+++ b/include/uapi/linux/virtio_vsock.h
>@@ -40,6 +40,7 @@
>
> /* The feature bitmap for virtio vsock */
> #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
>+#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
>
> struct virtio_vsock_config {
> __le64 guest_cid;
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index 582c6c0f788f..bb43eea9a6f9 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -29,6 +29,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
> struct virtio_vsock {
> struct virtio_device *vdev;
> struct virtqueue *vqs[VSOCK_VQ_MAX];
>+ bool has_dgram;
>
> /* Virtqueue processing is deferred to a workqueue */
> struct work_struct tx_work;
>@@ -640,7 +641,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> }
>
> vsock->vdev = vdev;
>-
> vsock->rx_buf_nr = 0;
> vsock->rx_buf_max_nr = 0;
> atomic_set(&vsock->queued_replies, 0);
>@@ -657,6 +657,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> vsock->seqpacket_allow = true;
>
>+ if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
>+ vsock->has_dgram = true;
This is unused for now, but I think the idea is to use in
virtio_transport_dgram_allow(), right?
I would follow `seqpacket_allow` in this case.
Thanks,
Stefano
>+
> vdev->priv = vsock;
>
> ret = virtio_vsock_vqs_init(vsock);
>@@ -749,7 +752,8 @@ static struct virtio_device_id id_table[] = {
> };
>
> static unsigned int features[] = {
>- VIRTIO_VSOCK_F_SEQPACKET
>+ VIRTIO_VSOCK_F_SEQPACKET,
>+ VIRTIO_VSOCK_F_DGRAM
> };
>
> static struct virtio_driver virtio_vsock_driver = {
>
>--
>2.30.2
>
Hi Bobby,
On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
>CC'ing Cong.
>
>On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
>> Hey all!
>>
>> This series introduces support for datagrams to virtio/vsock.
Great! Thanks for restarting this work!
>>
>> It is a spin-off (and smaller version) of this series from the summer:
>> https://lore.kernel.org/all/[email protected]/
>>
>> Please note that this is an RFC and should not be merged until
>> associated changes are made to the virtio specification, which will
>> follow after discussion from this series.
>>
>> This series first supports datagrams in a basic form for virtio, and
>> then optimizes the sendpath for all transports.
>>
>> The result is a very fast datagram communication protocol that
>> outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
>> of multi-threaded workload samples.
>>
>> For those that are curious, some summary data comparing UDP and VSOCK
>> DGRAM (N=5):
>>
>> vCPUS: 16
>> virtio-net queues: 16
>> payload size: 4KB
>> Setup: bare metal + vm (non-nested)
>>
>> UDP: 287.59 MB/s
>> VSOCK DGRAM: 509.2 MB/s
>>
>> Some notes about the implementation...
>>
>> This datagram implementation forces datagrams to self-throttle according
>> to the threshold set by sk_sndbuf. It behaves similar to the credits
>> used by streams in its effect on throughput and memory consumption, but
>> it is not influenced by the receiving socket as credits are.
So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right?
We should check if VMCI behaves the same.
>>
>> The device drops packets silently. There is room for improvement by
>> building into the device and driver some intelligence around how to
>> reduce frequency of kicking the virtqueue when packet loss is high. I
>> think there is a good discussion to be had on this.
Can you elaborate a bit here?
Do you mean some mechanism to report to the sender that a destination
(cid, port) is full so the packet will be dropped?
Can we adapt the credit mechanism?
>>
>> In this series I am also proposing that fairness be reexamined as an
>> issue separate from datagrams, which differs from my previous series
>> that coupled these issues. After further testing and reflection on the
>> design, I do not believe that these need to be coupled and I do not
>> believe this implementation introduces additional unfairness or
>> exacerbates pre-existing unfairness.
I see.
>>
>> I attempted to characterize vsock fairness by using a pool of processes
>> to stress test the shared resources while measuring the performance of a
>> lone stream socket. Given unfair preference for datagrams, we would
>> assume that a lone stream socket would degrade much more when a pool of
>> datagram sockets was stressing the system than when a pool of stream
>> sockets are stressing the system. The result, however, showed no
>> significant difference between the degradation of throughput of the lone
>> stream socket when using a pool of datagrams to stress the queue over
>> using a pool of streams. The absolute difference in throughput actually
>> favored datagrams as interfering least as the mean difference was +16%
>> compared to using streams to stress test (N=7), but it was not
>> statistically significant. Workloads were matched for payload size and
>> buffer size (to approximate memory consumption) and process count, and
>> stress workloads were configured to start before and last long after the
>> lifetime of the "lone" stream socket flow to ensure that competing flows
>> were continuously hot.
>>
>> Given the above data, I propose that vsock fairness be addressed
>> independent of datagrams and to defer its implementation to a future
>> series.
Makes sense to me.
I left some preliminary comments, anyway now it seems reasonable to use
the same virtqueues, so we can go head with the spec proposal.
Thanks,
Stefano
CC'ing [email protected] because this thread is starting
to touch the spec.
On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote:
> Hi Bobby,
>
> On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
> > CC'ing Cong.
> >
> > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
> > > Hey all!
> > >
> > > This series introduces support for datagrams to virtio/vsock.
>
> Great! Thanks for restarting this work!
>
No problem!
> > >
> > > It is a spin-off (and smaller version) of this series from the summer:
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > Please note that this is an RFC and should not be merged until
> > > associated changes are made to the virtio specification, which will
> > > follow after discussion from this series.
> > >
> > > This series first supports datagrams in a basic form for virtio, and
> > > then optimizes the sendpath for all transports.
> > >
> > > The result is a very fast datagram communication protocol that
> > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
> > > of multi-threaded workload samples.
> > >
> > > For those that are curious, some summary data comparing UDP and VSOCK
> > > DGRAM (N=5):
> > >
> > > vCPUS: 16
> > > virtio-net queues: 16
> > > payload size: 4KB
> > > Setup: bare metal + vm (non-nested)
> > >
> > > UDP: 287.59 MB/s
> > > VSOCK DGRAM: 509.2 MB/s
> > >
> > > Some notes about the implementation...
> > >
> > > This datagram implementation forces datagrams to self-throttle according
> > > to the threshold set by sk_sndbuf. It behaves similar to the credits
> > > used by streams in its effect on throughput and memory consumption, but
> > > it is not influenced by the receiving socket as credits are.
>
> So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right?
>
Correct.
> We should check if VMCI behaves the same.
>
> > >
> > > The device drops packets silently. There is room for improvement by
> > > building into the device and driver some intelligence around how to
> > > reduce frequency of kicking the virtqueue when packet loss is high. I
> > > think there is a good discussion to be had on this.
>
> Can you elaborate a bit here?
>
> Do you mean some mechanism to report to the sender that a destination
> (cid, port) is full so the packet will be dropped?
>
Correct. There is also the case of there being no receiver at all for
this address since this case isn't rejected upon connect(). Ideally,
such a socket (which will have 100% packet loss) will be throttled
aggressively.
Before we go down too far on this path, I also want to clarify that
using UDP over vhost/virtio-net also has this property... this can be
observed by using tcpdump to dump the UDP packets on the bridge network
your VM is using. UDP packets sent to a garbage address can be seen on
the host bridge (this is the nature of UDP, how is the host supposed to
know the address eventually goes nowhere). I mention the above because I
think it is possible for vsock to avoid this cost, given that it
benefits from being point-to-point and g2h/h2g.
If we're okay with vsock being on par, then the current series does
that. I propose something below that can be added later and maybe
negotiated as a feature bit too.
> Can we adapt the credit mechanism?
>
I've thought about this a lot because the attraction of the approach for
me would be that we could get the wait/buffer-limiting logic for free
and without big changes to the protocol, but the problem is that the
unreliable nature of datagrams means that the source's free-running
tx_cnt will become out-of-sync with the destination's fwd_cnt upon
packet loss.
Imagine a source that initializes and starts sending packets before a
destination socket even is created, the source's self-throttling will be
dysfunctional because its tx_cnt will always far exceed the
destination's fwd_cnt.
We could play tricks with the meaning of the CREDIT_UPDATE message and
fwd_cnt/buf_alloc fields, but I don't think we want to go down that
path.
I think that the best and simplest approach introduces a congestion
notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the
destination sends this notification. At a given repeated time period T,
the source can check if it has received any notifications in the last T.
If so, it halves its buffer allocation. If not, it doubles its buffer
allocation unless it is already at its max or original value.
An "invalid" socket which never has any receiver will converge towards a
rate limit of one packet per time T * log2(average pkt size). That is, a
socket with 100% packet loss will only be able to send 16 bytes every
4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within
160ms given at least one packet sent per 5ms. I have no idea if that is
a reasonable default T for vsock, I just pulled it out of a hat for the
sake of the example.
"Normal" sockets will be responsive to high loss and rebalance during
low loss. The source is trying to guess and converge on the actual
buffer state of the destination.
This would reuse the already-existing throttling mechanisms that
throttle based upon buffer allocation. The usage of sk_sndbuf would have
to be re-worked. The application using sendmsg() will see EAGAIN when
throttled, or just sleep if !MSG_DONTWAIT.
I looked at alternative schemes (like the Datagram Congestion Control
Protocol), but I do not think the added complexity is necessary in the
case of vsock (DCCP requires congestion windows, sequence numbers, batch
acknowledgements, etc...). I also looked at UDP-based application
protocols like TFTP, DHCP, and SIP over UDP which use a delay-based
backoff mechanism, but seem to require acknowledgement for those packet
types, which trigger the retries and backoffs. I think we can get away
with the simpler approach and not have to potentially kill performance
with per-packet acknowledgements.
> > >
> > > In this series I am also proposing that fairness be reexamined as an
> > > issue separate from datagrams, which differs from my previous series
> > > that coupled these issues. After further testing and reflection on the
> > > design, I do not believe that these need to be coupled and I do not
> > > believe this implementation introduces additional unfairness or
> > > exacerbates pre-existing unfairness.
>
> I see.
>
> > >
> > > I attempted to characterize vsock fairness by using a pool of processes
> > > to stress test the shared resources while measuring the performance of a
> > > lone stream socket. Given unfair preference for datagrams, we would
> > > assume that a lone stream socket would degrade much more when a pool of
> > > datagram sockets was stressing the system than when a pool of stream
> > > sockets are stressing the system. The result, however, showed no
> > > significant difference between the degradation of throughput of the lone
> > > stream socket when using a pool of datagrams to stress the queue over
> > > using a pool of streams. The absolute difference in throughput actually
> > > favored datagrams as interfering least as the mean difference was +16%
> > > compared to using streams to stress test (N=7), but it was not
> > > statistically significant. Workloads were matched for payload size and
> > > buffer size (to approximate memory consumption) and process count, and
> > > stress workloads were configured to start before and last long after the
> > > lifetime of the "lone" stream socket flow to ensure that competing flows
> > > were continuously hot.
> > >
> > > Given the above data, I propose that vsock fairness be addressed
> > > independent of datagrams and to defer its implementation to a future
> > > series.
>
> Makes sense to me.
>
> I left some preliminary comments, anyway now it seems reasonable to use
> the same virtqueues, so we can go head with the spec proposal.
>
> Thanks,
> Stefano
>
Thanks for the review!
Best,
Bobby
On Wed, Apr 19, 2023 at 11:30:21AM +0200, Stefano Garzarella wrote:
> On Fri, Apr 14, 2023 at 12:25:58AM +0000, Bobby Eshleman wrote:
> > This commit adds a feature bit for virtio vsock to support datagrams.
> > This commit should not be applied without first applying the commit
> > that implements datagrams for virtio.
> >
> > Signed-off-by: Jiang Wang <[email protected]>
> > Signed-off-by: Bobby Eshleman <[email protected]>
> > ---
> > drivers/vhost/vsock.c | 3 ++-
> > include/uapi/linux/virtio_vsock.h | 1 +
> > net/vmw_vsock/virtio_transport.c | 8 ++++++--
> > 3 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index dff6ee1c479b..028cf079225e 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -32,7 +32,8 @@
> > enum {
> > VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> > (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> > - (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> > + (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> > + (1ULL << VIRTIO_VSOCK_F_DGRAM)
> > };
> >
> > enum {
> > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> > index 331be28b1d30..0975b9c88292 100644
> > --- a/include/uapi/linux/virtio_vsock.h
> > +++ b/include/uapi/linux/virtio_vsock.h
> > @@ -40,6 +40,7 @@
> >
> > /* The feature bitmap for virtio vsock */
> > #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
> > +#define VIRTIO_VSOCK_F_DGRAM 2 /* Host support dgram vsock */
> >
> > struct virtio_vsock_config {
> > __le64 guest_cid;
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index 582c6c0f788f..bb43eea9a6f9 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -29,6 +29,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
> > struct virtio_vsock {
> > struct virtio_device *vdev;
> > struct virtqueue *vqs[VSOCK_VQ_MAX];
> > + bool has_dgram;
> >
> > /* Virtqueue processing is deferred to a workqueue */
> > struct work_struct tx_work;
> > @@ -640,7 +641,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> > }
> >
> > vsock->vdev = vdev;
> > -
> > vsock->rx_buf_nr = 0;
> > vsock->rx_buf_max_nr = 0;
> > atomic_set(&vsock->queued_replies, 0);
> > @@ -657,6 +657,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> > if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> > vsock->seqpacket_allow = true;
> >
> > + if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> > + vsock->has_dgram = true;
>
> This is unused for now, but I think the idea is to use in
> virtio_transport_dgram_allow(), right?
>
> I would follow `seqpacket_allow` in this case.
>
Got it, thanks.
> Thanks,
> Stefano
>
> > +
> > vdev->priv = vsock;
> >
> > ret = virtio_vsock_vqs_init(vsock);
> > @@ -749,7 +752,8 @@ static struct virtio_device_id id_table[] = {
> > };
> >
> > static unsigned int features[] = {
> > - VIRTIO_VSOCK_F_SEQPACKET
> > + VIRTIO_VSOCK_F_SEQPACKET,
> > + VIRTIO_VSOCK_F_DGRAM
> > };
> >
> > static struct virtio_driver virtio_vsock_driver = {
> >
> > --
> > 2.30.2
> >
>
> _______________________________________________
> Virtualization mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
On Wed, Apr 19, 2023 at 11:30:53AM +0200, Stefano Garzarella wrote:
> On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
> > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> > it does not scale when many senders share a socket.
> >
> > Prior to this patch the socket lock is used to protect the local_addr,
> > remote_addr, transport, and buffer size variables. What follows are the
> > new protection schemes for the various protected fields that ensure a
> > race-free multi-sender sendmsg() path for vsock dgrams.
> >
> > - local_addr
> > local_addr changes as a result of binding a socket. The write path
> > for local_addr is bind() and various vsock_auto_bind() call sites.
> > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> > rejects the user request and vsock_auto_bind() early exits.
> > Therefore, the local addr can not change while a parallel thread is
> > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> > Change: only acquire lock for auto-binding as-needed in sendmsg().
> >
> > - vsk->transport
> > Updated upon socket creation and it doesn't change again until the
>
> This is true only for dgram, right?
>
Yes.
> How do we decide which transport to assign for dgram?
>
The transport is assigned in proto->create() [vsock_create()]. It is
assigned there *only* for dgrams, whereas for streams/seqpackets it is
assigned in connect(). vsock_create() sets transport to
'transport_dgram' if sock->type == SOCK_DGRAM.
vsock_sk_destruct() then eventually sets vsk->transport to NULL.
Neither destruct nor create can occur in parallel with sendmsg().
create() hasn't yet returned the sockfd for the user to call upon it
sendmsg(), and AFAICT destruct is only called after the final socket
reference is released, which only happens after the socket no longer
exists in the fd lookup table and so sendmsg() will fail before it ever
has the chance to race.
> > socket is destroyed, which only happens after the socket refcnt reaches
> > zero. This prevents any sendmsg() call from being entered because the
> > sockfd lookup fails beforehand. That is, sendmsg() and vsk->transport
> > writes cannot execute in parallel. Additionally, connect() doesn't
> > update vsk->transport for dgrams as it does for streams. Therefore
> > vsk->transport is also safe to access lock-free in the sendmsg() path.
> > No change.
> >
> > - buffer size variables
> > Not used by dgram, so they do not need protection. No change.
>
> Is this true because for dgram we use the socket buffer?
> Is it the same for VMCI?
Yes. The buf_alloc derived from buffer_size is also always ignored after
being initialized once since credits aren't used.
My reading of the VMCI code is that the buffer_size and
buffer_{min,max}_size variables are used only in connectible calls (like
connect and recv_listen), but not for datagrams.
>
> >
> > - remote_addr
> > Needs additional protection because before this patch the
> > remote_addr (consisting of several fields such as cid, port, and flags)
> > only changed atomically under socket lock context. By acquiring the
> > socket lock to read the structure, the changes made by connect() were
> > always made visible to sendmsg() atomically. Consequently, to retain
> > atomicity of updates but offer lock-free access, this patch
> > redesigns this field as an RCU-protected pointer.
> >
> > Writers are still synchronized using the socket lock, but readers
> > only read inside RCU read-side critical sections.
> >
> > Helpers are introduced for accessing and updating the new pointer.
> >
> > The remote_addr structure is wrapped together with an rcu_head into a
> > sockaddr_vm_rcu structure so that kfree_rcu() can be used. This removes
> > the need of writers to use synchronize_rcu() after freeing old structures
> > which is simply more efficient and reduces code churn where remote_addr
> > is already being updated inside read-side sections.
> >
> > Only virtio has been tested, but updates were necessary to the VMCI and
> > hyperv code. Unfortunately the author does not have access to
> > VMCI/hyperv systems so those changes are untested.
> >
> > Perf Tests
> > vCPUS: 16
> > Threads: 16
> > Payload: 4KB
> > Test Runs: 5
> > Type: SOCK_DGRAM
> >
> > Before: 245.2 MB/s
> > After: 509.2 MB/s (+107%)
> >
> > Notably, on the same test system, vsock dgram even outperforms
> > multi-threaded UDP over virtio-net with vhost and MQ support enabled.
>
> Cool!
>
> This patch is quite large, so I need to review it carefully in future
> versions, but in general it makes sense to me.
>
> Thanks,
> Stefano
>
Thanks for the initial comments!
Best,
Bobby
> >
> > Throughput metrics for single-threaded SOCK_DGRAM and
> > single/multi-threaded SOCK_STREAM showed no statistically signficant
> > throughput changes (lowest p-value reaching 0.27), with the range of the
> > mean difference ranging between -5% to +1%.
> >
> > Signed-off-by: Bobby Eshleman <[email protected]>
> > ---
> > drivers/vhost/vsock.c | 12 +-
> > include/net/af_vsock.h | 19 ++-
> > net/vmw_vsock/af_vsock.c | 261 ++++++++++++++++++++++++++++----
> > net/vmw_vsock/diag.c | 10 +-
> > net/vmw_vsock/hyperv_transport.c | 15 +-
> > net/vmw_vsock/virtio_transport_common.c | 22 ++-
> > net/vmw_vsock/vmci_transport.c | 70 ++++++---
> > 7 files changed, 344 insertions(+), 65 deletions(-)
> >
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index 028cf079225e..da105cb856ac 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -296,13 +296,17 @@ static int
> > vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> > {
> > struct vhost_vsock *vsock;
> > + unsigned int cid;
> > int cnt = 0;
> > int ret = -ENODEV;
> >
> > rcu_read_lock();
> > + ret = vsock_remote_addr_cid(vsk, &cid);
> > + if (ret < 0)
> > + goto out;
> >
> > /* Find the vhost_vsock according to guest context id */
> > - vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
> > + vsock = vhost_vsock_get(cid);
> > if (!vsock)
> > goto out;
> >
> > @@ -686,6 +690,10 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
> > static void vhost_vsock_reset_orphans(struct sock *sk)
> > {
> > struct vsock_sock *vsk = vsock_sk(sk);
> > + unsigned int cid;
> > +
> > + if (vsock_remote_addr_cid(vsk, &cid) < 0)
> > + return;
> >
> > /* vmci_transport.c doesn't take sk_lock here either. At least we're
> > * under vsock_table_lock so the sock cannot disappear while we're
> > @@ -693,7 +701,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
> > */
> >
> > /* If the peer is still valid, no need to reset connection */
> > - if (vhost_vsock_get(vsk->remote_addr.svm_cid))
> > + if (vhost_vsock_get(cid))
> > return;
> >
> > /* If the close timeout is pending, let it expire. This avoids races
> > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > index 57af28fede19..c02fd6ad0047 100644
> > --- a/include/net/af_vsock.h
> > +++ b/include/net/af_vsock.h
> > @@ -25,12 +25,17 @@ extern spinlock_t vsock_table_lock;
> > #define vsock_sk(__sk) ((struct vsock_sock *)__sk)
> > #define sk_vsock(__vsk) (&(__vsk)->sk)
> >
> > +struct sockaddr_vm_rcu {
> > + struct sockaddr_vm addr;
> > + struct rcu_head rcu;
> > +};
> > +
> > struct vsock_sock {
> > /* sk must be the first member. */
> > struct sock sk;
> > const struct vsock_transport *transport;
> > struct sockaddr_vm local_addr;
> > - struct sockaddr_vm remote_addr;
> > + struct sockaddr_vm_rcu * __rcu remote_addr;
> > /* Links for the global tables of bound and connected sockets. */
> > struct list_head bound_table;
> > struct list_head connected_table;
> > @@ -206,7 +211,7 @@ void vsock_release_pending(struct sock *pending);
> > void vsock_add_pending(struct sock *listener, struct sock *pending);
> > void vsock_remove_pending(struct sock *listener, struct sock *pending);
> > void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
> > -void vsock_insert_connected(struct vsock_sock *vsk);
> > +int vsock_insert_connected(struct vsock_sock *vsk);
> > void vsock_remove_bound(struct vsock_sock *vsk);
> > void vsock_remove_connected(struct vsock_sock *vsk);
> > struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
> > @@ -244,4 +249,14 @@ static inline void __init vsock_bpf_build_proto(void)
> > {}
> > #endif
> >
> > +/* RCU-protected remote addr helpers */
> > +int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid);
> > +int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port);
> > +int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
> > + unsigned int *port);
> > +int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest);
> > +bool vsock_remote_addr_bound(struct vsock_sock *vsk);
> > +bool vsock_remote_addr_equals(struct vsock_sock *vsk, struct sockaddr_vm *other);
> > +int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port);
> > +
> > #endif /* __AF_VSOCK_H__ */
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index 46b3f35e3adc..93b4abbf20b4 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -145,6 +145,139 @@ static const struct vsock_transport *transport_local;
> > static DEFINE_MUTEX(vsock_register_mutex);
> >
> > /**** UTILS ****/
> > +bool vsock_remote_addr_bound(struct vsock_sock *vsk)
> > +{
> > + struct sockaddr_vm_rcu *remote_addr;
> > + bool ret;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (!remote_addr) {
> > + rcu_read_unlock();
> > + return false;
> > + }
> > +
> > + ret = vsock_addr_bound(&remote_addr->addr);
> > + rcu_read_unlock();
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_bound);
> > +
> > +int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest)
> > +{
> > + struct sockaddr_vm_rcu *remote_addr;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (!remote_addr) {
> > + rcu_read_unlock();
> > + return -EINVAL;
> > + }
> > + memcpy(dest, &remote_addr->addr, sizeof(*dest));
> > + rcu_read_unlock();
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_copy);
> > +
> > +int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid)
> > +{
> > + return vsock_remote_addr_cid_port(vsk, cid, NULL);
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_cid);
> > +
> > +int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port)
> > +{
> > + return vsock_remote_addr_cid_port(vsk, NULL, port);
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_port);
> > +
> > +int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
> > + unsigned int *port)
> > +{
> > + struct sockaddr_vm_rcu *remote_addr;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (!remote_addr) {
> > + rcu_read_unlock();
> > + return -EINVAL;
> > + }
> > +
> > + if (cid)
> > + *cid = remote_addr->addr.svm_cid;
> > + if (port)
> > + *port = remote_addr->addr.svm_port;
> > +
> > + rcu_read_unlock();
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_cid_port);
> > +
> > +/* The socket lock must be held by the caller */
> > +int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port)
> > +{
> > + struct sockaddr_vm_rcu *old, *new;
> > +
> > + new = kmalloc(sizeof(*new), GFP_KERNEL);
> > + if (!new)
> > + return -ENOMEM;
> > +
> > + rcu_read_lock();
> > + old = rcu_dereference(vsk->remote_addr);
> > + if (!old) {
> > + kfree(new);
> > + return -EINVAL;
> > + }
> > + memcpy(&new->addr, &old->addr, sizeof(new->addr));
> > + rcu_read_unlock();
> > +
> > + new->addr.svm_cid = cid;
> > + new->addr.svm_port = port;
> > +
> > + old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
> > + kfree_rcu(old, rcu);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_update_cid_port);
> > +
> > +/* The socket lock must be held by the caller */
> > +int vsock_remote_addr_update(struct vsock_sock *vsk, struct sockaddr_vm *src)
> > +{
> > + struct sockaddr_vm_rcu *old, *new;
> > +
> > + new = kmalloc(sizeof(*new), GFP_KERNEL);
> > + if (!new)
> > + return -ENOMEM;
> > +
> > + memcpy(&new->addr, src, sizeof(new->addr));
> > + old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
> > + kfree_rcu(old, rcu);
> > +
> > + return 0;
> > +}
> > +
> > +bool vsock_remote_addr_equals(struct vsock_sock *vsk,
> > + struct sockaddr_vm *other)
> > +{
> > + struct sockaddr_vm_rcu *remote_addr;
> > + bool equals;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (!remote_addr) {
> > + rcu_read_unlock();
> > + return false;
> > + }
> > +
> > + equals = vsock_addr_equals_addr(&remote_addr->addr, other);
> > + rcu_read_unlock();
> > +
> > + return equals;
> > +}
> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_equals);
> >
> > /* Each bound VSocket is stored in the bind hash table and each connected
> > * VSocket is stored in the connected hash table.
> > @@ -254,10 +387,16 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
> >
> > list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
> > connected_table) {
> > - if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
> > + struct sockaddr_vm_rcu *remote_addr;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (vsock_addr_equals_addr(src, &remote_addr->addr) &&
> > dst->svm_port == vsk->local_addr.svm_port) {
> > + rcu_read_unlock();
> > return sk_vsock(vsk);
> > }
> > + rcu_read_unlock();
> > }
> >
> > return NULL;
> > @@ -270,14 +409,25 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)
> > spin_unlock_bh(&vsock_table_lock);
> > }
> >
> > -void vsock_insert_connected(struct vsock_sock *vsk)
> > +int vsock_insert_connected(struct vsock_sock *vsk)
> > {
> > - struct list_head *list = vsock_connected_sockets(
> > - &vsk->remote_addr, &vsk->local_addr);
> > + struct list_head *list;
> > + struct sockaddr_vm_rcu *remote_addr;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (!remote_addr) {
> > + rcu_read_unlock();
> > + return -EINVAL;
> > + }
> > + list = vsock_connected_sockets(&remote_addr->addr, &vsk->local_addr);
> > + rcu_read_unlock();
> >
> > spin_lock_bh(&vsock_table_lock);
> > __vsock_insert_connected(list, vsk);
> > spin_unlock_bh(&vsock_table_lock);
> > +
> > + return 0;
> > }
> > EXPORT_SYMBOL_GPL(vsock_insert_connected);
> >
> > @@ -438,10 +588,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > {
> > const struct vsock_transport *new_transport;
> > struct sock *sk = sk_vsock(vsk);
> > - unsigned int remote_cid = vsk->remote_addr.svm_cid;
> > + struct sockaddr_vm remote_addr;
> > + unsigned int remote_cid;
> > __u8 remote_flags;
> > int ret;
> >
> > + ret = vsock_remote_addr_copy(vsk, &remote_addr);
> > + if (ret < 0)
> > + return ret;
> > +
> > + remote_cid = remote_addr.svm_cid;
> > +
> > /* If the packet is coming with the source and destination CIDs higher
> > * than VMADDR_CID_HOST, then a vsock channel where all the packets are
> > * forwarded to the host should be established. Then the host will
> > @@ -451,10 +608,15 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > * the connect path the flag can be set by the user space application.
> > */
> > if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST &&
> > - vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
> > - vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST;
> > + remote_addr.svm_cid > VMADDR_CID_HOST) {
> > + remote_addr.svm_flags |= VMADDR_CID_HOST;
> > +
> > + ret = vsock_remote_addr_update(vsk, &remote_addr);
> > + if (ret < 0)
> > + return ret;
> > + }
> >
> > - remote_flags = vsk->remote_addr.svm_flags;
> > + remote_flags = remote_addr.svm_flags;
> >
> > switch (sk->sk_type) {
> > case SOCK_DGRAM:
> > @@ -742,6 +904,7 @@ static struct sock *__vsock_create(struct net *net,
> > unsigned short type,
> > int kern)
> > {
> > + struct sockaddr_vm *remote_addr;
> > struct sock *sk;
> > struct vsock_sock *psk;
> > struct vsock_sock *vsk;
> > @@ -761,7 +924,14 @@ static struct sock *__vsock_create(struct net *net,
> >
> > vsk = vsock_sk(sk);
> > vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > +
> > + remote_addr = kmalloc(sizeof(*remote_addr), GFP_KERNEL);
> > + if (!remote_addr) {
> > + sk_free(sk);
> > + return NULL;
> > + }
> > + vsock_addr_init(remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > + rcu_assign_pointer(vsk->remote_addr, remote_addr);
> >
> > sk->sk_destruct = vsock_sk_destruct;
> > sk->sk_backlog_rcv = vsock_queue_rcv_skb;
> > @@ -845,6 +1015,7 @@ static void __vsock_release(struct sock *sk, int level)
> > static void vsock_sk_destruct(struct sock *sk)
> > {
> > struct vsock_sock *vsk = vsock_sk(sk);
> > + struct sockaddr_vm_rcu *remote_addr;
> >
> > vsock_deassign_transport(vsk);
> >
> > @@ -852,8 +1023,8 @@ static void vsock_sk_destruct(struct sock *sk)
> > * possibly register the address family with the kernel.
> > */
> > vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > -
> > + remote_addr = rcu_replace_pointer(vsk->remote_addr, NULL, 1);
> > + kfree_rcu(remote_addr);
> > put_cred(vsk->owner);
> > }
> >
> > @@ -943,6 +1114,7 @@ static int vsock_getname(struct socket *sock,
> > struct sock *sk;
> > struct vsock_sock *vsk;
> > struct sockaddr_vm *vm_addr;
> > + struct sockaddr_vm_rcu *rcu_ptr;
> >
> > sk = sock->sk;
> > vsk = vsock_sk(sk);
> > @@ -951,11 +1123,17 @@ static int vsock_getname(struct socket *sock,
> > lock_sock(sk);
> >
> > if (peer) {
> > + rcu_read_lock();
> > if (sock->state != SS_CONNECTED) {
> > err = -ENOTCONN;
> > goto out;
> > }
> > - vm_addr = &vsk->remote_addr;
> > + rcu_ptr = rcu_dereference(vsk->remote_addr);
> > + if (!rcu_ptr) {
> > + err = -EINVAL;
> > + goto out;
> > + }
> > + vm_addr = &rcu_ptr->addr;
> > } else {
> > vm_addr = &vsk->local_addr;
> > }
> > @@ -975,6 +1153,8 @@ static int vsock_getname(struct socket *sock,
> > err = sizeof(*vm_addr);
> >
> > out:
> > + if (peer)
> > + rcu_read_unlock();
> > release_sock(sk);
> > return err;
> > }
> > @@ -1161,7 +1341,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > int err;
> > struct sock *sk;
> > struct vsock_sock *vsk;
> > - struct sockaddr_vm *remote_addr;
> > + struct sockaddr_vm stack_addr, *remote_addr;
> > const struct vsock_transport *transport;
> >
> > if (msg->msg_flags & MSG_OOB)
> > @@ -1172,15 +1352,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > sk = sock->sk;
> > vsk = vsock_sk(sk);
> >
> > - lock_sock(sk);
> > + /* If auto-binding is required, acquire the slock to avoid potential
> > + * race conditions. Otherwise, do not acquire the lock.
> > + *
> > + * We know that the first check of local_addr is racy (indicated by
> > + * data_race()). By acquiring the lock and then subsequently checking
> > + * again if local_addr is bound (inside vsock_auto_bind()), we can
> > + * ensure there are no real data races.
> > + *
> > + * This technique is borrowed by inet_send_prepare().
> > + */
> > + if (data_race(!vsock_addr_bound(&vsk->local_addr))) {
> > + lock_sock(sk);
> > + err = vsock_auto_bind(vsk);
> > + release_sock(sk);
> > + if (err)
> > + return err;
> > + }
> >
> > transport = vsk->transport;
> >
> > - err = vsock_auto_bind(vsk);
> > - if (err)
> > - goto out;
> > -
> > -
> > /* If the provided message contains an address, use that. Otherwise
> > * fall back on the socket's remote handle (if it has been connected).
> > */
> > @@ -1199,18 +1390,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > goto out;
> > }
> > } else if (sock->state == SS_CONNECTED) {
> > - remote_addr = &vsk->remote_addr;
> > + err = vsock_remote_addr_copy(vsk, &stack_addr);
> > + if (err < 0)
> > + goto out;
> >
> > - if (remote_addr->svm_cid == VMADDR_CID_ANY)
> > - remote_addr->svm_cid = transport->get_local_cid();
> > + if (stack_addr.svm_cid == VMADDR_CID_ANY) {
> > + stack_addr.svm_cid = transport->get_local_cid();
> > + lock_sock(sk_vsock(vsk));
> > + vsock_remote_addr_update(vsk, &stack_addr);
> > + release_sock(sk_vsock(vsk));
> > + }
> >
> > /* XXX Should connect() or this function ensure remote_addr is
> > * bound?
> > */
> > - if (!vsock_addr_bound(&vsk->remote_addr)) {
> > + if (!vsock_addr_bound(&stack_addr)) {
> > err = -EINVAL;
> > goto out;
> > }
> > +
> > + remote_addr = &stack_addr;
> > } else {
> > err = -EINVAL;
> > goto out;
> > @@ -1225,7 +1424,6 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> >
> > out:
> > - release_sock(sk);
> > return err;
> > }
> >
> > @@ -1243,8 +1441,7 @@ static int vsock_dgram_connect(struct socket *sock,
> > err = vsock_addr_cast(addr, addr_len, &remote_addr);
> > if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
> > lock_sock(sk);
> > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
> > - VMADDR_PORT_ANY);
> > + vsock_remote_addr_update_cid_port(vsk, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > sock->state = SS_UNCONNECTED;
> > release_sock(sk);
> > return 0;
> > @@ -1263,7 +1460,10 @@ static int vsock_dgram_connect(struct socket *sock,
> > goto out;
> > }
> >
> > - memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> > + err = vsock_remote_addr_update(vsk, remote_addr);
> > + if (err < 0)
> > + goto out;
> > +
> > sock->state = SS_CONNECTED;
> >
> > /* sock map disallows redirection of non-TCP sockets with sk_state !=
> > @@ -1399,8 +1599,9 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
> > }
> >
> > /* Set the remote address that we are connecting to. */
> > - memcpy(&vsk->remote_addr, remote_addr,
> > - sizeof(vsk->remote_addr));
> > + err = vsock_remote_addr_update(vsk, remote_addr);
> > + if (err)
> > + goto out;
> >
> > err = vsock_assign_transport(vsk, NULL);
> > if (err)
> > @@ -1831,7 +2032,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> > goto out;
> > }
> >
> > - if (!vsock_addr_bound(&vsk->remote_addr)) {
> > + if (!vsock_remote_addr_bound(vsk)) {
> > err = -EDESTADDRREQ;
> > goto out;
> > }
> > diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
> > index a2823b1c5e28..f843bae86b32 100644
> > --- a/net/vmw_vsock/diag.c
> > +++ b/net/vmw_vsock/diag.c
> > @@ -15,8 +15,14 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> > u32 portid, u32 seq, u32 flags)
> > {
> > struct vsock_sock *vsk = vsock_sk(sk);
> > + struct sockaddr_vm remote_addr;
> > struct vsock_diag_msg *rep;
> > struct nlmsghdr *nlh;
> > + int err;
> > +
> > + err = vsock_remote_addr_copy(vsk, &remote_addr);
> > + if (err < 0)
> > + return err;
> >
> > nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
> > flags);
> > @@ -36,8 +42,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> > rep->vdiag_shutdown = sk->sk_shutdown;
> > rep->vdiag_src_cid = vsk->local_addr.svm_cid;
> > rep->vdiag_src_port = vsk->local_addr.svm_port;
> > - rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
> > - rep->vdiag_dst_port = vsk->remote_addr.svm_port;
> > + rep->vdiag_dst_cid = remote_addr.svm_cid;
> > + rep->vdiag_dst_port = remote_addr.svm_port;
> > rep->vdiag_ino = sock_i_ino(sk);
> >
> > sock_diag_save_cookie(sk, rep->vdiag_cookie);
> > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > index 7cb1a9d2cdb4..462b2ec3e6e9 100644
> > --- a/net/vmw_vsock/hyperv_transport.c
> > +++ b/net/vmw_vsock/hyperv_transport.c
> > @@ -336,9 +336,11 @@ static void hvs_open_connection(struct vmbus_channel *chan)
> > hvs_addr_init(&vnew->local_addr, if_type);
> >
> > /* Remote peer is always the host */
> > - vsock_addr_init(&vnew->remote_addr,
> > - VMADDR_CID_HOST, VMADDR_PORT_ANY);
> > - vnew->remote_addr.svm_port = get_port_by_srv_id(if_instance);
> > + ret = vsock_remote_addr_update_cid_port(vnew, VMADDR_CID_HOST,
> > + get_port_by_srv_id(if_instance));
> > + if (ret < 0)
> > + goto out;
> > +
> > ret = vsock_assign_transport(vnew, vsock_sk(sk));
> > /* Transport assigned (looking at remote_addr) must be the
> > * same where we received the request.
> > @@ -459,13 +461,18 @@ static int hvs_connect(struct vsock_sock *vsk)
> > {
> > union hvs_service_id vm, host;
> > struct hvsock *h = vsk->trans;
> > + int err;
> >
> > vm.srv_id = srv_id_template;
> > vm.svm_port = vsk->local_addr.svm_port;
> > h->vm_srv_id = vm.srv_id;
> >
> > host.srv_id = srv_id_template;
> > - host.svm_port = vsk->remote_addr.svm_port;
> > +
> > + err = vsock_remote_addr_port(vsk, &host.svm_port);
> > + if (err < 0)
> > + return err;
> > +
> > h->host_srv_id = host.srv_id;
> >
> > return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 925acface893..1b87704e516a 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -258,8 +258,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > src_cid = t_ops->transport.get_local_cid();
> > src_port = vsk->local_addr.svm_port;
> > if (!info->remote_cid) {
> > - dst_cid = vsk->remote_addr.svm_cid;
> > - dst_port = vsk->remote_addr.svm_port;
> > + ret = vsock_remote_addr_cid_port(vsk, &dst_cid, &dst_port);
> > + if (ret < 0)
> > + return ret;
> > } else {
> > dst_cid = info->remote_cid;
> > dst_port = info->remote_port;
> > @@ -1169,7 +1170,9 @@ virtio_transport_recv_connecting(struct sock *sk,
> > case VIRTIO_VSOCK_OP_RESPONSE:
> > sk->sk_state = TCP_ESTABLISHED;
> > sk->sk_socket->state = SS_CONNECTED;
> > - vsock_insert_connected(vsk);
> > + err = vsock_insert_connected(vsk);
> > + if (err)
> > + goto destroy;
> > sk->sk_state_change(sk);
> > break;
> > case VIRTIO_VSOCK_OP_INVALID:
> > @@ -1403,9 +1406,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> > vchild = vsock_sk(child);
> > vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
> > le32_to_cpu(hdr->dst_port));
> > - vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
> > - le32_to_cpu(hdr->src_port));
> > -
> > + vsock_remote_addr_update_cid_port(vchild, le64_to_cpu(hdr->src_cid),
> > + le32_to_cpu(hdr->src_port));
> > ret = vsock_assign_transport(vchild, vsk);
> > /* Transport assigned (looking at remote_addr) must be the same
> > * where we received the request.
> > @@ -1420,7 +1422,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> > if (virtio_transport_space_update(child, skb))
> > child->sk_write_space(child);
> >
> > - vsock_insert_connected(vchild);
> > + ret = vsock_insert_connected(vchild);
> > + if (ret) {
> > + release_sock(child);
> > + virtio_transport_reset_no_sock(t, skb);
> > + sock_put(child);
> > + return ret;
> > + }
> > vsock_enqueue_accept(sk, child);
> > virtio_transport_send_response(vchild, skb);
> >
> > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> > index b370070194fa..c0c445e7d925 100644
> > --- a/net/vmw_vsock/vmci_transport.c
> > +++ b/net/vmw_vsock/vmci_transport.c
> > @@ -283,18 +283,25 @@ vmci_transport_send_control_pkt(struct sock *sk,
> > u16 proto,
> > struct vmci_handle handle)
> > {
> > + struct sockaddr_vm addr_stack;
> > + struct sockaddr_vm *remote_addr = &addr_stack;
> > struct vsock_sock *vsk;
> > + int err;
> >
> > vsk = vsock_sk(sk);
> >
> > if (!vsock_addr_bound(&vsk->local_addr))
> > return -EINVAL;
> >
> > - if (!vsock_addr_bound(&vsk->remote_addr))
> > + if (!vsock_remote_addr_bound(vsk))
> > return -EINVAL;
> >
> > + err = vsock_remote_addr_copy(vsk, &addr_stack);
> > + if (err < 0)
> > + return err;
> > +
> > return vmci_transport_alloc_send_control_pkt(&vsk->local_addr,
> > - &vsk->remote_addr,
> > + remote_addr,
> > type, size, mode,
> > wait, proto, handle);
> > }
> > @@ -317,6 +324,7 @@ static int vmci_transport_send_reset(struct sock *sk,
> > struct sockaddr_vm *dst_ptr;
> > struct sockaddr_vm dst;
> > struct vsock_sock *vsk;
> > + int err;
> >
> > if (pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST)
> > return 0;
> > @@ -326,13 +334,16 @@ static int vmci_transport_send_reset(struct sock *sk,
> > if (!vsock_addr_bound(&vsk->local_addr))
> > return -EINVAL;
> >
> > - if (vsock_addr_bound(&vsk->remote_addr)) {
> > - dst_ptr = &vsk->remote_addr;
> > + if (vsock_remote_addr_bound(vsk)) {
> > + err = vsock_remote_addr_copy(vsk, &dst);
> > + if (err < 0)
> > + return err;
> > } else {
> > vsock_addr_init(&dst, pkt->dg.src.context,
> > pkt->src_port);
> > - dst_ptr = &dst;
> > }
> > + dst_ptr = &dst;
> > +
> > return vmci_transport_alloc_send_control_pkt(&vsk->local_addr, dst_ptr,
> > VMCI_TRANSPORT_PACKET_TYPE_RST,
> > 0, 0, NULL, VSOCK_PROTO_INVALID,
> > @@ -490,7 +501,7 @@ static struct sock *vmci_transport_get_pending(
> >
> > list_for_each_entry(vpending, &vlistener->pending_links,
> > pending_links) {
> > - if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
> > + if (vsock_remote_addr_equals(vpending, &src) &&
> > pkt->dst_port == vpending->local_addr.svm_port) {
> > pending = sk_vsock(vpending);
> > sock_hold(pending);
> > @@ -1015,8 +1026,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
> >
> > vsock_addr_init(&vpending->local_addr, pkt->dg.dst.context,
> > pkt->dst_port);
> > - vsock_addr_init(&vpending->remote_addr, pkt->dg.src.context,
> > - pkt->src_port);
> > + vsock_remote_addr_update_cid_port(vpending, pkt->dg.src.context,
> > + pkt->src_port);
> >
> > err = vsock_assign_transport(vpending, vsock_sk(sk));
> > /* Transport assigned (looking at remote_addr) must be the same
> > @@ -1133,6 +1144,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> > {
> > struct vsock_sock *vpending;
> > struct vmci_handle handle;
> > + unsigned int vpending_remote_cid;
> > struct vmci_qp *qpair;
> > bool is_local;
> > u32 flags;
> > @@ -1189,8 +1201,13 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> > /* vpending->local_addr always has a context id so we do not need to
> > * worry about VMADDR_CID_ANY in this case.
> > */
> > - is_local =
> > - vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
> > + err = vsock_remote_addr_cid(vpending, &vpending_remote_cid);
> > + if (err < 0) {
> > + skerr = EPROTO;
> > + goto destroy;
> > + }
> > +
> > + is_local = vpending_remote_cid == vpending->local_addr.svm_cid;
> > flags = VMCI_QPFLAG_ATTACH_ONLY;
> > flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
> >
> > @@ -1203,7 +1220,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> > flags,
> > vmci_transport_is_trusted(
> > vpending,
> > - vpending->remote_addr.svm_cid));
> > + vpending_remote_cid));
> > if (err < 0) {
> > vmci_transport_send_reset(pending, pkt);
> > skerr = -err;
> > @@ -1306,9 +1323,20 @@ vmci_transport_recv_connecting_client(struct sock *sk,
> > break;
> > case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE:
> > case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE2:
> > + struct sockaddr_vm_rcu *remote_addr;
> > +
> > + rcu_read_lock();
> > + remote_addr = rcu_dereference(vsk->remote_addr);
> > + if (!remote_addr) {
> > + skerr = EPROTO;
> > + err = -EINVAL;
> > + rcu_read_unlock();
> > + goto destroy;
> > + }
> > +
> > if (pkt->u.size == 0
> > - || pkt->dg.src.context != vsk->remote_addr.svm_cid
> > - || pkt->src_port != vsk->remote_addr.svm_port
> > + || pkt->dg.src.context != remote_addr->addr.svm_cid
> > + || pkt->src_port != remote_addr->addr.svm_port
> > || !vmci_handle_is_invalid(vmci_trans(vsk)->qp_handle)
> > || vmci_trans(vsk)->qpair
> > || vmci_trans(vsk)->produce_size != 0
> > @@ -1316,9 +1344,10 @@ vmci_transport_recv_connecting_client(struct sock *sk,
> > || vmci_trans(vsk)->detach_sub_id != VMCI_INVALID_ID) {
> > skerr = EPROTO;
> > err = -EINVAL;
> > -
> > + rcu_read_unlock();
> > goto destroy;
> > }
> > + rcu_read_unlock();
> >
> > err = vmci_transport_recv_connecting_client_negotiate(sk, pkt);
> > if (err) {
> > @@ -1379,6 +1408,7 @@ static int vmci_transport_recv_connecting_client_negotiate(
> > int err;
> > struct vsock_sock *vsk;
> > struct vmci_handle handle;
> > + unsigned int remote_cid;
> > struct vmci_qp *qpair;
> > u32 detach_sub_id;
> > bool is_local;
> > @@ -1449,19 +1479,23 @@ static int vmci_transport_recv_connecting_client_negotiate(
> >
> > /* Make VMCI select the handle for us. */
> > handle = VMCI_INVALID_HANDLE;
> > - is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
> > +
> > + err = vsock_remote_addr_cid(vsk, &remote_cid);
> > + if (err < 0)
> > + goto destroy;
> > +
> > + is_local = remote_cid == vsk->local_addr.svm_cid;
> > flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
> >
> > err = vmci_transport_queue_pair_alloc(&qpair,
> > &handle,
> > pkt->u.size,
> > pkt->u.size,
> > - vsk->remote_addr.svm_cid,
> > + remote_cid,
> > flags,
> > vmci_transport_is_trusted(
> > vsk,
> > - vsk->
> > - remote_addr.svm_cid));
> > + remote_cid));
> > if (err < 0)
> > goto destroy;
> >
> >
> > --
> > 2.30.2
> >
>
On Sat, Apr 15, 2023 at 10:30:55AM +0000, Bobby Eshleman wrote:
>On Wed, Apr 19, 2023 at 11:30:53AM +0200, Stefano Garzarella wrote:
>> On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
>> > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
>> > it does not scale when many senders share a socket.
>> >
>> > Prior to this patch the socket lock is used to protect the local_addr,
>> > remote_addr, transport, and buffer size variables. What follows are the
>> > new protection schemes for the various protected fields that ensure a
>> > race-free multi-sender sendmsg() path for vsock dgrams.
>> >
>> > - local_addr
>> > local_addr changes as a result of binding a socket. The write path
>> > for local_addr is bind() and various vsock_auto_bind() call sites.
>> > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
>> > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
>> > rejects the user request and vsock_auto_bind() early exits.
>> > Therefore, the local addr can not change while a parallel thread is
>> > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
>> > Change: only acquire lock for auto-binding as-needed in sendmsg().
>> >
>> > - vsk->transport
>> > Updated upon socket creation and it doesn't change again until the
>>
>> This is true only for dgram, right?
>>
>
>Yes.
>
>> How do we decide which transport to assign for dgram?
>>
>
>The transport is assigned in proto->create() [vsock_create()]. It is
>assigned there *only* for dgrams, whereas for streams/seqpackets it is
>assigned in connect(). vsock_create() sets transport to
>'transport_dgram' if sock->type == SOCK_DGRAM.
>
>vsock_sk_destruct() then eventually sets vsk->transport to NULL.
>
>Neither destruct nor create can occur in parallel with sendmsg().
>create() hasn't yet returned the sockfd for the user to call upon it
>sendmsg(), and AFAICT destruct is only called after the final socket
>reference is released, which only happens after the socket no longer
>exists in the fd lookup table and so sendmsg() will fail before it ever
>has the chance to race.
This is okay if a socket can be assigned to a single transport, but with
dgrams I'm not sure we can do that.
Since it is not connected, a socket can send or receive packets from
different transports, so maybe we can't assign it to a specific one,
but do a lookup for every packets to understand which transport to use.
>
>> > socket is destroyed, which only happens after the socket refcnt reaches
>> > zero. This prevents any sendmsg() call from being entered because the
>> > sockfd lookup fails beforehand. That is, sendmsg() and vsk->transport
>> > writes cannot execute in parallel. Additionally, connect() doesn't
>> > update vsk->transport for dgrams as it does for streams. Therefore
>> > vsk->transport is also safe to access lock-free in the sendmsg() path.
>> > No change.
>> >
>> > - buffer size variables
>> > Not used by dgram, so they do not need protection. No change.
>>
>> Is this true because for dgram we use the socket buffer?
>> Is it the same for VMCI?
>
>Yes. The buf_alloc derived from buffer_size is also always ignored after
>being initialized once since credits aren't used.
>
>My reading of the VMCI code is that the buffer_size and
>buffer_{min,max}_size variables are used only in connectible calls (like
>connect and recv_listen), but not for datagrams.
Okay, thanks for checking!
>
>>
>> >
>> > - remote_addr
>> > Needs additional protection because before this patch the
>> > remote_addr (consisting of several fields such as cid, port, and flags)
>> > only changed atomically under socket lock context. By acquiring the
>> > socket lock to read the structure, the changes made by connect() were
>> > always made visible to sendmsg() atomically. Consequently, to retain
>> > atomicity of updates but offer lock-free access, this patch
>> > redesigns this field as an RCU-protected pointer.
>> >
>> > Writers are still synchronized using the socket lock, but readers
>> > only read inside RCU read-side critical sections.
>> >
>> > Helpers are introduced for accessing and updating the new pointer.
>> >
>> > The remote_addr structure is wrapped together with an rcu_head into a
>> > sockaddr_vm_rcu structure so that kfree_rcu() can be used. This removes
>> > the need of writers to use synchronize_rcu() after freeing old structures
>> > which is simply more efficient and reduces code churn where remote_addr
>> > is already being updated inside read-side sections.
>> >
>> > Only virtio has been tested, but updates were necessary to the VMCI and
>> > hyperv code. Unfortunately the author does not have access to
>> > VMCI/hyperv systems so those changes are untested.
>> >
>> > Perf Tests
>> > vCPUS: 16
>> > Threads: 16
>> > Payload: 4KB
>> > Test Runs: 5
>> > Type: SOCK_DGRAM
>> >
>> > Before: 245.2 MB/s
>> > After: 509.2 MB/s (+107%)
>> >
>> > Notably, on the same test system, vsock dgram even outperforms
>> > multi-threaded UDP over virtio-net with vhost and MQ support enabled.
>>
>> Cool!
>>
>> This patch is quite large, so I need to review it carefully in future
>> versions, but in general it makes sense to me.
>>
>> Thanks,
>> Stefano
>>
>
>Thanks for the initial comments!
>
>Best,
>Bobby
>
>> >
>> > Throughput metrics for single-threaded SOCK_DGRAM and
>> > single/multi-threaded SOCK_STREAM showed no statistically signficant
>> > throughput changes (lowest p-value reaching 0.27), with the range of the
>> > mean difference ranging between -5% to +1%.
>> >
>> > Signed-off-by: Bobby Eshleman <[email protected]>
>> > ---
>> > drivers/vhost/vsock.c | 12 +-
>> > include/net/af_vsock.h | 19 ++-
>> > net/vmw_vsock/af_vsock.c | 261 ++++++++++++++++++++++++++++----
>> > net/vmw_vsock/diag.c | 10 +-
>> > net/vmw_vsock/hyperv_transport.c | 15 +-
>> > net/vmw_vsock/virtio_transport_common.c | 22 ++-
>> > net/vmw_vsock/vmci_transport.c | 70 ++++++---
>> > 7 files changed, 344 insertions(+), 65 deletions(-)
>> >
>> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> > index 028cf079225e..da105cb856ac 100644
>> > --- a/drivers/vhost/vsock.c
>> > +++ b/drivers/vhost/vsock.c
>> > @@ -296,13 +296,17 @@ static int
>> > vhost_transport_cancel_pkt(struct vsock_sock *vsk)
>> > {
>> > struct vhost_vsock *vsock;
>> > + unsigned int cid;
>> > int cnt = 0;
>> > int ret = -ENODEV;
>> >
>> > rcu_read_lock();
>> > + ret = vsock_remote_addr_cid(vsk, &cid);
>> > + if (ret < 0)
>> > + goto out;
>> >
>> > /* Find the vhost_vsock according to guest context id */
>> > - vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
>> > + vsock = vhost_vsock_get(cid);
>> > if (!vsock)
>> > goto out;
>> >
>> > @@ -686,6 +690,10 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
>> > static void vhost_vsock_reset_orphans(struct sock *sk)
>> > {
>> > struct vsock_sock *vsk = vsock_sk(sk);
>> > + unsigned int cid;
>> > +
>> > + if (vsock_remote_addr_cid(vsk, &cid) < 0)
>> > + return;
>> >
>> > /* vmci_transport.c doesn't take sk_lock here either. At least we're
>> > * under vsock_table_lock so the sock cannot disappear while we're
>> > @@ -693,7 +701,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
>> > */
>> >
>> > /* If the peer is still valid, no need to reset connection */
>> > - if (vhost_vsock_get(vsk->remote_addr.svm_cid))
>> > + if (vhost_vsock_get(cid))
>> > return;
>> >
>> > /* If the close timeout is pending, let it expire. This avoids races
>> > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> > index 57af28fede19..c02fd6ad0047 100644
>> > --- a/include/net/af_vsock.h
>> > +++ b/include/net/af_vsock.h
>> > @@ -25,12 +25,17 @@ extern spinlock_t vsock_table_lock;
>> > #define vsock_sk(__sk) ((struct vsock_sock *)__sk)
>> > #define sk_vsock(__vsk) (&(__vsk)->sk)
>> >
>> > +struct sockaddr_vm_rcu {
>> > + struct sockaddr_vm addr;
>> > + struct rcu_head rcu;
>> > +};
>> > +
>> > struct vsock_sock {
>> > /* sk must be the first member. */
>> > struct sock sk;
>> > const struct vsock_transport *transport;
>> > struct sockaddr_vm local_addr;
>> > - struct sockaddr_vm remote_addr;
>> > + struct sockaddr_vm_rcu * __rcu remote_addr;
>> > /* Links for the global tables of bound and connected sockets. */
>> > struct list_head bound_table;
>> > struct list_head connected_table;
>> > @@ -206,7 +211,7 @@ void vsock_release_pending(struct sock *pending);
>> > void vsock_add_pending(struct sock *listener, struct sock *pending);
>> > void vsock_remove_pending(struct sock *listener, struct sock *pending);
>> > void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
>> > -void vsock_insert_connected(struct vsock_sock *vsk);
>> > +int vsock_insert_connected(struct vsock_sock *vsk);
>> > void vsock_remove_bound(struct vsock_sock *vsk);
>> > void vsock_remove_connected(struct vsock_sock *vsk);
>> > struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
>> > @@ -244,4 +249,14 @@ static inline void __init vsock_bpf_build_proto(void)
>> > {}
>> > #endif
>> >
>> > +/* RCU-protected remote addr helpers */
>> > +int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid);
>> > +int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port);
>> > +int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
>> > + unsigned int *port);
>> > +int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest);
>> > +bool vsock_remote_addr_bound(struct vsock_sock *vsk);
>> > +bool vsock_remote_addr_equals(struct vsock_sock *vsk, struct sockaddr_vm *other);
>> > +int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port);
>> > +
>> > #endif /* __AF_VSOCK_H__ */
>> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> > index 46b3f35e3adc..93b4abbf20b4 100644
>> > --- a/net/vmw_vsock/af_vsock.c
>> > +++ b/net/vmw_vsock/af_vsock.c
>> > @@ -145,6 +145,139 @@ static const struct vsock_transport *transport_local;
>> > static DEFINE_MUTEX(vsock_register_mutex);
>> >
>> > /**** UTILS ****/
>> > +bool vsock_remote_addr_bound(struct vsock_sock *vsk)
>> > +{
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > + bool ret;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (!remote_addr) {
>> > + rcu_read_unlock();
>> > + return false;
>> > + }
>> > +
>> > + ret = vsock_addr_bound(&remote_addr->addr);
>> > + rcu_read_unlock();
>> > +
>> > + return ret;
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_bound);
>> > +
>> > +int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest)
>> > +{
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (!remote_addr) {
>> > + rcu_read_unlock();
>> > + return -EINVAL;
>> > + }
>> > + memcpy(dest, &remote_addr->addr, sizeof(*dest));
>> > + rcu_read_unlock();
>> > +
>> > + return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_copy);
>> > +
>> > +int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid)
>> > +{
>> > + return vsock_remote_addr_cid_port(vsk, cid, NULL);
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_cid);
>> > +
>> > +int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port)
>> > +{
>> > + return vsock_remote_addr_cid_port(vsk, NULL, port);
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_port);
>> > +
>> > +int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
>> > + unsigned int *port)
>> > +{
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (!remote_addr) {
>> > + rcu_read_unlock();
>> > + return -EINVAL;
>> > + }
>> > +
>> > + if (cid)
>> > + *cid = remote_addr->addr.svm_cid;
>> > + if (port)
>> > + *port = remote_addr->addr.svm_port;
>> > +
>> > + rcu_read_unlock();
>> > + return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_cid_port);
>> > +
>> > +/* The socket lock must be held by the caller */
>> > +int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port)
>> > +{
>> > + struct sockaddr_vm_rcu *old, *new;
>> > +
>> > + new = kmalloc(sizeof(*new), GFP_KERNEL);
>> > + if (!new)
>> > + return -ENOMEM;
>> > +
>> > + rcu_read_lock();
>> > + old = rcu_dereference(vsk->remote_addr);
>> > + if (!old) {
>> > + kfree(new);
>> > + return -EINVAL;
>> > + }
>> > + memcpy(&new->addr, &old->addr, sizeof(new->addr));
>> > + rcu_read_unlock();
>> > +
>> > + new->addr.svm_cid = cid;
>> > + new->addr.svm_port = port;
>> > +
>> > + old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
>> > + kfree_rcu(old, rcu);
>> > +
>> > + return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_update_cid_port);
>> > +
>> > +/* The socket lock must be held by the caller */
>> > +int vsock_remote_addr_update(struct vsock_sock *vsk, struct sockaddr_vm *src)
>> > +{
>> > + struct sockaddr_vm_rcu *old, *new;
>> > +
>> > + new = kmalloc(sizeof(*new), GFP_KERNEL);
>> > + if (!new)
>> > + return -ENOMEM;
>> > +
>> > + memcpy(&new->addr, src, sizeof(new->addr));
>> > + old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
>> > + kfree_rcu(old, rcu);
>> > +
>> > + return 0;
>> > +}
>> > +
>> > +bool vsock_remote_addr_equals(struct vsock_sock *vsk,
>> > + struct sockaddr_vm *other)
>> > +{
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > + bool equals;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (!remote_addr) {
>> > + rcu_read_unlock();
>> > + return false;
>> > + }
>> > +
>> > + equals = vsock_addr_equals_addr(&remote_addr->addr, other);
>> > + rcu_read_unlock();
>> > +
>> > + return equals;
>> > +}
>> > +EXPORT_SYMBOL_GPL(vsock_remote_addr_equals);
>> >
>> > /* Each bound VSocket is stored in the bind hash table and each connected
>> > * VSocket is stored in the connected hash table.
>> > @@ -254,10 +387,16 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
>> >
>> > list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
>> > connected_table) {
>> > - if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (vsock_addr_equals_addr(src, &remote_addr->addr) &&
>> > dst->svm_port == vsk->local_addr.svm_port) {
>> > + rcu_read_unlock();
>> > return sk_vsock(vsk);
>> > }
>> > + rcu_read_unlock();
>> > }
>> >
>> > return NULL;
>> > @@ -270,14 +409,25 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)
>> > spin_unlock_bh(&vsock_table_lock);
>> > }
>> >
>> > -void vsock_insert_connected(struct vsock_sock *vsk)
>> > +int vsock_insert_connected(struct vsock_sock *vsk)
>> > {
>> > - struct list_head *list = vsock_connected_sockets(
>> > - &vsk->remote_addr, &vsk->local_addr);
>> > + struct list_head *list;
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (!remote_addr) {
>> > + rcu_read_unlock();
>> > + return -EINVAL;
>> > + }
>> > + list = vsock_connected_sockets(&remote_addr->addr, &vsk->local_addr);
>> > + rcu_read_unlock();
>> >
>> > spin_lock_bh(&vsock_table_lock);
>> > __vsock_insert_connected(list, vsk);
>> > spin_unlock_bh(&vsock_table_lock);
>> > +
>> > + return 0;
>> > }
>> > EXPORT_SYMBOL_GPL(vsock_insert_connected);
>> >
>> > @@ -438,10 +588,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
>> > {
>> > const struct vsock_transport *new_transport;
>> > struct sock *sk = sk_vsock(vsk);
>> > - unsigned int remote_cid = vsk->remote_addr.svm_cid;
>> > + struct sockaddr_vm remote_addr;
>> > + unsigned int remote_cid;
>> > __u8 remote_flags;
>> > int ret;
>> >
>> > + ret = vsock_remote_addr_copy(vsk, &remote_addr);
>> > + if (ret < 0)
>> > + return ret;
>> > +
>> > + remote_cid = remote_addr.svm_cid;
>> > +
>> > /* If the packet is coming with the source and destination CIDs higher
>> > * than VMADDR_CID_HOST, then a vsock channel where all the packets are
>> > * forwarded to the host should be established. Then the host will
>> > @@ -451,10 +608,15 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
>> > * the connect path the flag can be set by the user space application.
>> > */
>> > if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST &&
>> > - vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
>> > - vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST;
>> > + remote_addr.svm_cid > VMADDR_CID_HOST) {
>> > + remote_addr.svm_flags |= VMADDR_CID_HOST;
>> > +
>> > + ret = vsock_remote_addr_update(vsk, &remote_addr);
>> > + if (ret < 0)
>> > + return ret;
>> > + }
>> >
>> > - remote_flags = vsk->remote_addr.svm_flags;
>> > + remote_flags = remote_addr.svm_flags;
>> >
>> > switch (sk->sk_type) {
>> > case SOCK_DGRAM:
>> > @@ -742,6 +904,7 @@ static struct sock *__vsock_create(struct net *net,
>> > unsigned short type,
>> > int kern)
>> > {
>> > + struct sockaddr_vm *remote_addr;
>> > struct sock *sk;
>> > struct vsock_sock *psk;
>> > struct vsock_sock *vsk;
>> > @@ -761,7 +924,14 @@ static struct sock *__vsock_create(struct net *net,
>> >
>> > vsk = vsock_sk(sk);
>> > vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>> > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>> > +
>> > + remote_addr = kmalloc(sizeof(*remote_addr), GFP_KERNEL);
>> > + if (!remote_addr) {
>> > + sk_free(sk);
>> > + return NULL;
>> > + }
>> > + vsock_addr_init(remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>> > + rcu_assign_pointer(vsk->remote_addr, remote_addr);
>> >
>> > sk->sk_destruct = vsock_sk_destruct;
>> > sk->sk_backlog_rcv = vsock_queue_rcv_skb;
>> > @@ -845,6 +1015,7 @@ static void __vsock_release(struct sock *sk, int level)
>> > static void vsock_sk_destruct(struct sock *sk)
>> > {
>> > struct vsock_sock *vsk = vsock_sk(sk);
>> > + struct sockaddr_vm_rcu *remote_addr;
>> >
>> > vsock_deassign_transport(vsk);
>> >
>> > @@ -852,8 +1023,8 @@ static void vsock_sk_destruct(struct sock *sk)
>> > * possibly register the address family with the kernel.
>> > */
>> > vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>> > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>> > -
>> > + remote_addr = rcu_replace_pointer(vsk->remote_addr, NULL, 1);
>> > + kfree_rcu(remote_addr);
>> > put_cred(vsk->owner);
>> > }
>> >
>> > @@ -943,6 +1114,7 @@ static int vsock_getname(struct socket *sock,
>> > struct sock *sk;
>> > struct vsock_sock *vsk;
>> > struct sockaddr_vm *vm_addr;
>> > + struct sockaddr_vm_rcu *rcu_ptr;
>> >
>> > sk = sock->sk;
>> > vsk = vsock_sk(sk);
>> > @@ -951,11 +1123,17 @@ static int vsock_getname(struct socket *sock,
>> > lock_sock(sk);
>> >
>> > if (peer) {
>> > + rcu_read_lock();
>> > if (sock->state != SS_CONNECTED) {
>> > err = -ENOTCONN;
>> > goto out;
>> > }
>> > - vm_addr = &vsk->remote_addr;
>> > + rcu_ptr = rcu_dereference(vsk->remote_addr);
>> > + if (!rcu_ptr) {
>> > + err = -EINVAL;
>> > + goto out;
>> > + }
>> > + vm_addr = &rcu_ptr->addr;
>> > } else {
>> > vm_addr = &vsk->local_addr;
>> > }
>> > @@ -975,6 +1153,8 @@ static int vsock_getname(struct socket *sock,
>> > err = sizeof(*vm_addr);
>> >
>> > out:
>> > + if (peer)
>> > + rcu_read_unlock();
>> > release_sock(sk);
>> > return err;
>> > }
>> > @@ -1161,7 +1341,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>> > int err;
>> > struct sock *sk;
>> > struct vsock_sock *vsk;
>> > - struct sockaddr_vm *remote_addr;
>> > + struct sockaddr_vm stack_addr, *remote_addr;
>> > const struct vsock_transport *transport;
>> >
>> > if (msg->msg_flags & MSG_OOB)
>> > @@ -1172,15 +1352,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>> > sk = sock->sk;
>> > vsk = vsock_sk(sk);
>> >
>> > - lock_sock(sk);
>> > + /* If auto-binding is required, acquire the slock to avoid potential
>> > + * race conditions. Otherwise, do not acquire the lock.
>> > + *
>> > + * We know that the first check of local_addr is racy (indicated by
>> > + * data_race()). By acquiring the lock and then subsequently checking
>> > + * again if local_addr is bound (inside vsock_auto_bind()), we can
>> > + * ensure there are no real data races.
>> > + *
>> > + * This technique is borrowed by inet_send_prepare().
>> > + */
>> > + if (data_race(!vsock_addr_bound(&vsk->local_addr))) {
>> > + lock_sock(sk);
>> > + err = vsock_auto_bind(vsk);
>> > + release_sock(sk);
>> > + if (err)
>> > + return err;
>> > + }
>> >
>> > transport = vsk->transport;
>> >
>> > - err = vsock_auto_bind(vsk);
>> > - if (err)
>> > - goto out;
>> > -
>> > -
>> > /* If the provided message contains an address, use that. Otherwise
>> > * fall back on the socket's remote handle (if it has been connected).
>> > */
>> > @@ -1199,18 +1390,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>> > goto out;
>> > }
>> > } else if (sock->state == SS_CONNECTED) {
>> > - remote_addr = &vsk->remote_addr;
>> > + err = vsock_remote_addr_copy(vsk, &stack_addr);
>> > + if (err < 0)
>> > + goto out;
>> >
>> > - if (remote_addr->svm_cid == VMADDR_CID_ANY)
>> > - remote_addr->svm_cid = transport->get_local_cid();
>> > + if (stack_addr.svm_cid == VMADDR_CID_ANY) {
>> > + stack_addr.svm_cid = transport->get_local_cid();
>> > + lock_sock(sk_vsock(vsk));
>> > + vsock_remote_addr_update(vsk, &stack_addr);
>> > + release_sock(sk_vsock(vsk));
>> > + }
>> >
>> > /* XXX Should connect() or this function ensure remote_addr is
>> > * bound?
>> > */
>> > - if (!vsock_addr_bound(&vsk->remote_addr)) {
>> > + if (!vsock_addr_bound(&stack_addr)) {
>> > err = -EINVAL;
>> > goto out;
>> > }
>> > +
>> > + remote_addr = &stack_addr;
>> > } else {
>> > err = -EINVAL;
>> > goto out;
>> > @@ -1225,7 +1424,6 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>> > err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
>> >
>> > out:
>> > - release_sock(sk);
>> > return err;
>> > }
>> >
>> > @@ -1243,8 +1441,7 @@ static int vsock_dgram_connect(struct socket *sock,
>> > err = vsock_addr_cast(addr, addr_len, &remote_addr);
>> > if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
>> > lock_sock(sk);
>> > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
>> > - VMADDR_PORT_ANY);
>> > + vsock_remote_addr_update_cid_port(vsk, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>> > sock->state = SS_UNCONNECTED;
>> > release_sock(sk);
>> > return 0;
>> > @@ -1263,7 +1460,10 @@ static int vsock_dgram_connect(struct socket *sock,
>> > goto out;
>> > }
>> >
>> > - memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
>> > + err = vsock_remote_addr_update(vsk, remote_addr);
>> > + if (err < 0)
>> > + goto out;
>> > +
>> > sock->state = SS_CONNECTED;
>> >
>> > /* sock map disallows redirection of non-TCP sockets with sk_state !=
>> > @@ -1399,8 +1599,9 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
>> > }
>> >
>> > /* Set the remote address that we are connecting to. */
>> > - memcpy(&vsk->remote_addr, remote_addr,
>> > - sizeof(vsk->remote_addr));
>> > + err = vsock_remote_addr_update(vsk, remote_addr);
>> > + if (err)
>> > + goto out;
>> >
>> > err = vsock_assign_transport(vsk, NULL);
>> > if (err)
>> > @@ -1831,7 +2032,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
>> > goto out;
>> > }
>> >
>> > - if (!vsock_addr_bound(&vsk->remote_addr)) {
>> > + if (!vsock_remote_addr_bound(vsk)) {
>> > err = -EDESTADDRREQ;
>> > goto out;
>> > }
>> > diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
>> > index a2823b1c5e28..f843bae86b32 100644
>> > --- a/net/vmw_vsock/diag.c
>> > +++ b/net/vmw_vsock/diag.c
>> > @@ -15,8 +15,14 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
>> > u32 portid, u32 seq, u32 flags)
>> > {
>> > struct vsock_sock *vsk = vsock_sk(sk);
>> > + struct sockaddr_vm remote_addr;
>> > struct vsock_diag_msg *rep;
>> > struct nlmsghdr *nlh;
>> > + int err;
>> > +
>> > + err = vsock_remote_addr_copy(vsk, &remote_addr);
>> > + if (err < 0)
>> > + return err;
>> >
>> > nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
>> > flags);
>> > @@ -36,8 +42,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
>> > rep->vdiag_shutdown = sk->sk_shutdown;
>> > rep->vdiag_src_cid = vsk->local_addr.svm_cid;
>> > rep->vdiag_src_port = vsk->local_addr.svm_port;
>> > - rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
>> > - rep->vdiag_dst_port = vsk->remote_addr.svm_port;
>> > + rep->vdiag_dst_cid = remote_addr.svm_cid;
>> > + rep->vdiag_dst_port = remote_addr.svm_port;
>> > rep->vdiag_ino = sock_i_ino(sk);
>> >
>> > sock_diag_save_cookie(sk, rep->vdiag_cookie);
>> > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
>> > index 7cb1a9d2cdb4..462b2ec3e6e9 100644
>> > --- a/net/vmw_vsock/hyperv_transport.c
>> > +++ b/net/vmw_vsock/hyperv_transport.c
>> > @@ -336,9 +336,11 @@ static void hvs_open_connection(struct vmbus_channel *chan)
>> > hvs_addr_init(&vnew->local_addr, if_type);
>> >
>> > /* Remote peer is always the host */
>> > - vsock_addr_init(&vnew->remote_addr,
>> > - VMADDR_CID_HOST, VMADDR_PORT_ANY);
>> > - vnew->remote_addr.svm_port = get_port_by_srv_id(if_instance);
>> > + ret = vsock_remote_addr_update_cid_port(vnew, VMADDR_CID_HOST,
>> > + get_port_by_srv_id(if_instance));
>> > + if (ret < 0)
>> > + goto out;
>> > +
>> > ret = vsock_assign_transport(vnew, vsock_sk(sk));
>> > /* Transport assigned (looking at remote_addr) must be the
>> > * same where we received the request.
>> > @@ -459,13 +461,18 @@ static int hvs_connect(struct vsock_sock *vsk)
>> > {
>> > union hvs_service_id vm, host;
>> > struct hvsock *h = vsk->trans;
>> > + int err;
>> >
>> > vm.srv_id = srv_id_template;
>> > vm.svm_port = vsk->local_addr.svm_port;
>> > h->vm_srv_id = vm.srv_id;
>> >
>> > host.srv_id = srv_id_template;
>> > - host.svm_port = vsk->remote_addr.svm_port;
>> > +
>> > + err = vsock_remote_addr_port(vsk, &host.svm_port);
>> > + if (err < 0)
>> > + return err;
>> > +
>> > h->host_srv_id = host.srv_id;
>> >
>> > return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
>> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> > index 925acface893..1b87704e516a 100644
>> > --- a/net/vmw_vsock/virtio_transport_common.c
>> > +++ b/net/vmw_vsock/virtio_transport_common.c
>> > @@ -258,8 +258,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>> > src_cid = t_ops->transport.get_local_cid();
>> > src_port = vsk->local_addr.svm_port;
>> > if (!info->remote_cid) {
>> > - dst_cid = vsk->remote_addr.svm_cid;
>> > - dst_port = vsk->remote_addr.svm_port;
>> > + ret = vsock_remote_addr_cid_port(vsk, &dst_cid, &dst_port);
>> > + if (ret < 0)
>> > + return ret;
>> > } else {
>> > dst_cid = info->remote_cid;
>> > dst_port = info->remote_port;
>> > @@ -1169,7 +1170,9 @@ virtio_transport_recv_connecting(struct sock *sk,
>> > case VIRTIO_VSOCK_OP_RESPONSE:
>> > sk->sk_state = TCP_ESTABLISHED;
>> > sk->sk_socket->state = SS_CONNECTED;
>> > - vsock_insert_connected(vsk);
>> > + err = vsock_insert_connected(vsk);
>> > + if (err)
>> > + goto destroy;
>> > sk->sk_state_change(sk);
>> > break;
>> > case VIRTIO_VSOCK_OP_INVALID:
>> > @@ -1403,9 +1406,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>> > vchild = vsock_sk(child);
>> > vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
>> > le32_to_cpu(hdr->dst_port));
>> > - vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
>> > - le32_to_cpu(hdr->src_port));
>> > -
>> > + vsock_remote_addr_update_cid_port(vchild, le64_to_cpu(hdr->src_cid),
>> > + le32_to_cpu(hdr->src_port));
>> > ret = vsock_assign_transport(vchild, vsk);
>> > /* Transport assigned (looking at remote_addr) must be the same
>> > * where we received the request.
>> > @@ -1420,7 +1422,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>> > if (virtio_transport_space_update(child, skb))
>> > child->sk_write_space(child);
>> >
>> > - vsock_insert_connected(vchild);
>> > + ret = vsock_insert_connected(vchild);
>> > + if (ret) {
>> > + release_sock(child);
>> > + virtio_transport_reset_no_sock(t, skb);
>> > + sock_put(child);
>> > + return ret;
>> > + }
>> > vsock_enqueue_accept(sk, child);
>> > virtio_transport_send_response(vchild, skb);
>> >
>> > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>> > index b370070194fa..c0c445e7d925 100644
>> > --- a/net/vmw_vsock/vmci_transport.c
>> > +++ b/net/vmw_vsock/vmci_transport.c
>> > @@ -283,18 +283,25 @@ vmci_transport_send_control_pkt(struct sock *sk,
>> > u16 proto,
>> > struct vmci_handle handle)
>> > {
>> > + struct sockaddr_vm addr_stack;
>> > + struct sockaddr_vm *remote_addr = &addr_stack;
>> > struct vsock_sock *vsk;
>> > + int err;
>> >
>> > vsk = vsock_sk(sk);
>> >
>> > if (!vsock_addr_bound(&vsk->local_addr))
>> > return -EINVAL;
>> >
>> > - if (!vsock_addr_bound(&vsk->remote_addr))
>> > + if (!vsock_remote_addr_bound(vsk))
>> > return -EINVAL;
>> >
>> > + err = vsock_remote_addr_copy(vsk, &addr_stack);
>> > + if (err < 0)
>> > + return err;
>> > +
>> > return vmci_transport_alloc_send_control_pkt(&vsk->local_addr,
>> > - &vsk->remote_addr,
>> > + remote_addr,
>> > type, size, mode,
>> > wait, proto, handle);
>> > }
>> > @@ -317,6 +324,7 @@ static int vmci_transport_send_reset(struct sock *sk,
>> > struct sockaddr_vm *dst_ptr;
>> > struct sockaddr_vm dst;
>> > struct vsock_sock *vsk;
>> > + int err;
>> >
>> > if (pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST)
>> > return 0;
>> > @@ -326,13 +334,16 @@ static int vmci_transport_send_reset(struct sock *sk,
>> > if (!vsock_addr_bound(&vsk->local_addr))
>> > return -EINVAL;
>> >
>> > - if (vsock_addr_bound(&vsk->remote_addr)) {
>> > - dst_ptr = &vsk->remote_addr;
>> > + if (vsock_remote_addr_bound(vsk)) {
>> > + err = vsock_remote_addr_copy(vsk, &dst);
>> > + if (err < 0)
>> > + return err;
>> > } else {
>> > vsock_addr_init(&dst, pkt->dg.src.context,
>> > pkt->src_port);
>> > - dst_ptr = &dst;
>> > }
>> > + dst_ptr = &dst;
>> > +
>> > return vmci_transport_alloc_send_control_pkt(&vsk->local_addr, dst_ptr,
>> > VMCI_TRANSPORT_PACKET_TYPE_RST,
>> > 0, 0, NULL, VSOCK_PROTO_INVALID,
>> > @@ -490,7 +501,7 @@ static struct sock *vmci_transport_get_pending(
>> >
>> > list_for_each_entry(vpending, &vlistener->pending_links,
>> > pending_links) {
>> > - if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
>> > + if (vsock_remote_addr_equals(vpending, &src) &&
>> > pkt->dst_port == vpending->local_addr.svm_port) {
>> > pending = sk_vsock(vpending);
>> > sock_hold(pending);
>> > @@ -1015,8 +1026,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
>> >
>> > vsock_addr_init(&vpending->local_addr, pkt->dg.dst.context,
>> > pkt->dst_port);
>> > - vsock_addr_init(&vpending->remote_addr, pkt->dg.src.context,
>> > - pkt->src_port);
>> > + vsock_remote_addr_update_cid_port(vpending, pkt->dg.src.context,
>> > + pkt->src_port);
>> >
>> > err = vsock_assign_transport(vpending, vsock_sk(sk));
>> > /* Transport assigned (looking at remote_addr) must be the same
>> > @@ -1133,6 +1144,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
>> > {
>> > struct vsock_sock *vpending;
>> > struct vmci_handle handle;
>> > + unsigned int vpending_remote_cid;
>> > struct vmci_qp *qpair;
>> > bool is_local;
>> > u32 flags;
>> > @@ -1189,8 +1201,13 @@ vmci_transport_recv_connecting_server(struct sock *listener,
>> > /* vpending->local_addr always has a context id so we do not need to
>> > * worry about VMADDR_CID_ANY in this case.
>> > */
>> > - is_local =
>> > - vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
>> > + err = vsock_remote_addr_cid(vpending, &vpending_remote_cid);
>> > + if (err < 0) {
>> > + skerr = EPROTO;
>> > + goto destroy;
>> > + }
>> > +
>> > + is_local = vpending_remote_cid == vpending->local_addr.svm_cid;
>> > flags = VMCI_QPFLAG_ATTACH_ONLY;
>> > flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
>> >
>> > @@ -1203,7 +1220,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
>> > flags,
>> > vmci_transport_is_trusted(
>> > vpending,
>> > - vpending->remote_addr.svm_cid));
>> > + vpending_remote_cid));
>> > if (err < 0) {
>> > vmci_transport_send_reset(pending, pkt);
>> > skerr = -err;
>> > @@ -1306,9 +1323,20 @@ vmci_transport_recv_connecting_client(struct sock *sk,
>> > break;
>> > case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE:
>> > case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE2:
>> > + struct sockaddr_vm_rcu *remote_addr;
>> > +
>> > + rcu_read_lock();
>> > + remote_addr = rcu_dereference(vsk->remote_addr);
>> > + if (!remote_addr) {
>> > + skerr = EPROTO;
>> > + err = -EINVAL;
>> > + rcu_read_unlock();
>> > + goto destroy;
>> > + }
>> > +
>> > if (pkt->u.size == 0
>> > - || pkt->dg.src.context != vsk->remote_addr.svm_cid
>> > - || pkt->src_port != vsk->remote_addr.svm_port
>> > + || pkt->dg.src.context != remote_addr->addr.svm_cid
>> > + || pkt->src_port != remote_addr->addr.svm_port
>> > || !vmci_handle_is_invalid(vmci_trans(vsk)->qp_handle)
>> > || vmci_trans(vsk)->qpair
>> > || vmci_trans(vsk)->produce_size != 0
>> > @@ -1316,9 +1344,10 @@ vmci_transport_recv_connecting_client(struct sock *sk,
>> > || vmci_trans(vsk)->detach_sub_id != VMCI_INVALID_ID) {
>> > skerr = EPROTO;
>> > err = -EINVAL;
>> > -
>> > + rcu_read_unlock();
>> > goto destroy;
>> > }
>> > + rcu_read_unlock();
>> >
>> > err = vmci_transport_recv_connecting_client_negotiate(sk, pkt);
>> > if (err) {
>> > @@ -1379,6 +1408,7 @@ static int vmci_transport_recv_connecting_client_negotiate(
>> > int err;
>> > struct vsock_sock *vsk;
>> > struct vmci_handle handle;
>> > + unsigned int remote_cid;
>> > struct vmci_qp *qpair;
>> > u32 detach_sub_id;
>> > bool is_local;
>> > @@ -1449,19 +1479,23 @@ static int vmci_transport_recv_connecting_client_negotiate(
>> >
>> > /* Make VMCI select the handle for us. */
>> > handle = VMCI_INVALID_HANDLE;
>> > - is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
>> > +
>> > + err = vsock_remote_addr_cid(vsk, &remote_cid);
>> > + if (err < 0)
>> > + goto destroy;
>> > +
>> > + is_local = remote_cid == vsk->local_addr.svm_cid;
>> > flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
>> >
>> > err = vmci_transport_queue_pair_alloc(&qpair,
>> > &handle,
>> > pkt->u.size,
>> > pkt->u.size,
>> > - vsk->remote_addr.svm_cid,
>> > + remote_cid,
>> > flags,
>> > vmci_transport_is_trusted(
>> > vsk,
>> > - vsk->
>> > - remote_addr.svm_cid));
>> > + remote_cid));
>> > if (err < 0)
>> > goto destroy;
>> >
>> >
>> > --
>> > 2.30.2
>> >
>>
>
On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote:
>CC'ing [email protected] because this thread is starting
>to touch the spec.
>
>On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote:
>> Hi Bobby,
>>
>> On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
>> > CC'ing Cong.
>> >
>> > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
>> > > Hey all!
>> > >
>> > > This series introduces support for datagrams to virtio/vsock.
>>
>> Great! Thanks for restarting this work!
>>
>
>No problem!
>
>> > >
>> > > It is a spin-off (and smaller version) of this series from the summer:
>> > > https://lore.kernel.org/all/[email protected]/
>> > >
>> > > Please note that this is an RFC and should not be merged until
>> > > associated changes are made to the virtio specification, which will
>> > > follow after discussion from this series.
>> > >
>> > > This series first supports datagrams in a basic form for virtio, and
>> > > then optimizes the sendpath for all transports.
>> > >
>> > > The result is a very fast datagram communication protocol that
>> > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
>> > > of multi-threaded workload samples.
>> > >
>> > > For those that are curious, some summary data comparing UDP and VSOCK
>> > > DGRAM (N=5):
>> > >
>> > > vCPUS: 16
>> > > virtio-net queues: 16
>> > > payload size: 4KB
>> > > Setup: bare metal + vm (non-nested)
>> > >
>> > > UDP: 287.59 MB/s
>> > > VSOCK DGRAM: 509.2 MB/s
>> > >
>> > > Some notes about the implementation...
>> > >
>> > > This datagram implementation forces datagrams to self-throttle according
>> > > to the threshold set by sk_sndbuf. It behaves similar to the credits
>> > > used by streams in its effect on throughput and memory consumption, but
>> > > it is not influenced by the receiving socket as credits are.
>>
>> So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right?
>>
>
>Correct.
>
>> We should check if VMCI behaves the same.
>>
>> > >
>> > > The device drops packets silently. There is room for improvement by
>> > > building into the device and driver some intelligence around how to
>> > > reduce frequency of kicking the virtqueue when packet loss is high. I
>> > > think there is a good discussion to be had on this.
>>
>> Can you elaborate a bit here?
>>
>> Do you mean some mechanism to report to the sender that a destination
>> (cid, port) is full so the packet will be dropped?
>>
>
>Correct. There is also the case of there being no receiver at all for
>this address since this case isn't rejected upon connect(). Ideally,
>such a socket (which will have 100% packet loss) will be throttled
>aggressively.
>
>Before we go down too far on this path, I also want to clarify that
>using UDP over vhost/virtio-net also has this property... this can be
>observed by using tcpdump to dump the UDP packets on the bridge network
>your VM is using. UDP packets sent to a garbage address can be seen on
>the host bridge (this is the nature of UDP, how is the host supposed to
>know the address eventually goes nowhere). I mention the above because I
>think it is possible for vsock to avoid this cost, given that it
>benefits from being point-to-point and g2h/h2g.
>
>If we're okay with vsock being on par, then the current series does
>that. I propose something below that can be added later and maybe
>negotiated as a feature bit too.
I see and I agree on that, let's do it step by step.
If we can do it in the first phase is great, but I think is fine to add
this feature later.
>
>> Can we adapt the credit mechanism?
>>
>
>I've thought about this a lot because the attraction of the approach for
>me would be that we could get the wait/buffer-limiting logic for free
>and without big changes to the protocol, but the problem is that the
>unreliable nature of datagrams means that the source's free-running
>tx_cnt will become out-of-sync with the destination's fwd_cnt upon
>packet loss.
We need to understand where the packet can be lost.
If the packet always reaches the destination (vsock driver or device),
we can discard it, but also update the counters.
>
>Imagine a source that initializes and starts sending packets before a
>destination socket even is created, the source's self-throttling will be
>dysfunctional because its tx_cnt will always far exceed the
>destination's fwd_cnt.
Right, the other problem I see is that the socket aren't connected, so
we have 1-N relationship.
>
>We could play tricks with the meaning of the CREDIT_UPDATE message and
>fwd_cnt/buf_alloc fields, but I don't think we want to go down that
>path.
>
>I think that the best and simplest approach introduces a congestion
>notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the
>destination sends this notification. At a given repeated time period T,
>the source can check if it has received any notifications in the last T.
>If so, it halves its buffer allocation. If not, it doubles its buffer
>allocation unless it is already at its max or original value.
>
>An "invalid" socket which never has any receiver will converge towards a
>rate limit of one packet per time T * log2(average pkt size). That is, a
>socket with 100% packet loss will only be able to send 16 bytes every
>4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within
>160ms given at least one packet sent per 5ms. I have no idea if that is
>a reasonable default T for vsock, I just pulled it out of a hat for the
>sake of the example.
>
>"Normal" sockets will be responsive to high loss and rebalance during
>low loss. The source is trying to guess and converge on the actual
>buffer state of the destination.
>
>This would reuse the already-existing throttling mechanisms that
>throttle based upon buffer allocation. The usage of sk_sndbuf would have
>to be re-worked. The application using sendmsg() will see EAGAIN when
>throttled, or just sleep if !MSG_DONTWAIT.
I see, it looks interesting, but I think we need to share that
information between multiple sockets, since the same destination
(cid, port), can be reached by multiple sockets.
Another approach could be to have both congestion notification and
decongestion, but maybe it produces double traffic.
>
>I looked at alternative schemes (like the Datagram Congestion Control
>Protocol), but I do not think the added complexity is necessary in the
>case of vsock (DCCP requires congestion windows, sequence numbers, batch
>acknowledgements, etc...). I also looked at UDP-based application
>protocols like TFTP, DHCP, and SIP over UDP which use a delay-based
>backoff mechanism, but seem to require acknowledgement for those packet
>types, which trigger the retries and backoffs. I think we can get away
>with the simpler approach and not have to potentially kill performance
>with per-packet acknowledgements.
Yep I agree. I think our advantage is that the channel (virtqueues),
can't lose packets.
>
>> > >
>> > > In this series I am also proposing that fairness be reexamined as an
>> > > issue separate from datagrams, which differs from my previous series
>> > > that coupled these issues. After further testing and reflection on the
>> > > design, I do not believe that these need to be coupled and I do not
>> > > believe this implementation introduces additional unfairness or
>> > > exacerbates pre-existing unfairness.
>>
>> I see.
>>
>> > >
>> > > I attempted to characterize vsock fairness by using a pool of processes
>> > > to stress test the shared resources while measuring the performance of a
>> > > lone stream socket. Given unfair preference for datagrams, we would
>> > > assume that a lone stream socket would degrade much more when a pool of
>> > > datagram sockets was stressing the system than when a pool of stream
>> > > sockets are stressing the system. The result, however, showed no
>> > > significant difference between the degradation of throughput of the lone
>> > > stream socket when using a pool of datagrams to stress the queue over
>> > > using a pool of streams. The absolute difference in throughput actually
>> > > favored datagrams as interfering least as the mean difference was +16%
>> > > compared to using streams to stress test (N=7), but it was not
>> > > statistically significant. Workloads were matched for payload size and
>> > > buffer size (to approximate memory consumption) and process count, and
>> > > stress workloads were configured to start before and last long after the
>> > > lifetime of the "lone" stream socket flow to ensure that competing flows
>> > > were continuously hot.
>> > >
>> > > Given the above data, I propose that vsock fairness be addressed
>> > > independent of datagrams and to defer its implementation to a future
>> > > series.
>>
>> Makes sense to me.
>>
>> I left some preliminary comments, anyway now it seems reasonable to use
>> the same virtqueues, so we can go head with the spec proposal.
>>
>> Thanks,
>> Stefano
>>
>
>Thanks for the review!
You're welcome!
Stefano
On Fri, Apr 28, 2023 at 12:43:09PM +0200, Stefano Garzarella wrote:
> On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote:
> > CC'ing [email protected] because this thread is starting
> > to touch the spec.
> >
> > On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote:
> > > Hi Bobby,
> > >
> > > On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
> > > > CC'ing Cong.
> > > >
> > > > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
> > > > > Hey all!
> > > > >
> > > > > This series introduces support for datagrams to virtio/vsock.
> > >
> > > Great! Thanks for restarting this work!
> > >
> >
> > No problem!
> >
> > > > >
> > > > > It is a spin-off (and smaller version) of this series from the summer:
> > > > > https://lore.kernel.org/all/[email protected]/
> > > > >
> > > > > Please note that this is an RFC and should not be merged until
> > > > > associated changes are made to the virtio specification, which will
> > > > > follow after discussion from this series.
> > > > >
> > > > > This series first supports datagrams in a basic form for virtio, and
> > > > > then optimizes the sendpath for all transports.
> > > > >
> > > > > The result is a very fast datagram communication protocol that
> > > > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
> > > > > of multi-threaded workload samples.
> > > > >
> > > > > For those that are curious, some summary data comparing UDP and VSOCK
> > > > > DGRAM (N=5):
> > > > >
> > > > > vCPUS: 16
> > > > > virtio-net queues: 16
> > > > > payload size: 4KB
> > > > > Setup: bare metal + vm (non-nested)
> > > > >
> > > > > UDP: 287.59 MB/s
> > > > > VSOCK DGRAM: 509.2 MB/s
> > > > >
> > > > > Some notes about the implementation...
> > > > >
> > > > > This datagram implementation forces datagrams to self-throttle according
> > > > > to the threshold set by sk_sndbuf. It behaves similar to the credits
> > > > > used by streams in its effect on throughput and memory consumption, but
> > > > > it is not influenced by the receiving socket as credits are.
> > >
> > > So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right?
> > >
> >
> > Correct.
> >
> > > We should check if VMCI behaves the same.
> > >
> > > > >
> > > > > The device drops packets silently. There is room for improvement by
> > > > > building into the device and driver some intelligence around how to
> > > > > reduce frequency of kicking the virtqueue when packet loss is high. I
> > > > > think there is a good discussion to be had on this.
> > >
> > > Can you elaborate a bit here?
> > >
> > > Do you mean some mechanism to report to the sender that a destination
> > > (cid, port) is full so the packet will be dropped?
> > >
> >
> > Correct. There is also the case of there being no receiver at all for
> > this address since this case isn't rejected upon connect(). Ideally,
> > such a socket (which will have 100% packet loss) will be throttled
> > aggressively.
> >
> > Before we go down too far on this path, I also want to clarify that
> > using UDP over vhost/virtio-net also has this property... this can be
> > observed by using tcpdump to dump the UDP packets on the bridge network
> > your VM is using. UDP packets sent to a garbage address can be seen on
> > the host bridge (this is the nature of UDP, how is the host supposed to
> > know the address eventually goes nowhere). I mention the above because I
> > think it is possible for vsock to avoid this cost, given that it
> > benefits from being point-to-point and g2h/h2g.
> >
> > If we're okay with vsock being on par, then the current series does
> > that. I propose something below that can be added later and maybe
> > negotiated as a feature bit too.
>
> I see and I agree on that, let's do it step by step.
> If we can do it in the first phase is great, but I think is fine to add
> this feature later.
>
> >
> > > Can we adapt the credit mechanism?
> > >
> >
> > I've thought about this a lot because the attraction of the approach for
> > me would be that we could get the wait/buffer-limiting logic for free
> > and without big changes to the protocol, but the problem is that the
> > unreliable nature of datagrams means that the source's free-running
> > tx_cnt will become out-of-sync with the destination's fwd_cnt upon
> > packet loss.
>
> We need to understand where the packet can be lost.
> If the packet always reaches the destination (vsock driver or device),
> we can discard it, but also update the counters.
>
> >
> > Imagine a source that initializes and starts sending packets before a
> > destination socket even is created, the source's self-throttling will be
> > dysfunctional because its tx_cnt will always far exceed the
> > destination's fwd_cnt.
>
> Right, the other problem I see is that the socket aren't connected, so
> we have 1-N relationship.
>
Oh yeah, good point.
> >
> > We could play tricks with the meaning of the CREDIT_UPDATE message and
> > fwd_cnt/buf_alloc fields, but I don't think we want to go down that
> > path.
> >
> > I think that the best and simplest approach introduces a congestion
> > notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the
> > destination sends this notification. At a given repeated time period T,
> > the source can check if it has received any notifications in the last T.
> > If so, it halves its buffer allocation. If not, it doubles its buffer
> > allocation unless it is already at its max or original value.
> >
> > An "invalid" socket which never has any receiver will converge towards a
> > rate limit of one packet per time T * log2(average pkt size). That is, a
> > socket with 100% packet loss will only be able to send 16 bytes every
> > 4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within
> > 160ms given at least one packet sent per 5ms. I have no idea if that is
> > a reasonable default T for vsock, I just pulled it out of a hat for the
> > sake of the example.
> >
> > "Normal" sockets will be responsive to high loss and rebalance during
> > low loss. The source is trying to guess and converge on the actual
> > buffer state of the destination.
> >
> > This would reuse the already-existing throttling mechanisms that
> > throttle based upon buffer allocation. The usage of sk_sndbuf would have
> > to be re-worked. The application using sendmsg() will see EAGAIN when
> > throttled, or just sleep if !MSG_DONTWAIT.
>
> I see, it looks interesting, but I think we need to share that
> information between multiple sockets, since the same destination
> (cid, port), can be reached by multiple sockets.
>
Good point, that is true.
> Another approach could be to have both congestion notification and
> decongestion, but maybe it produces double traffic.
>
I think this could simplify things and could reduce noise. It is also
probably sufficient for the source to simply halt upon congestion
notification and resume upon decongestion notification, instead of
scaling up and down like I suggested above. It also avoids the
burstiness that would occur with a "congestion notification"-only
approach where the source guesses when to resume and guesses wrong.
The congestion notification may want to have an expiration period after
which the sender can resume without receiving a decongestion
notification? If it receives congestion again, then it can halt again.
> >
> > I looked at alternative schemes (like the Datagram Congestion Control
> > Protocol), but I do not think the added complexity is necessary in the
> > case of vsock (DCCP requires congestion windows, sequence numbers, batch
> > acknowledgements, etc...). I also looked at UDP-based application
> > protocols like TFTP, DHCP, and SIP over UDP which use a delay-based
> > backoff mechanism, but seem to require acknowledgement for those packet
> > types, which trigger the retries and backoffs. I think we can get away
> > with the simpler approach and not have to potentially kill performance
> > with per-packet acknowledgements.
>
> Yep I agree. I think our advantage is that the channel (virtqueues),
> can't lose packets.
>
Exactly.
> >
> > > > >
> > > > > In this series I am also proposing that fairness be reexamined as an
> > > > > issue separate from datagrams, which differs from my previous series
> > > > > that coupled these issues. After further testing and reflection on the
> > > > > design, I do not believe that these need to be coupled and I do not
> > > > > believe this implementation introduces additional unfairness or
> > > > > exacerbates pre-existing unfairness.
> > >
> > > I see.
> > >
> > > > >
> > > > > I attempted to characterize vsock fairness by using a pool of processes
> > > > > to stress test the shared resources while measuring the performance of a
> > > > > lone stream socket. Given unfair preference for datagrams, we would
> > > > > assume that a lone stream socket would degrade much more when a pool of
> > > > > datagram sockets was stressing the system than when a pool of stream
> > > > > sockets are stressing the system. The result, however, showed no
> > > > > significant difference between the degradation of throughput of the lone
> > > > > stream socket when using a pool of datagrams to stress the queue over
> > > > > using a pool of streams. The absolute difference in throughput actually
> > > > > favored datagrams as interfering least as the mean difference was +16%
> > > > > compared to using streams to stress test (N=7), but it was not
> > > > > statistically significant. Workloads were matched for payload size and
> > > > > buffer size (to approximate memory consumption) and process count, and
> > > > > stress workloads were configured to start before and last long after the
> > > > > lifetime of the "lone" stream socket flow to ensure that competing flows
> > > > > were continuously hot.
> > > > >
> > > > > Given the above data, I propose that vsock fairness be addressed
> > > > > independent of datagrams and to defer its implementation to a future
> > > > > series.
> > >
> > > Makes sense to me.
> > >
> > > I left some preliminary comments, anyway now it seems reasonable to use
> > > the same virtqueues, so we can go head with the spec proposal.
> > >
> > > Thanks,
> > > Stefano
> > >
> >
> > Thanks for the review!
>
> You're welcome!
>
> Stefano
>
Best,
Bobby
On Fri, Apr 28, 2023 at 12:29:50PM +0200, Stefano Garzarella wrote:
> On Sat, Apr 15, 2023 at 10:30:55AM +0000, Bobby Eshleman wrote:
> > On Wed, Apr 19, 2023 at 11:30:53AM +0200, Stefano Garzarella wrote:
> > > On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
> > > > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
> > > > it does not scale when many senders share a socket.
> > > >
> > > > Prior to this patch the socket lock is used to protect the local_addr,
> > > > remote_addr, transport, and buffer size variables. What follows are the
> > > > new protection schemes for the various protected fields that ensure a
> > > > race-free multi-sender sendmsg() path for vsock dgrams.
> > > >
> > > > - local_addr
> > > > local_addr changes as a result of binding a socket. The write path
> > > > for local_addr is bind() and various vsock_auto_bind() call sites.
> > > > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
> > > > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
> > > > rejects the user request and vsock_auto_bind() early exits.
> > > > Therefore, the local addr can not change while a parallel thread is
> > > > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
> > > > Change: only acquire lock for auto-binding as-needed in sendmsg().
> > > >
> > > > - vsk->transport
> > > > Updated upon socket creation and it doesn't change again until the
> > >
> > > This is true only for dgram, right?
> > >
> >
> > Yes.
> >
> > > How do we decide which transport to assign for dgram?
> > >
> >
> > The transport is assigned in proto->create() [vsock_create()]. It is
> > assigned there *only* for dgrams, whereas for streams/seqpackets it is
> > assigned in connect(). vsock_create() sets transport to
> > 'transport_dgram' if sock->type == SOCK_DGRAM.
> >
> > vsock_sk_destruct() then eventually sets vsk->transport to NULL.
> >
> > Neither destruct nor create can occur in parallel with sendmsg().
> > create() hasn't yet returned the sockfd for the user to call upon it
> > sendmsg(), and AFAICT destruct is only called after the final socket
> > reference is released, which only happens after the socket no longer
> > exists in the fd lookup table and so sendmsg() will fail before it ever
> > has the chance to race.
>
> This is okay if a socket can be assigned to a single transport, but with
> dgrams I'm not sure we can do that.
>
> Since it is not connected, a socket can send or receive packets from
> different transports, so maybe we can't assign it to a specific one,
> but do a lookup for every packets to understand which transport to use.
>
Yes this is true, this lookup needs to be implemented... currently
sendmsg() doesn't do this at all. It grabs the remote_addr when passed
in from sendto(), but then just uses the same old transport from vsk.
You are right that sendto() should be a per-packet lookup, not a
vsk->transport read. Perhaps I should add that as another patch in this
series, and make it precede this one?
For the send() / sendto(NULL) case where vsk->transport is being read, I
do believe this is still race-free, but...
If we later support dynamic transports for datagram, such that
connect(VMADDR_LOCAL_CID) sets vsk->transport to transport_loopback,
connect(VMADDR_CID_HOST) sets vsk->transport to something like
transport_datagram_g2h, etc..., then vsk->transport will need to be
bundled into the RCU-protected pointer too, since it may change when
remote_addr changes.
> >
> > > > socket is destroyed, which only happens after the socket refcnt reaches
> > > > zero. This prevents any sendmsg() call from being entered because the
> > > > sockfd lookup fails beforehand. That is, sendmsg() and vsk->transport
> > > > writes cannot execute in parallel. Additionally, connect() doesn't
> > > > update vsk->transport for dgrams as it does for streams. Therefore
> > > > vsk->transport is also safe to access lock-free in the sendmsg() path.
> > > > No change.
> > > >
> > > > - buffer size variables
> > > > Not used by dgram, so they do not need protection. No change.
> > >
> > > Is this true because for dgram we use the socket buffer?
> > > Is it the same for VMCI?
> >
> > Yes. The buf_alloc derived from buffer_size is also always ignored after
> > being initialized once since credits aren't used.
> >
> > My reading of the VMCI code is that the buffer_size and
> > buffer_{min,max}_size variables are used only in connectible calls (like
> > connect and recv_listen), but not for datagrams.
>
> Okay, thanks for checking!
>
> >
> > >
> > > >
> > > > - remote_addr
> > > > Needs additional protection because before this patch the
> > > > remote_addr (consisting of several fields such as cid, port, and flags)
> > > > only changed atomically under socket lock context. By acquiring the
> > > > socket lock to read the structure, the changes made by connect() were
> > > > always made visible to sendmsg() atomically. Consequently, to retain
> > > > atomicity of updates but offer lock-free access, this patch
> > > > redesigns this field as an RCU-protected pointer.
> > > >
> > > > Writers are still synchronized using the socket lock, but readers
> > > > only read inside RCU read-side critical sections.
> > > >
> > > > Helpers are introduced for accessing and updating the new pointer.
> > > >
> > > > The remote_addr structure is wrapped together with an rcu_head into a
> > > > sockaddr_vm_rcu structure so that kfree_rcu() can be used. This removes
> > > > the need of writers to use synchronize_rcu() after freeing old structures
> > > > which is simply more efficient and reduces code churn where remote_addr
> > > > is already being updated inside read-side sections.
> > > >
> > > > Only virtio has been tested, but updates were necessary to the VMCI and
> > > > hyperv code. Unfortunately the author does not have access to
> > > > VMCI/hyperv systems so those changes are untested.
> > > >
> > > > Perf Tests
> > > > vCPUS: 16
> > > > Threads: 16
> > > > Payload: 4KB
> > > > Test Runs: 5
> > > > Type: SOCK_DGRAM
> > > >
> > > > Before: 245.2 MB/s
> > > > After: 509.2 MB/s (+107%)
> > > >
> > > > Notably, on the same test system, vsock dgram even outperforms
> > > > multi-threaded UDP over virtio-net with vhost and MQ support enabled.
> > >
> > > Cool!
> > >
> > > This patch is quite large, so I need to review it carefully in future
> > > versions, but in general it makes sense to me.
> > >
> > > Thanks,
> > > Stefano
> > >
> >
> > Thanks for the initial comments!
> >
> > Best,
> > Bobby
> >
> > > >
> > > > Throughput metrics for single-threaded SOCK_DGRAM and
> > > > single/multi-threaded SOCK_STREAM showed no statistically signficant
> > > > throughput changes (lowest p-value reaching 0.27), with the range of the
> > > > mean difference ranging between -5% to +1%.
> > > >
> > > > Signed-off-by: Bobby Eshleman <[email protected]>
> > > > ---
> > > > drivers/vhost/vsock.c | 12 +-
> > > > include/net/af_vsock.h | 19 ++-
> > > > net/vmw_vsock/af_vsock.c | 261 ++++++++++++++++++++++++++++----
> > > > net/vmw_vsock/diag.c | 10 +-
> > > > net/vmw_vsock/hyperv_transport.c | 15 +-
> > > > net/vmw_vsock/virtio_transport_common.c | 22 ++-
> > > > net/vmw_vsock/vmci_transport.c | 70 ++++++---
> > > > 7 files changed, 344 insertions(+), 65 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > index 028cf079225e..da105cb856ac 100644
> > > > --- a/drivers/vhost/vsock.c
> > > > +++ b/drivers/vhost/vsock.c
> > > > @@ -296,13 +296,17 @@ static int
> > > > vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> > > > {
> > > > struct vhost_vsock *vsock;
> > > > + unsigned int cid;
> > > > int cnt = 0;
> > > > int ret = -ENODEV;
> > > >
> > > > rcu_read_lock();
> > > > + ret = vsock_remote_addr_cid(vsk, &cid);
> > > > + if (ret < 0)
> > > > + goto out;
> > > >
> > > > /* Find the vhost_vsock according to guest context id */
> > > > - vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
> > > > + vsock = vhost_vsock_get(cid);
> > > > if (!vsock)
> > > > goto out;
> > > >
> > > > @@ -686,6 +690,10 @@ static void vhost_vsock_flush(struct vhost_vsock *vsock)
> > > > static void vhost_vsock_reset_orphans(struct sock *sk)
> > > > {
> > > > struct vsock_sock *vsk = vsock_sk(sk);
> > > > + unsigned int cid;
> > > > +
> > > > + if (vsock_remote_addr_cid(vsk, &cid) < 0)
> > > > + return;
> > > >
> > > > /* vmci_transport.c doesn't take sk_lock here either. At least we're
> > > > * under vsock_table_lock so the sock cannot disappear while we're
> > > > @@ -693,7 +701,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
> > > > */
> > > >
> > > > /* If the peer is still valid, no need to reset connection */
> > > > - if (vhost_vsock_get(vsk->remote_addr.svm_cid))
> > > > + if (vhost_vsock_get(cid))
> > > > return;
> > > >
> > > > /* If the close timeout is pending, let it expire. This avoids races
> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > index 57af28fede19..c02fd6ad0047 100644
> > > > --- a/include/net/af_vsock.h
> > > > +++ b/include/net/af_vsock.h
> > > > @@ -25,12 +25,17 @@ extern spinlock_t vsock_table_lock;
> > > > #define vsock_sk(__sk) ((struct vsock_sock *)__sk)
> > > > #define sk_vsock(__vsk) (&(__vsk)->sk)
> > > >
> > > > +struct sockaddr_vm_rcu {
> > > > + struct sockaddr_vm addr;
> > > > + struct rcu_head rcu;
> > > > +};
> > > > +
> > > > struct vsock_sock {
> > > > /* sk must be the first member. */
> > > > struct sock sk;
> > > > const struct vsock_transport *transport;
> > > > struct sockaddr_vm local_addr;
> > > > - struct sockaddr_vm remote_addr;
> > > > + struct sockaddr_vm_rcu * __rcu remote_addr;
> > > > /* Links for the global tables of bound and connected sockets. */
> > > > struct list_head bound_table;
> > > > struct list_head connected_table;
> > > > @@ -206,7 +211,7 @@ void vsock_release_pending(struct sock *pending);
> > > > void vsock_add_pending(struct sock *listener, struct sock *pending);
> > > > void vsock_remove_pending(struct sock *listener, struct sock *pending);
> > > > void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
> > > > -void vsock_insert_connected(struct vsock_sock *vsk);
> > > > +int vsock_insert_connected(struct vsock_sock *vsk);
> > > > void vsock_remove_bound(struct vsock_sock *vsk);
> > > > void vsock_remove_connected(struct vsock_sock *vsk);
> > > > struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
> > > > @@ -244,4 +249,14 @@ static inline void __init vsock_bpf_build_proto(void)
> > > > {}
> > > > #endif
> > > >
> > > > +/* RCU-protected remote addr helpers */
> > > > +int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid);
> > > > +int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port);
> > > > +int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
> > > > + unsigned int *port);
> > > > +int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest);
> > > > +bool vsock_remote_addr_bound(struct vsock_sock *vsk);
> > > > +bool vsock_remote_addr_equals(struct vsock_sock *vsk, struct sockaddr_vm *other);
> > > > +int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port);
> > > > +
> > > > #endif /* __AF_VSOCK_H__ */
> > > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > > > index 46b3f35e3adc..93b4abbf20b4 100644
> > > > --- a/net/vmw_vsock/af_vsock.c
> > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > @@ -145,6 +145,139 @@ static const struct vsock_transport *transport_local;
> > > > static DEFINE_MUTEX(vsock_register_mutex);
> > > >
> > > > /**** UTILS ****/
> > > > +bool vsock_remote_addr_bound(struct vsock_sock *vsk)
> > > > +{
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > + bool ret;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (!remote_addr) {
> > > > + rcu_read_unlock();
> > > > + return false;
> > > > + }
> > > > +
> > > > + ret = vsock_addr_bound(&remote_addr->addr);
> > > > + rcu_read_unlock();
> > > > +
> > > > + return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_bound);
> > > > +
> > > > +int vsock_remote_addr_copy(struct vsock_sock *vsk, struct sockaddr_vm *dest)
> > > > +{
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (!remote_addr) {
> > > > + rcu_read_unlock();
> > > > + return -EINVAL;
> > > > + }
> > > > + memcpy(dest, &remote_addr->addr, sizeof(*dest));
> > > > + rcu_read_unlock();
> > > > +
> > > > + return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_copy);
> > > > +
> > > > +int vsock_remote_addr_cid(struct vsock_sock *vsk, unsigned int *cid)
> > > > +{
> > > > + return vsock_remote_addr_cid_port(vsk, cid, NULL);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_cid);
> > > > +
> > > > +int vsock_remote_addr_port(struct vsock_sock *vsk, unsigned int *port)
> > > > +{
> > > > + return vsock_remote_addr_cid_port(vsk, NULL, port);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_port);
> > > > +
> > > > +int vsock_remote_addr_cid_port(struct vsock_sock *vsk, unsigned int *cid,
> > > > + unsigned int *port)
> > > > +{
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (!remote_addr) {
> > > > + rcu_read_unlock();
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + if (cid)
> > > > + *cid = remote_addr->addr.svm_cid;
> > > > + if (port)
> > > > + *port = remote_addr->addr.svm_port;
> > > > +
> > > > + rcu_read_unlock();
> > > > + return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_cid_port);
> > > > +
> > > > +/* The socket lock must be held by the caller */
> > > > +int vsock_remote_addr_update_cid_port(struct vsock_sock *vsk, u32 cid, u32 port)
> > > > +{
> > > > + struct sockaddr_vm_rcu *old, *new;
> > > > +
> > > > + new = kmalloc(sizeof(*new), GFP_KERNEL);
> > > > + if (!new)
> > > > + return -ENOMEM;
> > > > +
> > > > + rcu_read_lock();
> > > > + old = rcu_dereference(vsk->remote_addr);
> > > > + if (!old) {
> > > > + kfree(new);
> > > > + return -EINVAL;
> > > > + }
> > > > + memcpy(&new->addr, &old->addr, sizeof(new->addr));
> > > > + rcu_read_unlock();
> > > > +
> > > > + new->addr.svm_cid = cid;
> > > > + new->addr.svm_port = port;
> > > > +
> > > > + old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
> > > > + kfree_rcu(old, rcu);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_update_cid_port);
> > > > +
> > > > +/* The socket lock must be held by the caller */
> > > > +int vsock_remote_addr_update(struct vsock_sock *vsk, struct sockaddr_vm *src)
> > > > +{
> > > > + struct sockaddr_vm_rcu *old, *new;
> > > > +
> > > > + new = kmalloc(sizeof(*new), GFP_KERNEL);
> > > > + if (!new)
> > > > + return -ENOMEM;
> > > > +
> > > > + memcpy(&new->addr, src, sizeof(new->addr));
> > > > + old = rcu_replace_pointer(vsk->remote_addr, new, lockdep_sock_is_held(sk_vsock(vsk)));
> > > > + kfree_rcu(old, rcu);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +bool vsock_remote_addr_equals(struct vsock_sock *vsk,
> > > > + struct sockaddr_vm *other)
> > > > +{
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > + bool equals;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (!remote_addr) {
> > > > + rcu_read_unlock();
> > > > + return false;
> > > > + }
> > > > +
> > > > + equals = vsock_addr_equals_addr(&remote_addr->addr, other);
> > > > + rcu_read_unlock();
> > > > +
> > > > + return equals;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vsock_remote_addr_equals);
> > > >
> > > > /* Each bound VSocket is stored in the bind hash table and each connected
> > > > * VSocket is stored in the connected hash table.
> > > > @@ -254,10 +387,16 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
> > > >
> > > > list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
> > > > connected_table) {
> > > > - if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (vsock_addr_equals_addr(src, &remote_addr->addr) &&
> > > > dst->svm_port == vsk->local_addr.svm_port) {
> > > > + rcu_read_unlock();
> > > > return sk_vsock(vsk);
> > > > }
> > > > + rcu_read_unlock();
> > > > }
> > > >
> > > > return NULL;
> > > > @@ -270,14 +409,25 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)
> > > > spin_unlock_bh(&vsock_table_lock);
> > > > }
> > > >
> > > > -void vsock_insert_connected(struct vsock_sock *vsk)
> > > > +int vsock_insert_connected(struct vsock_sock *vsk)
> > > > {
> > > > - struct list_head *list = vsock_connected_sockets(
> > > > - &vsk->remote_addr, &vsk->local_addr);
> > > > + struct list_head *list;
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (!remote_addr) {
> > > > + rcu_read_unlock();
> > > > + return -EINVAL;
> > > > + }
> > > > + list = vsock_connected_sockets(&remote_addr->addr, &vsk->local_addr);
> > > > + rcu_read_unlock();
> > > >
> > > > spin_lock_bh(&vsock_table_lock);
> > > > __vsock_insert_connected(list, vsk);
> > > > spin_unlock_bh(&vsock_table_lock);
> > > > +
> > > > + return 0;
> > > > }
> > > > EXPORT_SYMBOL_GPL(vsock_insert_connected);
> > > >
> > > > @@ -438,10 +588,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > > > {
> > > > const struct vsock_transport *new_transport;
> > > > struct sock *sk = sk_vsock(vsk);
> > > > - unsigned int remote_cid = vsk->remote_addr.svm_cid;
> > > > + struct sockaddr_vm remote_addr;
> > > > + unsigned int remote_cid;
> > > > __u8 remote_flags;
> > > > int ret;
> > > >
> > > > + ret = vsock_remote_addr_copy(vsk, &remote_addr);
> > > > + if (ret < 0)
> > > > + return ret;
> > > > +
> > > > + remote_cid = remote_addr.svm_cid;
> > > > +
> > > > /* If the packet is coming with the source and destination CIDs higher
> > > > * than VMADDR_CID_HOST, then a vsock channel where all the packets are
> > > > * forwarded to the host should be established. Then the host will
> > > > @@ -451,10 +608,15 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> > > > * the connect path the flag can be set by the user space application.
> > > > */
> > > > if (psk && vsk->local_addr.svm_cid > VMADDR_CID_HOST &&
> > > > - vsk->remote_addr.svm_cid > VMADDR_CID_HOST)
> > > > - vsk->remote_addr.svm_flags |= VMADDR_FLAG_TO_HOST;
> > > > + remote_addr.svm_cid > VMADDR_CID_HOST) {
> > > > + remote_addr.svm_flags |= VMADDR_CID_HOST;
> > > > +
> > > > + ret = vsock_remote_addr_update(vsk, &remote_addr);
> > > > + if (ret < 0)
> > > > + return ret;
> > > > + }
> > > >
> > > > - remote_flags = vsk->remote_addr.svm_flags;
> > > > + remote_flags = remote_addr.svm_flags;
> > > >
> > > > switch (sk->sk_type) {
> > > > case SOCK_DGRAM:
> > > > @@ -742,6 +904,7 @@ static struct sock *__vsock_create(struct net *net,
> > > > unsigned short type,
> > > > int kern)
> > > > {
> > > > + struct sockaddr_vm *remote_addr;
> > > > struct sock *sk;
> > > > struct vsock_sock *psk;
> > > > struct vsock_sock *vsk;
> > > > @@ -761,7 +924,14 @@ static struct sock *__vsock_create(struct net *net,
> > > >
> > > > vsk = vsock_sk(sk);
> > > > vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > > > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > > > +
> > > > + remote_addr = kmalloc(sizeof(*remote_addr), GFP_KERNEL);
> > > > + if (!remote_addr) {
> > > > + sk_free(sk);
> > > > + return NULL;
> > > > + }
> > > > + vsock_addr_init(remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > > > + rcu_assign_pointer(vsk->remote_addr, remote_addr);
> > > >
> > > > sk->sk_destruct = vsock_sk_destruct;
> > > > sk->sk_backlog_rcv = vsock_queue_rcv_skb;
> > > > @@ -845,6 +1015,7 @@ static void __vsock_release(struct sock *sk, int level)
> > > > static void vsock_sk_destruct(struct sock *sk)
> > > > {
> > > > struct vsock_sock *vsk = vsock_sk(sk);
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > >
> > > > vsock_deassign_transport(vsk);
> > > >
> > > > @@ -852,8 +1023,8 @@ static void vsock_sk_destruct(struct sock *sk)
> > > > * possibly register the address family with the kernel.
> > > > */
> > > > vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > > > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > > > -
> > > > + remote_addr = rcu_replace_pointer(vsk->remote_addr, NULL, 1);
> > > > + kfree_rcu(remote_addr);
> > > > put_cred(vsk->owner);
> > > > }
> > > >
> > > > @@ -943,6 +1114,7 @@ static int vsock_getname(struct socket *sock,
> > > > struct sock *sk;
> > > > struct vsock_sock *vsk;
> > > > struct sockaddr_vm *vm_addr;
> > > > + struct sockaddr_vm_rcu *rcu_ptr;
> > > >
> > > > sk = sock->sk;
> > > > vsk = vsock_sk(sk);
> > > > @@ -951,11 +1123,17 @@ static int vsock_getname(struct socket *sock,
> > > > lock_sock(sk);
> > > >
> > > > if (peer) {
> > > > + rcu_read_lock();
> > > > if (sock->state != SS_CONNECTED) {
> > > > err = -ENOTCONN;
> > > > goto out;
> > > > }
> > > > - vm_addr = &vsk->remote_addr;
> > > > + rcu_ptr = rcu_dereference(vsk->remote_addr);
> > > > + if (!rcu_ptr) {
> > > > + err = -EINVAL;
> > > > + goto out;
> > > > + }
> > > > + vm_addr = &rcu_ptr->addr;
> > > > } else {
> > > > vm_addr = &vsk->local_addr;
> > > > }
> > > > @@ -975,6 +1153,8 @@ static int vsock_getname(struct socket *sock,
> > > > err = sizeof(*vm_addr);
> > > >
> > > > out:
> > > > + if (peer)
> > > > + rcu_read_unlock();
> > > > release_sock(sk);
> > > > return err;
> > > > }
> > > > @@ -1161,7 +1341,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > > int err;
> > > > struct sock *sk;
> > > > struct vsock_sock *vsk;
> > > > - struct sockaddr_vm *remote_addr;
> > > > + struct sockaddr_vm stack_addr, *remote_addr;
> > > > const struct vsock_transport *transport;
> > > >
> > > > if (msg->msg_flags & MSG_OOB)
> > > > @@ -1172,15 +1352,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > > sk = sock->sk;
> > > > vsk = vsock_sk(sk);
> > > >
> > > > - lock_sock(sk);
> > > > + /* If auto-binding is required, acquire the slock to avoid potential
> > > > + * race conditions. Otherwise, do not acquire the lock.
> > > > + *
> > > > + * We know that the first check of local_addr is racy (indicated by
> > > > + * data_race()). By acquiring the lock and then subsequently checking
> > > > + * again if local_addr is bound (inside vsock_auto_bind()), we can
> > > > + * ensure there are no real data races.
> > > > + *
> > > > + * This technique is borrowed by inet_send_prepare().
> > > > + */
> > > > + if (data_race(!vsock_addr_bound(&vsk->local_addr))) {
> > > > + lock_sock(sk);
> > > > + err = vsock_auto_bind(vsk);
> > > > + release_sock(sk);
> > > > + if (err)
> > > > + return err;
> > > > + }
> > > >
> > > > transport = vsk->transport;
> > > >
> > > > - err = vsock_auto_bind(vsk);
> > > > - if (err)
> > > > - goto out;
> > > > -
> > > > -
> > > > /* If the provided message contains an address, use that. Otherwise
> > > > * fall back on the socket's remote handle (if it has been connected).
> > > > */
> > > > @@ -1199,18 +1390,26 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > > goto out;
> > > > }
> > > > } else if (sock->state == SS_CONNECTED) {
> > > > - remote_addr = &vsk->remote_addr;
> > > > + err = vsock_remote_addr_copy(vsk, &stack_addr);
> > > > + if (err < 0)
> > > > + goto out;
> > > >
> > > > - if (remote_addr->svm_cid == VMADDR_CID_ANY)
> > > > - remote_addr->svm_cid = transport->get_local_cid();
> > > > + if (stack_addr.svm_cid == VMADDR_CID_ANY) {
> > > > + stack_addr.svm_cid = transport->get_local_cid();
> > > > + lock_sock(sk_vsock(vsk));
> > > > + vsock_remote_addr_update(vsk, &stack_addr);
> > > > + release_sock(sk_vsock(vsk));
> > > > + }
> > > >
> > > > /* XXX Should connect() or this function ensure remote_addr is
> > > > * bound?
> > > > */
> > > > - if (!vsock_addr_bound(&vsk->remote_addr)) {
> > > > + if (!vsock_addr_bound(&stack_addr)) {
> > > > err = -EINVAL;
> > > > goto out;
> > > > }
> > > > +
> > > > + remote_addr = &stack_addr;
> > > > } else {
> > > > err = -EINVAL;
> > > > goto out;
> > > > @@ -1225,7 +1424,6 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> > > > err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
> > > >
> > > > out:
> > > > - release_sock(sk);
> > > > return err;
> > > > }
> > > >
> > > > @@ -1243,8 +1441,7 @@ static int vsock_dgram_connect(struct socket *sock,
> > > > err = vsock_addr_cast(addr, addr_len, &remote_addr);
> > > > if (err == -EAFNOSUPPORT && remote_addr->svm_family == AF_UNSPEC) {
> > > > lock_sock(sk);
> > > > - vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY,
> > > > - VMADDR_PORT_ANY);
> > > > + vsock_remote_addr_update_cid_port(vsk, VMADDR_CID_ANY, VMADDR_PORT_ANY);
> > > > sock->state = SS_UNCONNECTED;
> > > > release_sock(sk);
> > > > return 0;
> > > > @@ -1263,7 +1460,10 @@ static int vsock_dgram_connect(struct socket *sock,
> > > > goto out;
> > > > }
> > > >
> > > > - memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> > > > + err = vsock_remote_addr_update(vsk, remote_addr);
> > > > + if (err < 0)
> > > > + goto out;
> > > > +
> > > > sock->state = SS_CONNECTED;
> > > >
> > > > /* sock map disallows redirection of non-TCP sockets with sk_state !=
> > > > @@ -1399,8 +1599,9 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
> > > > }
> > > >
> > > > /* Set the remote address that we are connecting to. */
> > > > - memcpy(&vsk->remote_addr, remote_addr,
> > > > - sizeof(vsk->remote_addr));
> > > > + err = vsock_remote_addr_update(vsk, remote_addr);
> > > > + if (err)
> > > > + goto out;
> > > >
> > > > err = vsock_assign_transport(vsk, NULL);
> > > > if (err)
> > > > @@ -1831,7 +2032,7 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> > > > goto out;
> > > > }
> > > >
> > > > - if (!vsock_addr_bound(&vsk->remote_addr)) {
> > > > + if (!vsock_remote_addr_bound(vsk)) {
> > > > err = -EDESTADDRREQ;
> > > > goto out;
> > > > }
> > > > diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
> > > > index a2823b1c5e28..f843bae86b32 100644
> > > > --- a/net/vmw_vsock/diag.c
> > > > +++ b/net/vmw_vsock/diag.c
> > > > @@ -15,8 +15,14 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> > > > u32 portid, u32 seq, u32 flags)
> > > > {
> > > > struct vsock_sock *vsk = vsock_sk(sk);
> > > > + struct sockaddr_vm remote_addr;
> > > > struct vsock_diag_msg *rep;
> > > > struct nlmsghdr *nlh;
> > > > + int err;
> > > > +
> > > > + err = vsock_remote_addr_copy(vsk, &remote_addr);
> > > > + if (err < 0)
> > > > + return err;
> > > >
> > > > nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
> > > > flags);
> > > > @@ -36,8 +42,8 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> > > > rep->vdiag_shutdown = sk->sk_shutdown;
> > > > rep->vdiag_src_cid = vsk->local_addr.svm_cid;
> > > > rep->vdiag_src_port = vsk->local_addr.svm_port;
> > > > - rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
> > > > - rep->vdiag_dst_port = vsk->remote_addr.svm_port;
> > > > + rep->vdiag_dst_cid = remote_addr.svm_cid;
> > > > + rep->vdiag_dst_port = remote_addr.svm_port;
> > > > rep->vdiag_ino = sock_i_ino(sk);
> > > >
> > > > sock_diag_save_cookie(sk, rep->vdiag_cookie);
> > > > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > > > index 7cb1a9d2cdb4..462b2ec3e6e9 100644
> > > > --- a/net/vmw_vsock/hyperv_transport.c
> > > > +++ b/net/vmw_vsock/hyperv_transport.c
> > > > @@ -336,9 +336,11 @@ static void hvs_open_connection(struct vmbus_channel *chan)
> > > > hvs_addr_init(&vnew->local_addr, if_type);
> > > >
> > > > /* Remote peer is always the host */
> > > > - vsock_addr_init(&vnew->remote_addr,
> > > > - VMADDR_CID_HOST, VMADDR_PORT_ANY);
> > > > - vnew->remote_addr.svm_port = get_port_by_srv_id(if_instance);
> > > > + ret = vsock_remote_addr_update_cid_port(vnew, VMADDR_CID_HOST,
> > > > + get_port_by_srv_id(if_instance));
> > > > + if (ret < 0)
> > > > + goto out;
> > > > +
> > > > ret = vsock_assign_transport(vnew, vsock_sk(sk));
> > > > /* Transport assigned (looking at remote_addr) must be the
> > > > * same where we received the request.
> > > > @@ -459,13 +461,18 @@ static int hvs_connect(struct vsock_sock *vsk)
> > > > {
> > > > union hvs_service_id vm, host;
> > > > struct hvsock *h = vsk->trans;
> > > > + int err;
> > > >
> > > > vm.srv_id = srv_id_template;
> > > > vm.svm_port = vsk->local_addr.svm_port;
> > > > h->vm_srv_id = vm.srv_id;
> > > >
> > > > host.srv_id = srv_id_template;
> > > > - host.svm_port = vsk->remote_addr.svm_port;
> > > > +
> > > > + err = vsock_remote_addr_port(vsk, &host.svm_port);
> > > > + if (err < 0)
> > > > + return err;
> > > > +
> > > > h->host_srv_id = host.srv_id;
> > > >
> > > > return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
> > > > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > > > index 925acface893..1b87704e516a 100644
> > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > @@ -258,8 +258,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > > src_cid = t_ops->transport.get_local_cid();
> > > > src_port = vsk->local_addr.svm_port;
> > > > if (!info->remote_cid) {
> > > > - dst_cid = vsk->remote_addr.svm_cid;
> > > > - dst_port = vsk->remote_addr.svm_port;
> > > > + ret = vsock_remote_addr_cid_port(vsk, &dst_cid, &dst_port);
> > > > + if (ret < 0)
> > > > + return ret;
> > > > } else {
> > > > dst_cid = info->remote_cid;
> > > > dst_port = info->remote_port;
> > > > @@ -1169,7 +1170,9 @@ virtio_transport_recv_connecting(struct sock *sk,
> > > > case VIRTIO_VSOCK_OP_RESPONSE:
> > > > sk->sk_state = TCP_ESTABLISHED;
> > > > sk->sk_socket->state = SS_CONNECTED;
> > > > - vsock_insert_connected(vsk);
> > > > + err = vsock_insert_connected(vsk);
> > > > + if (err)
> > > > + goto destroy;
> > > > sk->sk_state_change(sk);
> > > > break;
> > > > case VIRTIO_VSOCK_OP_INVALID:
> > > > @@ -1403,9 +1406,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> > > > vchild = vsock_sk(child);
> > > > vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
> > > > le32_to_cpu(hdr->dst_port));
> > > > - vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
> > > > - le32_to_cpu(hdr->src_port));
> > > > -
> > > > + vsock_remote_addr_update_cid_port(vchild, le64_to_cpu(hdr->src_cid),
> > > > + le32_to_cpu(hdr->src_port));
> > > > ret = vsock_assign_transport(vchild, vsk);
> > > > /* Transport assigned (looking at remote_addr) must be the same
> > > > * where we received the request.
> > > > @@ -1420,7 +1422,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> > > > if (virtio_transport_space_update(child, skb))
> > > > child->sk_write_space(child);
> > > >
> > > > - vsock_insert_connected(vchild);
> > > > + ret = vsock_insert_connected(vchild);
> > > > + if (ret) {
> > > > + release_sock(child);
> > > > + virtio_transport_reset_no_sock(t, skb);
> > > > + sock_put(child);
> > > > + return ret;
> > > > + }
> > > > vsock_enqueue_accept(sk, child);
> > > > virtio_transport_send_response(vchild, skb);
> > > >
> > > > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> > > > index b370070194fa..c0c445e7d925 100644
> > > > --- a/net/vmw_vsock/vmci_transport.c
> > > > +++ b/net/vmw_vsock/vmci_transport.c
> > > > @@ -283,18 +283,25 @@ vmci_transport_send_control_pkt(struct sock *sk,
> > > > u16 proto,
> > > > struct vmci_handle handle)
> > > > {
> > > > + struct sockaddr_vm addr_stack;
> > > > + struct sockaddr_vm *remote_addr = &addr_stack;
> > > > struct vsock_sock *vsk;
> > > > + int err;
> > > >
> > > > vsk = vsock_sk(sk);
> > > >
> > > > if (!vsock_addr_bound(&vsk->local_addr))
> > > > return -EINVAL;
> > > >
> > > > - if (!vsock_addr_bound(&vsk->remote_addr))
> > > > + if (!vsock_remote_addr_bound(vsk))
> > > > return -EINVAL;
> > > >
> > > > + err = vsock_remote_addr_copy(vsk, &addr_stack);
> > > > + if (err < 0)
> > > > + return err;
> > > > +
> > > > return vmci_transport_alloc_send_control_pkt(&vsk->local_addr,
> > > > - &vsk->remote_addr,
> > > > + remote_addr,
> > > > type, size, mode,
> > > > wait, proto, handle);
> > > > }
> > > > @@ -317,6 +324,7 @@ static int vmci_transport_send_reset(struct sock *sk,
> > > > struct sockaddr_vm *dst_ptr;
> > > > struct sockaddr_vm dst;
> > > > struct vsock_sock *vsk;
> > > > + int err;
> > > >
> > > > if (pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST)
> > > > return 0;
> > > > @@ -326,13 +334,16 @@ static int vmci_transport_send_reset(struct sock *sk,
> > > > if (!vsock_addr_bound(&vsk->local_addr))
> > > > return -EINVAL;
> > > >
> > > > - if (vsock_addr_bound(&vsk->remote_addr)) {
> > > > - dst_ptr = &vsk->remote_addr;
> > > > + if (vsock_remote_addr_bound(vsk)) {
> > > > + err = vsock_remote_addr_copy(vsk, &dst);
> > > > + if (err < 0)
> > > > + return err;
> > > > } else {
> > > > vsock_addr_init(&dst, pkt->dg.src.context,
> > > > pkt->src_port);
> > > > - dst_ptr = &dst;
> > > > }
> > > > + dst_ptr = &dst;
> > > > +
> > > > return vmci_transport_alloc_send_control_pkt(&vsk->local_addr, dst_ptr,
> > > > VMCI_TRANSPORT_PACKET_TYPE_RST,
> > > > 0, 0, NULL, VSOCK_PROTO_INVALID,
> > > > @@ -490,7 +501,7 @@ static struct sock *vmci_transport_get_pending(
> > > >
> > > > list_for_each_entry(vpending, &vlistener->pending_links,
> > > > pending_links) {
> > > > - if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
> > > > + if (vsock_remote_addr_equals(vpending, &src) &&
> > > > pkt->dst_port == vpending->local_addr.svm_port) {
> > > > pending = sk_vsock(vpending);
> > > > sock_hold(pending);
> > > > @@ -1015,8 +1026,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
> > > >
> > > > vsock_addr_init(&vpending->local_addr, pkt->dg.dst.context,
> > > > pkt->dst_port);
> > > > - vsock_addr_init(&vpending->remote_addr, pkt->dg.src.context,
> > > > - pkt->src_port);
> > > > + vsock_remote_addr_update_cid_port(vpending, pkt->dg.src.context,
> > > > + pkt->src_port);
> > > >
> > > > err = vsock_assign_transport(vpending, vsock_sk(sk));
> > > > /* Transport assigned (looking at remote_addr) must be the same
> > > > @@ -1133,6 +1144,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> > > > {
> > > > struct vsock_sock *vpending;
> > > > struct vmci_handle handle;
> > > > + unsigned int vpending_remote_cid;
> > > > struct vmci_qp *qpair;
> > > > bool is_local;
> > > > u32 flags;
> > > > @@ -1189,8 +1201,13 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> > > > /* vpending->local_addr always has a context id so we do not need to
> > > > * worry about VMADDR_CID_ANY in this case.
> > > > */
> > > > - is_local =
> > > > - vpending->remote_addr.svm_cid == vpending->local_addr.svm_cid;
> > > > + err = vsock_remote_addr_cid(vpending, &vpending_remote_cid);
> > > > + if (err < 0) {
> > > > + skerr = EPROTO;
> > > > + goto destroy;
> > > > + }
> > > > +
> > > > + is_local = vpending_remote_cid == vpending->local_addr.svm_cid;
> > > > flags = VMCI_QPFLAG_ATTACH_ONLY;
> > > > flags |= is_local ? VMCI_QPFLAG_LOCAL : 0;
> > > >
> > > > @@ -1203,7 +1220,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
> > > > flags,
> > > > vmci_transport_is_trusted(
> > > > vpending,
> > > > - vpending->remote_addr.svm_cid));
> > > > + vpending_remote_cid));
> > > > if (err < 0) {
> > > > vmci_transport_send_reset(pending, pkt);
> > > > skerr = -err;
> > > > @@ -1306,9 +1323,20 @@ vmci_transport_recv_connecting_client(struct sock *sk,
> > > > break;
> > > > case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE:
> > > > case VMCI_TRANSPORT_PACKET_TYPE_NEGOTIATE2:
> > > > + struct sockaddr_vm_rcu *remote_addr;
> > > > +
> > > > + rcu_read_lock();
> > > > + remote_addr = rcu_dereference(vsk->remote_addr);
> > > > + if (!remote_addr) {
> > > > + skerr = EPROTO;
> > > > + err = -EINVAL;
> > > > + rcu_read_unlock();
> > > > + goto destroy;
> > > > + }
> > > > +
> > > > if (pkt->u.size == 0
> > > > - || pkt->dg.src.context != vsk->remote_addr.svm_cid
> > > > - || pkt->src_port != vsk->remote_addr.svm_port
> > > > + || pkt->dg.src.context != remote_addr->addr.svm_cid
> > > > + || pkt->src_port != remote_addr->addr.svm_port
> > > > || !vmci_handle_is_invalid(vmci_trans(vsk)->qp_handle)
> > > > || vmci_trans(vsk)->qpair
> > > > || vmci_trans(vsk)->produce_size != 0
> > > > @@ -1316,9 +1344,10 @@ vmci_transport_recv_connecting_client(struct sock *sk,
> > > > || vmci_trans(vsk)->detach_sub_id != VMCI_INVALID_ID) {
> > > > skerr = EPROTO;
> > > > err = -EINVAL;
> > > > -
> > > > + rcu_read_unlock();
> > > > goto destroy;
> > > > }
> > > > + rcu_read_unlock();
> > > >
> > > > err = vmci_transport_recv_connecting_client_negotiate(sk, pkt);
> > > > if (err) {
> > > > @@ -1379,6 +1408,7 @@ static int vmci_transport_recv_connecting_client_negotiate(
> > > > int err;
> > > > struct vsock_sock *vsk;
> > > > struct vmci_handle handle;
> > > > + unsigned int remote_cid;
> > > > struct vmci_qp *qpair;
> > > > u32 detach_sub_id;
> > > > bool is_local;
> > > > @@ -1449,19 +1479,23 @@ static int vmci_transport_recv_connecting_client_negotiate(
> > > >
> > > > /* Make VMCI select the handle for us. */
> > > > handle = VMCI_INVALID_HANDLE;
> > > > - is_local = vsk->remote_addr.svm_cid == vsk->local_addr.svm_cid;
> > > > +
> > > > + err = vsock_remote_addr_cid(vsk, &remote_cid);
> > > > + if (err < 0)
> > > > + goto destroy;
> > > > +
> > > > + is_local = remote_cid == vsk->local_addr.svm_cid;
> > > > flags = is_local ? VMCI_QPFLAG_LOCAL : 0;
> > > >
> > > > err = vmci_transport_queue_pair_alloc(&qpair,
> > > > &handle,
> > > > pkt->u.size,
> > > > pkt->u.size,
> > > > - vsk->remote_addr.svm_cid,
> > > > + remote_cid,
> > > > flags,
> > > > vmci_transport_is_trusted(
> > > > vsk,
> > > > - vsk->
> > > > - remote_addr.svm_cid));
> > > > + remote_cid));
> > > > if (err < 0)
> > > > goto destroy;
> > > >
> > > >
> > > > --
> > > > 2.30.2
> > > >
> > >
> >
>
>
> _______________________________________________
> Virtualization mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
On Sat, Apr 15, 2023 at 05:29:12PM +0000, Bobby Eshleman wrote:
>On Fri, Apr 28, 2023 at 12:29:50PM +0200, Stefano Garzarella wrote:
>> On Sat, Apr 15, 2023 at 10:30:55AM +0000, Bobby Eshleman wrote:
>> > On Wed, Apr 19, 2023 at 11:30:53AM +0200, Stefano Garzarella wrote:
>> > > On Fri, Apr 14, 2023 at 12:25:59AM +0000, Bobby Eshleman wrote:
>> > > > Because the dgram sendmsg() path for AF_VSOCK acquires the socket lock
>> > > > it does not scale when many senders share a socket.
>> > > >
>> > > > Prior to this patch the socket lock is used to protect the local_addr,
>> > > > remote_addr, transport, and buffer size variables. What follows are the
>> > > > new protection schemes for the various protected fields that ensure a
>> > > > race-free multi-sender sendmsg() path for vsock dgrams.
>> > > >
>> > > > - local_addr
>> > > > local_addr changes as a result of binding a socket. The write path
>> > > > for local_addr is bind() and various vsock_auto_bind() call sites.
>> > > > After a socket has been bound via vsock_auto_bind() or bind(), subsequent
>> > > > calls to bind()/vsock_auto_bind() do not write to local_addr again. bind()
>> > > > rejects the user request and vsock_auto_bind() early exits.
>> > > > Therefore, the local addr can not change while a parallel thread is
>> > > > in sendmsg() and lock-free reads of local addr in sendmsg() are safe.
>> > > > Change: only acquire lock for auto-binding as-needed in sendmsg().
>> > > >
>> > > > - vsk->transport
>> > > > Updated upon socket creation and it doesn't change again until the
>> > >
>> > > This is true only for dgram, right?
>> > >
>> >
>> > Yes.
>> >
>> > > How do we decide which transport to assign for dgram?
>> > >
>> >
>> > The transport is assigned in proto->create() [vsock_create()]. It is
>> > assigned there *only* for dgrams, whereas for streams/seqpackets it is
>> > assigned in connect(). vsock_create() sets transport to
>> > 'transport_dgram' if sock->type == SOCK_DGRAM.
>> >
>> > vsock_sk_destruct() then eventually sets vsk->transport to NULL.
>> >
>> > Neither destruct nor create can occur in parallel with sendmsg().
>> > create() hasn't yet returned the sockfd for the user to call upon it
>> > sendmsg(), and AFAICT destruct is only called after the final socket
>> > reference is released, which only happens after the socket no longer
>> > exists in the fd lookup table and so sendmsg() will fail before it ever
>> > has the chance to race.
>>
>> This is okay if a socket can be assigned to a single transport, but with
>> dgrams I'm not sure we can do that.
>>
>> Since it is not connected, a socket can send or receive packets from
>> different transports, so maybe we can't assign it to a specific one,
>> but do a lookup for every packets to understand which transport to use.
>>
>
>Yes this is true, this lookup needs to be implemented... currently
>sendmsg() doesn't do this at all. It grabs the remote_addr when passed
>in from sendto(), but then just uses the same old transport from vsk.
>You are right that sendto() should be a per-packet lookup, not a
>vsk->transport read. Perhaps I should add that as another patch in this
>series, and make it precede this one?
Yep, I think so, we need to implement it before adding a new transport
that can support dgram.
>
>For the send() / sendto(NULL) case where vsk->transport is being read, I
>do believe this is still race-free, but...
>
>If we later support dynamic transports for datagram, such that
>connect(VMADDR_LOCAL_CID) sets vsk->transport to transport_loopback,
>connect(VMADDR_CID_HOST) sets vsk->transport to something like
>transport_datagram_g2h, etc..., then vsk->transport will need to be
>bundled into the RCU-protected pointer too, since it may change when
>remote_addr changes.
Yep, I think so. Although in vsock_dgram_connect we call lock_sock(), so
maybe that could be enough to protect us.
In general I think we should use vsk->transport if vsock_dgram_connect()
is called, or we need to do per-packet lookup.
Another think I would change, is the dgram_dequeue() callback.
I would remove it, and move in the core the code in
vmci_transport_dgram_dequeue() since it seems pretty generic.
This should work well if every transport uses sk_receive_skb() to
enqueue sk_buffs in the socket buffer.
Thanks,
Stefano
On Sat, Apr 15, 2023 at 03:55:05PM +0000, Bobby Eshleman wrote:
>On Fri, Apr 28, 2023 at 12:43:09PM +0200, Stefano Garzarella wrote:
>> On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote:
>> > CC'ing [email protected] because this thread is starting
>> > to touch the spec.
>> >
>> > On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote:
>> > > Hi Bobby,
>> > >
>> > > On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote:
>> > > > CC'ing Cong.
>> > > >
>> > > > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote:
>> > > > > Hey all!
>> > > > >
>> > > > > This series introduces support for datagrams to virtio/vsock.
>> > >
>> > > Great! Thanks for restarting this work!
>> > >
>> >
>> > No problem!
>> >
>> > > > >
>> > > > > It is a spin-off (and smaller version) of this series from the summer:
>> > > > > https://lore.kernel.org/all/[email protected]/
>> > > > >
>> > > > > Please note that this is an RFC and should not be merged until
>> > > > > associated changes are made to the virtio specification, which will
>> > > > > follow after discussion from this series.
>> > > > >
>> > > > > This series first supports datagrams in a basic form for virtio, and
>> > > > > then optimizes the sendpath for all transports.
>> > > > >
>> > > > > The result is a very fast datagram communication protocol that
>> > > > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
>> > > > > of multi-threaded workload samples.
>> > > > >
>> > > > > For those that are curious, some summary data comparing UDP and VSOCK
>> > > > > DGRAM (N=5):
>> > > > >
>> > > > > vCPUS: 16
>> > > > > virtio-net queues: 16
>> > > > > payload size: 4KB
>> > > > > Setup: bare metal + vm (non-nested)
>> > > > >
>> > > > > UDP: 287.59 MB/s
>> > > > > VSOCK DGRAM: 509.2 MB/s
>> > > > >
>> > > > > Some notes about the implementation...
>> > > > >
>> > > > > This datagram implementation forces datagrams to self-throttle according
>> > > > > to the threshold set by sk_sndbuf. It behaves similar to the credits
>> > > > > used by streams in its effect on throughput and memory consumption, but
>> > > > > it is not influenced by the receiving socket as credits are.
>> > >
>> > > So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right?
>> > >
>> >
>> > Correct.
>> >
>> > > We should check if VMCI behaves the same.
>> > >
>> > > > >
>> > > > > The device drops packets silently. There is room for improvement by
>> > > > > building into the device and driver some intelligence around how to
>> > > > > reduce frequency of kicking the virtqueue when packet loss is high. I
>> > > > > think there is a good discussion to be had on this.
>> > >
>> > > Can you elaborate a bit here?
>> > >
>> > > Do you mean some mechanism to report to the sender that a destination
>> > > (cid, port) is full so the packet will be dropped?
>> > >
>> >
>> > Correct. There is also the case of there being no receiver at all for
>> > this address since this case isn't rejected upon connect(). Ideally,
>> > such a socket (which will have 100% packet loss) will be throttled
>> > aggressively.
>> >
>> > Before we go down too far on this path, I also want to clarify that
>> > using UDP over vhost/virtio-net also has this property... this can be
>> > observed by using tcpdump to dump the UDP packets on the bridge network
>> > your VM is using. UDP packets sent to a garbage address can be seen on
>> > the host bridge (this is the nature of UDP, how is the host supposed to
>> > know the address eventually goes nowhere). I mention the above because I
>> > think it is possible for vsock to avoid this cost, given that it
>> > benefits from being point-to-point and g2h/h2g.
>> >
>> > If we're okay with vsock being on par, then the current series does
>> > that. I propose something below that can be added later and maybe
>> > negotiated as a feature bit too.
>>
>> I see and I agree on that, let's do it step by step.
>> If we can do it in the first phase is great, but I think is fine to add
>> this feature later.
>>
>> >
>> > > Can we adapt the credit mechanism?
>> > >
>> >
>> > I've thought about this a lot because the attraction of the approach for
>> > me would be that we could get the wait/buffer-limiting logic for free
>> > and without big changes to the protocol, but the problem is that the
>> > unreliable nature of datagrams means that the source's free-running
>> > tx_cnt will become out-of-sync with the destination's fwd_cnt upon
>> > packet loss.
>>
>> We need to understand where the packet can be lost.
>> If the packet always reaches the destination (vsock driver or device),
>> we can discard it, but also update the counters.
>>
>> >
>> > Imagine a source that initializes and starts sending packets before a
>> > destination socket even is created, the source's self-throttling will be
>> > dysfunctional because its tx_cnt will always far exceed the
>> > destination's fwd_cnt.
>>
>> Right, the other problem I see is that the socket aren't connected, so
>> we have 1-N relationship.
>>
>
>Oh yeah, good point.
>
>> >
>> > We could play tricks with the meaning of the CREDIT_UPDATE message and
>> > fwd_cnt/buf_alloc fields, but I don't think we want to go down that
>> > path.
>> >
>> > I think that the best and simplest approach introduces a congestion
>> > notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the
>> > destination sends this notification. At a given repeated time period T,
>> > the source can check if it has received any notifications in the last T.
>> > If so, it halves its buffer allocation. If not, it doubles its buffer
>> > allocation unless it is already at its max or original value.
>> >
>> > An "invalid" socket which never has any receiver will converge towards a
>> > rate limit of one packet per time T * log2(average pkt size). That is, a
>> > socket with 100% packet loss will only be able to send 16 bytes every
>> > 4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within
>> > 160ms given at least one packet sent per 5ms. I have no idea if that is
>> > a reasonable default T for vsock, I just pulled it out of a hat for the
>> > sake of the example.
>> >
>> > "Normal" sockets will be responsive to high loss and rebalance during
>> > low loss. The source is trying to guess and converge on the actual
>> > buffer state of the destination.
>> >
>> > This would reuse the already-existing throttling mechanisms that
>> > throttle based upon buffer allocation. The usage of sk_sndbuf would have
>> > to be re-worked. The application using sendmsg() will see EAGAIN when
>> > throttled, or just sleep if !MSG_DONTWAIT.
>>
>> I see, it looks interesting, but I think we need to share that
>> information between multiple sockets, since the same destination
>> (cid, port), can be reached by multiple sockets.
>>
>
>Good point, that is true.
>
>> Another approach could be to have both congestion notification and
>> decongestion, but maybe it produces double traffic.
>>
>
>I think this could simplify things and could reduce noise. It is also
>probably sufficient for the source to simply halt upon congestion
>notification and resume upon decongestion notification, instead of
>scaling up and down like I suggested above. It also avoids the
>burstiness that would occur with a "congestion notification"-only
>approach where the source guesses when to resume and guesses wrong.
>
>The congestion notification may want to have an expiration period after
>which the sender can resume without receiving a decongestion
>notification? If it receives congestion again, then it can halt again.
Yep, I agree.
Anyway the congestion/decongestion messages should be just a hint,
because the other peer has to keep the state and a malicious host/guest
could use it for DoS, so the peer could discard these packets if it has
no more space to save the state.
Thanks,
Stefano