2016-10-07 10:02:32

by Stefan Hajnoczi

Subject: [PATCH v2 00/10] NFS: add AF_VSOCK support to NFS client

This patch series enables AF_VSOCK address family support in the NFS client.
You can also get the commits from the vsock-nfs branch at
https://github.com/stefanha/linux.git.

The AF_VSOCK address family provides dgram and stream socket communication
between virtual machines and hypervisors. VMware VMCI and virtio (for KVM)
transports are available, see net/vmw_vsock.

The goal of this work is sharing files between virtual machines and
hypervisors. AF_VSOCK is well-suited to this because it requires no
configuration inside the virtual machine, making it simple to manage and
reliable.

Why NFS over AF_VSOCK?
----------------------
It is unusual to add a new NFS transport; only TCP, RDMA, and UDP are currently
supported. Here is the rationale for adding AF_VSOCK.

Sharing files with a virtual machine can be configured manually:
1. Add a dedicated network card to the virtual machine. It will be used for
NFS traffic.
2. Configure a local subnet and assign IP addresses to the virtual machine and
hypervisor.
3. Configure an NFS export on the hypervisor and start the NFS server.
4. Mount the export inside the virtual machine.

Automating these steps poses a problem: modifying network configuration inside
the virtual machine is invasive. It's hard to add a network interface to an
arbitrary running system in an automated fashion, considering the network
management tools, firewall rules, IP address usage, etc.

Furthermore, the user may disrupt file sharing by accident when they add
firewall rules, restart networking, etc., because the NFS network interface is
visible alongside the network interfaces managed by the user.

AF_VSOCK is a zero-configuration network transport that avoids these problems.
Adding it to a virtual machine is non-invasive. It also avoids accidental
misconfiguration by the user. This is why "guest agents" and other services in
various hypervisors (KVM, Xen, VMware, VirtualBox) do not use regular network
interfaces.

For the same reasons, AF_VSOCK is appropriate for providing shared files as a
hypervisor service.

The approach in this series
---------------------------
AF_VSOCK stream sockets can be used for NFSv4.1 in much the same way as TCP.
Since SOCK_STREAM semantics are present, RFC 1831 record fragments delimit
messages. The backchannel shares the connection just like the default TCP
configuration.

Addresses are <Context ID, Port Number> pairs. These patches use the
"vsock:<cid>" string representation to distinguish AF_VSOCK addresses from
IPv4 and IPv6 numeric addresses.

The following example mounts /export from the hypervisor (CID 2) inside the
virtual machine (CID 3):

# /sbin/mount.nfs 2:/export /mnt -o clientaddr=3,proto=vsock

Please see the nfs-utils patch series I have just sent to
[email protected] for the necessary patches.

Status
------
The virtio-vsock transport was merged in Linux 4.8 and the vhost-vsock-pci
device is available in QEMU git master. This means the underlying AF_VSOCK
transport for KVM is now available upstream.

I have begun work on nfsd support in the kernel and nfs-utils. This is not
complete yet and will be sent as a separate patch series.

Stefan Hajnoczi (10):
SUNRPC: add AF_VSOCK support to addr.[ch]
SUNRPC: rename "TCP" record parser to "stream" parser
SUNRPC: abstract tcp_read_sock() in record fragment parser
SUNRPC: extract xs_stream_reset_state()
VSOCK: add tcp_read_sock()-like vsock_read_sock() function
SUNRPC: add AF_VSOCK support to xprtsock.c
SUNRPC: drop unnecessary svc_bc_tcp_create() helper
SUNRPC: add AF_VSOCK support to svc_xprt.c
SUNRPC: add AF_VSOCK backchannel support
NFS: add AF_VSOCK support to NFS client

drivers/vhost/vsock.c | 1 +
fs/nfs/client.c | 2 +
fs/nfs/super.c | 11 +-
include/linux/sunrpc/addr.h | 44 ++
include/linux/sunrpc/svc_xprt.h | 12 +
include/linux/sunrpc/xprt.h | 1 +
include/linux/sunrpc/xprtsock.h | 36 +-
include/linux/virtio_vsock.h | 4 +
include/net/af_vsock.h | 5 +
include/trace/events/sunrpc.h | 28 +-
net/sunrpc/Kconfig | 10 +
net/sunrpc/addr.c | 57 +++
net/sunrpc/svc_xprt.c | 18 +
net/sunrpc/svcsock.c | 48 ++-
net/sunrpc/xprtsock.c | 703 +++++++++++++++++++++++++-------
net/vmw_vsock/af_vsock.c | 16 +
net/vmw_vsock/virtio_transport.c | 1 +
net/vmw_vsock/virtio_transport_common.c | 66 +++
net/vmw_vsock/vmci_transport.c | 8 +
19 files changed, 880 insertions(+), 191 deletions(-)

--
2.7.4



2016-10-07 10:05:17

by Stefan Hajnoczi

Subject: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

AF_VSOCK addresses are a Context ID (CID) and port number tuple. The
CID is a unique address, similar to an IP address on a local subnet.

Extend the addr.h functions to handle AF_VSOCK addresses.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
v2:
* Replace CONFIG_VSOCKETS with CONFIG_SUNRPC_XPRT_VSOCK to prevent
build failures when SUNRPC=y and VSOCKETS=m. Built-in code cannot
link against code in a module.
---
include/linux/sunrpc/addr.h | 44 ++++++++++++++++++++++++++++++++++
net/sunrpc/Kconfig | 10 ++++++++
net/sunrpc/addr.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 111 insertions(+)

diff --git a/include/linux/sunrpc/addr.h b/include/linux/sunrpc/addr.h
index 5c9c6cd..c4169bc 100644
--- a/include/linux/sunrpc/addr.h
+++ b/include/linux/sunrpc/addr.h
@@ -10,6 +10,7 @@
#include <linux/socket.h>
#include <linux/in.h>
#include <linux/in6.h>
+#include <linux/vm_sockets.h>
#include <net/ipv6.h>

size_t rpc_ntop(const struct sockaddr *, char *, const size_t);
@@ -26,6 +27,8 @@ static inline unsigned short rpc_get_port(const struct sockaddr *sap)
return ntohs(((struct sockaddr_in *)sap)->sin_port);
case AF_INET6:
return ntohs(((struct sockaddr_in6 *)sap)->sin6_port);
+ case AF_VSOCK:
+ return ((struct sockaddr_vm *)sap)->svm_port;
}
return 0;
}
@@ -40,6 +43,9 @@ static inline void rpc_set_port(struct sockaddr *sap,
case AF_INET6:
((struct sockaddr_in6 *)sap)->sin6_port = htons(port);
break;
+ case AF_VSOCK:
+ ((struct sockaddr_vm *)sap)->svm_port = port;
+ break;
}
}

@@ -106,6 +112,40 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
}
#endif /* !(IS_ENABLED(CONFIG_IPV6) */

+#if IS_ENABLED(CONFIG_VSOCKETS)
+static inline bool rpc_cmp_vsock_addr(const struct sockaddr *sap1,
+ const struct sockaddr *sap2)
+{
+ const struct sockaddr_vm *svm1 = (const struct sockaddr_vm *)sap1;
+ const struct sockaddr_vm *svm2 = (const struct sockaddr_vm *)sap2;
+
+ return svm1->svm_cid == svm2->svm_cid;
+}
+
+static inline bool __rpc_copy_vsock_addr(struct sockaddr *dst,
+ const struct sockaddr *src)
+{
+ const struct sockaddr_vm *ssvm = (const struct sockaddr_vm *)src;
+ struct sockaddr_vm *dsvm = (struct sockaddr_vm *)dst;
+
+ dsvm->svm_family = ssvm->svm_family;
+ dsvm->svm_cid = ssvm->svm_cid;
+ return true;
+}
+#else /* !(IS_ENABLED(CONFIG_VSOCKETS) */
+static inline bool rpc_cmp_vsock_addr(const struct sockaddr *sap1,
+ const struct sockaddr *sap2)
+{
+ return false;
+}
+
+static inline bool __rpc_copy_vsock_addr(struct sockaddr *dst,
+ const struct sockaddr *src)
+{
+ return false;
+}
+#endif /* !(IS_ENABLED(CONFIG_VSOCKETS) */
+
/**
* rpc_cmp_addr - compare the address portion of two sockaddrs.
* @sap1: first sockaddr
@@ -125,6 +165,8 @@ static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
return rpc_cmp_addr4(sap1, sap2);
case AF_INET6:
return rpc_cmp_addr6(sap1, sap2);
+ case AF_VSOCK:
+ return rpc_cmp_vsock_addr(sap1, sap2);
}
}
return false;
@@ -161,6 +203,8 @@ static inline bool rpc_copy_addr(struct sockaddr *dst,
return __rpc_copy_addr4(dst, src);
case AF_INET6:
return __rpc_copy_addr6(dst, src);
+ case AF_VSOCK:
+ return __rpc_copy_vsock_addr(dst, src);
}
return false;
}
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 04ce2c0..d18fc1a 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -61,3 +61,13 @@ config SUNRPC_XPRT_RDMA

If unsure, or you know there is no RDMA capability on your
hardware platform, say N.
+
+config SUNRPC_XPRT_VSOCK
+ bool "RPC-over-AF_VSOCK transport"
+ depends on SUNRPC && VSOCKETS && !(SUNRPC=y && VSOCKETS=m)
+ default SUNRPC && VSOCKETS
+ help
+ This option allows the NFS client and server to use the AF_VSOCK
+ transport to communicate between virtual machines and the host.
+
+ If unsure, say Y.
diff --git a/net/sunrpc/addr.c b/net/sunrpc/addr.c
index 2e0a6f9..f4dd962 100644
--- a/net/sunrpc/addr.c
+++ b/net/sunrpc/addr.c
@@ -16,11 +16,14 @@
* RFC 4291, Section 2.2 for details on IPv6 presentation formats.
*/

+ /* TODO register netid and uaddr with IANA? (See RFC 5665 5.1/5.2) */
+
#include <net/ipv6.h>
#include <linux/sunrpc/addr.h>
#include <linux/sunrpc/msg_prot.h>
#include <linux/slab.h>
#include <linux/export.h>
+#include <linux/vm_sockets.h>

#if IS_ENABLED(CONFIG_IPV6)

@@ -108,6 +111,26 @@ static size_t rpc_ntop6(const struct sockaddr *sap,

#endif /* !IS_ENABLED(CONFIG_IPV6) */

+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+
+static size_t rpc_ntop_vsock(const struct sockaddr *sap,
+ char *buf, const size_t buflen)
+{
+ const struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
+
+ return snprintf(buf, buflen, "%u", svm->svm_cid);
+}
+
+#else /* !CONFIG_SUNRPC_XPRT_VSOCK */
+
+static size_t rpc_ntop_vsock(const struct sockaddr *sap,
+ char *buf, const size_t buflen)
+{
+ return 0;
+}
+
+#endif /* !CONFIG_SUNRPC_XPRT_VSOCK */
+
static int rpc_ntop4(const struct sockaddr *sap,
char *buf, const size_t buflen)
{
@@ -132,6 +155,8 @@ size_t rpc_ntop(const struct sockaddr *sap, char *buf, const size_t buflen)
return rpc_ntop4(sap, buf, buflen);
case AF_INET6:
return rpc_ntop6(sap, buf, buflen);
+ case AF_VSOCK:
+ return rpc_ntop_vsock(sap, buf, buflen);
}

return 0;
@@ -229,6 +254,34 @@ static size_t rpc_pton6(struct net *net, const char *buf, const size_t buflen,
}
#endif

+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
+ struct sockaddr *sap, const size_t salen)
+{
+ const size_t prefix_len = strlen("vsock:");
+ struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
+ unsigned int cid;
+
+ if (strncmp(buf, "vsock:", prefix_len) != 0 ||
+ salen < sizeof(struct sockaddr_vm))
+ return 0;
+
+ if (kstrtouint(buf + prefix_len, 10, &cid) != 0)
+ return 0;
+
+ memset(svm, 0, sizeof(struct sockaddr_vm));
+ svm->svm_family = AF_VSOCK;
+ svm->svm_cid = cid;
+ return sizeof(struct sockaddr_vm);
+}
+#else
+static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
+ struct sockaddr *sap, const size_t salen)
+{
+ return 0;
+}
+#endif
+
/**
* rpc_pton - Construct a sockaddr in @sap
* @net: applicable network namespace
@@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
{
unsigned int i;

+ /* TODO is there a nicer way to distinguish vsock addresses? */
+ if (strncmp(buf, "vsock:", 6) == 0)
+ return rpc_pton_vsock(buf, buflen, sap, salen);
+
for (i = 0; i < buflen; i++)
if (buf[i] == ':')
return rpc_pton6(net, buf, buflen, sap, salen);
--
2.7.4


2016-10-07 10:05:53

by Stefan Hajnoczi

Subject: [PATCH v2 02/10] SUNRPC: rename "TCP" record parser to "stream" parser

The TCP record parser is really an RFC 1831 record fragment parser.
There is nothing TCP protocol-specific about parsing record fragments.
The parser can be reused for any SOCK_STREAM socket.

This patch renames functions and fields but xs_stream_data_ready() still
calls tcp_read_sock(). This is addressed in the next patch.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/xprtsock.h | 31 ++---
include/trace/events/sunrpc.h | 28 ++---
net/sunrpc/xprtsock.c | 264 ++++++++++++++++++++--------------------
3 files changed, 162 insertions(+), 161 deletions(-)

diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index bef3fb0..db4c88c 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -27,17 +27,18 @@ struct sock_xprt {
struct sock * inet;

/*
- * State of TCP reply receive
+ * State of SOCK_STREAM reply receive
*/
- __be32 tcp_fraghdr,
- tcp_xid,
- tcp_calldir;
+ __be32 stream_fraghdr,
+ stream_xid,
+ stream_calldir;

- u32 tcp_offset,
- tcp_reclen;
+ u32 stream_offset,
+ stream_reclen;
+
+ unsigned long stream_copied,
+ stream_flags;

- unsigned long tcp_copied,
- tcp_flags;

/*
* Connection of transports
@@ -67,17 +68,17 @@ struct sock_xprt {
/*
* TCP receive state flags
*/
-#define TCP_RCV_LAST_FRAG (1UL << 0)
-#define TCP_RCV_COPY_FRAGHDR (1UL << 1)
-#define TCP_RCV_COPY_XID (1UL << 2)
-#define TCP_RCV_COPY_DATA (1UL << 3)
-#define TCP_RCV_READ_CALLDIR (1UL << 4)
-#define TCP_RCV_COPY_CALLDIR (1UL << 5)
+#define STREAM_RCV_LAST_FRAG (1UL << 0)
+#define STREAM_RCV_COPY_FRAGHDR (1UL << 1)
+#define STREAM_RCV_COPY_XID (1UL << 2)
+#define STREAM_RCV_COPY_DATA (1UL << 3)
+#define STREAM_RCV_READ_CALLDIR (1UL << 4)
+#define STREAM_RCV_COPY_CALLDIR (1UL << 5)

/*
* TCP RPC flags
*/
-#define TCP_RPC_REPLY (1UL << 6)
+#define STREAM_RPC_REPLY (1UL << 6)

#define XPRT_SOCK_CONNECTING 1U
#define XPRT_SOCK_DATA_READY (2)
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 8a707f8..916a30b 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -400,15 +400,15 @@ TRACE_EVENT(xs_tcp_data_ready,

#define rpc_show_sock_xprt_flags(flags) \
__print_flags(flags, "|", \
- { TCP_RCV_LAST_FRAG, "TCP_RCV_LAST_FRAG" }, \
- { TCP_RCV_COPY_FRAGHDR, "TCP_RCV_COPY_FRAGHDR" }, \
- { TCP_RCV_COPY_XID, "TCP_RCV_COPY_XID" }, \
- { TCP_RCV_COPY_DATA, "TCP_RCV_COPY_DATA" }, \
- { TCP_RCV_READ_CALLDIR, "TCP_RCV_READ_CALLDIR" }, \
- { TCP_RCV_COPY_CALLDIR, "TCP_RCV_COPY_CALLDIR" }, \
- { TCP_RPC_REPLY, "TCP_RPC_REPLY" })
-
-TRACE_EVENT(xs_tcp_data_recv,
+ { STREAM_RCV_LAST_FRAG, "STREAM_RCV_LAST_FRAG" }, \
+ { STREAM_RCV_COPY_FRAGHDR, "STREAM_RCV_COPY_FRAGHDR" }, \
+ { STREAM_RCV_COPY_XID, "STREAM_RCV_COPY_XID" }, \
+ { STREAM_RCV_COPY_DATA, "STREAM_RCV_COPY_DATA" }, \
+ { STREAM_RCV_READ_CALLDIR, "STREAM_RCV_READ_CALLDIR" }, \
+ { STREAM_RCV_COPY_CALLDIR, "STREAM_RCV_COPY_CALLDIR" }, \
+ { STREAM_RPC_REPLY, "STREAM_RPC_REPLY" })
+
+TRACE_EVENT(xs_stream_data_recv,
TP_PROTO(struct sock_xprt *xs),

TP_ARGS(xs),
@@ -426,11 +426,11 @@ TRACE_EVENT(xs_tcp_data_recv,
TP_fast_assign(
__assign_str(addr, xs->xprt.address_strings[RPC_DISPLAY_ADDR]);
__assign_str(port, xs->xprt.address_strings[RPC_DISPLAY_PORT]);
- __entry->xid = xs->tcp_xid;
- __entry->flags = xs->tcp_flags;
- __entry->copied = xs->tcp_copied;
- __entry->reclen = xs->tcp_reclen;
- __entry->offset = xs->tcp_offset;
+ __entry->xid = xs->stream_xid;
+ __entry->flags = xs->stream_flags;
+ __entry->copied = xs->stream_copied;
+ __entry->reclen = xs->stream_reclen;
+ __entry->offset = xs->stream_offset;
),

TP_printk("peer=[%s]:%s xid=0x%x flags=%s copied=%lu reclen=%u offset=%lu",
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bf16883..70eb917 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1127,119 +1127,119 @@ static void xs_tcp_force_close(struct rpc_xprt *xprt)
xprt_force_disconnect(xprt);
}

-static inline void xs_tcp_read_fraghdr(struct rpc_xprt *xprt, struct xdr_skb_reader *desc)
+static inline void xs_stream_read_fraghdr(struct rpc_xprt *xprt, struct xdr_skb_reader *desc)
{
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
size_t len, used;
char *p;

- p = ((char *) &transport->tcp_fraghdr) + transport->tcp_offset;
- len = sizeof(transport->tcp_fraghdr) - transport->tcp_offset;
+ p = ((char *) &transport->stream_fraghdr) + transport->stream_offset;
+ len = sizeof(transport->stream_fraghdr) - transport->stream_offset;
used = xdr_skb_read_bits(desc, p, len);
- transport->tcp_offset += used;
+ transport->stream_offset += used;
if (used != len)
return;

- transport->tcp_reclen = ntohl(transport->tcp_fraghdr);
- if (transport->tcp_reclen & RPC_LAST_STREAM_FRAGMENT)
- transport->tcp_flags |= TCP_RCV_LAST_FRAG;
+ transport->stream_reclen = ntohl(transport->stream_fraghdr);
+ if (transport->stream_reclen & RPC_LAST_STREAM_FRAGMENT)
+ transport->stream_flags |= STREAM_RCV_LAST_FRAG;
else
- transport->tcp_flags &= ~TCP_RCV_LAST_FRAG;
- transport->tcp_reclen &= RPC_FRAGMENT_SIZE_MASK;
+ transport->stream_flags &= ~STREAM_RCV_LAST_FRAG;
+ transport->stream_reclen &= RPC_FRAGMENT_SIZE_MASK;

- transport->tcp_flags &= ~TCP_RCV_COPY_FRAGHDR;
- transport->tcp_offset = 0;
+ transport->stream_flags &= ~STREAM_RCV_COPY_FRAGHDR;
+ transport->stream_offset = 0;

/* Sanity check of the record length */
- if (unlikely(transport->tcp_reclen < 8)) {
- dprintk("RPC: invalid TCP record fragment length\n");
+ if (unlikely(transport->stream_reclen < 8)) {
+ dprintk("RPC: invalid record fragment length\n");
xs_tcp_force_close(xprt);
return;
}
- dprintk("RPC: reading TCP record fragment of length %d\n",
- transport->tcp_reclen);
+ dprintk("RPC: reading record fragment of length %d\n",
+ transport->stream_reclen);
}

-static void xs_tcp_check_fraghdr(struct sock_xprt *transport)
+static void xs_stream_check_fraghdr(struct sock_xprt *transport)
{
- if (transport->tcp_offset == transport->tcp_reclen) {
- transport->tcp_flags |= TCP_RCV_COPY_FRAGHDR;
- transport->tcp_offset = 0;
- if (transport->tcp_flags & TCP_RCV_LAST_FRAG) {
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
- transport->tcp_flags |= TCP_RCV_COPY_XID;
- transport->tcp_copied = 0;
+ if (transport->stream_offset == transport->stream_reclen) {
+ transport->stream_flags |= STREAM_RCV_COPY_FRAGHDR;
+ transport->stream_offset = 0;
+ if (transport->stream_flags & STREAM_RCV_LAST_FRAG) {
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
+ transport->stream_flags |= STREAM_RCV_COPY_XID;
+ transport->stream_copied = 0;
}
}
}

-static inline void xs_tcp_read_xid(struct sock_xprt *transport, struct xdr_skb_reader *desc)
+static inline void xs_stream_read_xid(struct sock_xprt *transport, struct xdr_skb_reader *desc)
{
size_t len, used;
char *p;

- len = sizeof(transport->tcp_xid) - transport->tcp_offset;
+ len = sizeof(transport->stream_xid) - transport->stream_offset;
dprintk("RPC: reading XID (%Zu bytes)\n", len);
- p = ((char *) &transport->tcp_xid) + transport->tcp_offset;
+ p = ((char *) &transport->stream_xid) + transport->stream_offset;
used = xdr_skb_read_bits(desc, p, len);
- transport->tcp_offset += used;
+ transport->stream_offset += used;
if (used != len)
return;
- transport->tcp_flags &= ~TCP_RCV_COPY_XID;
- transport->tcp_flags |= TCP_RCV_READ_CALLDIR;
- transport->tcp_copied = 4;
+ transport->stream_flags &= ~STREAM_RCV_COPY_XID;
+ transport->stream_flags |= STREAM_RCV_READ_CALLDIR;
+ transport->stream_copied = 4;
dprintk("RPC: reading %s XID %08x\n",
- (transport->tcp_flags & TCP_RPC_REPLY) ? "reply for"
+ (transport->stream_flags & STREAM_RPC_REPLY) ? "reply for"
: "request with",
- ntohl(transport->tcp_xid));
- xs_tcp_check_fraghdr(transport);
+ ntohl(transport->stream_xid));
+ xs_stream_check_fraghdr(transport);
}

-static inline void xs_tcp_read_calldir(struct sock_xprt *transport,
- struct xdr_skb_reader *desc)
+static inline void xs_stream_read_calldir(struct sock_xprt *transport,
+ struct xdr_skb_reader *desc)
{
size_t len, used;
u32 offset;
char *p;

/*
- * We want transport->tcp_offset to be 8 at the end of this routine
+ * We want transport->stream_offset to be 8 at the end of this routine
* (4 bytes for the xid and 4 bytes for the call/reply flag).
* When this function is called for the first time,
- * transport->tcp_offset is 4 (after having already read the xid).
+ * transport->stream_offset is 4 (after having already read the xid).
*/
- offset = transport->tcp_offset - sizeof(transport->tcp_xid);
- len = sizeof(transport->tcp_calldir) - offset;
+ offset = transport->stream_offset - sizeof(transport->stream_xid);
+ len = sizeof(transport->stream_calldir) - offset;
dprintk("RPC: reading CALL/REPLY flag (%Zu bytes)\n", len);
- p = ((char *) &transport->tcp_calldir) + offset;
+ p = ((char *) &transport->stream_calldir) + offset;
used = xdr_skb_read_bits(desc, p, len);
- transport->tcp_offset += used;
+ transport->stream_offset += used;
if (used != len)
return;
- transport->tcp_flags &= ~TCP_RCV_READ_CALLDIR;
+ transport->stream_flags &= ~STREAM_RCV_READ_CALLDIR;
/*
* We don't yet have the XDR buffer, so we will write the calldir
* out after we get the buffer from the 'struct rpc_rqst'
*/
- switch (ntohl(transport->tcp_calldir)) {
+ switch (ntohl(transport->stream_calldir)) {
case RPC_REPLY:
- transport->tcp_flags |= TCP_RCV_COPY_CALLDIR;
- transport->tcp_flags |= TCP_RCV_COPY_DATA;
- transport->tcp_flags |= TCP_RPC_REPLY;
+ transport->stream_flags |= STREAM_RCV_COPY_CALLDIR;
+ transport->stream_flags |= STREAM_RCV_COPY_DATA;
+ transport->stream_flags |= STREAM_RPC_REPLY;
break;
case RPC_CALL:
- transport->tcp_flags |= TCP_RCV_COPY_CALLDIR;
- transport->tcp_flags |= TCP_RCV_COPY_DATA;
- transport->tcp_flags &= ~TCP_RPC_REPLY;
+ transport->stream_flags |= STREAM_RCV_COPY_CALLDIR;
+ transport->stream_flags |= STREAM_RCV_COPY_DATA;
+ transport->stream_flags &= ~STREAM_RPC_REPLY;
break;
default:
dprintk("RPC: invalid request message type\n");
xs_tcp_force_close(&transport->xprt);
}
- xs_tcp_check_fraghdr(transport);
+ xs_stream_check_fraghdr(transport);
}

-static inline void xs_tcp_read_common(struct rpc_xprt *xprt,
+static inline void xs_stream_read_common(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc,
struct rpc_rqst *req)
{
@@ -1251,97 +1251,97 @@ static inline void xs_tcp_read_common(struct rpc_xprt *xprt,

rcvbuf = &req->rq_private_buf;

- if (transport->tcp_flags & TCP_RCV_COPY_CALLDIR) {
+ if (transport->stream_flags & STREAM_RCV_COPY_CALLDIR) {
/*
* Save the RPC direction in the XDR buffer
*/
- memcpy(rcvbuf->head[0].iov_base + transport->tcp_copied,
- &transport->tcp_calldir,
- sizeof(transport->tcp_calldir));
- transport->tcp_copied += sizeof(transport->tcp_calldir);
- transport->tcp_flags &= ~TCP_RCV_COPY_CALLDIR;
+ memcpy(rcvbuf->head[0].iov_base + transport->stream_copied,
+ &transport->stream_calldir,
+ sizeof(transport->stream_calldir));
+ transport->stream_copied += sizeof(transport->stream_calldir);
+ transport->stream_flags &= ~STREAM_RCV_COPY_CALLDIR;
}

len = desc->count;
- if (len > transport->tcp_reclen - transport->tcp_offset) {
+ if (len > transport->stream_reclen - transport->stream_offset) {
struct xdr_skb_reader my_desc;

- len = transport->tcp_reclen - transport->tcp_offset;
+ len = transport->stream_reclen - transport->stream_offset;
memcpy(&my_desc, desc, sizeof(my_desc));
my_desc.count = len;
- r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->stream_copied,
&my_desc, xdr_skb_read_bits);
desc->count -= r;
desc->offset += r;
} else
- r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->stream_copied,
desc, xdr_skb_read_bits);

if (r > 0) {
- transport->tcp_copied += r;
- transport->tcp_offset += r;
+ transport->stream_copied += r;
+ transport->stream_offset += r;
}
if (r != len) {
/* Error when copying to the receive buffer,
* usually because we weren't able to allocate
* additional buffer pages. All we can do now
- * is turn off TCP_RCV_COPY_DATA, so the request
+ * is turn off STREAM_RCV_COPY_DATA, so the request
* will not receive any additional updates,
* and time out.
* Any remaining data from this record will
* be discarded.
*/
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
dprintk("RPC: XID %08x truncated request\n",
- ntohl(transport->tcp_xid));
- dprintk("RPC: xprt = %p, tcp_copied = %lu, "
- "tcp_offset = %u, tcp_reclen = %u\n",
- xprt, transport->tcp_copied,
- transport->tcp_offset, transport->tcp_reclen);
+ ntohl(transport->stream_xid));
+ dprintk("RPC: xprt = %p, stream_copied = %lu, "
+ "stream_offset = %u, stream_reclen = %u\n",
+ xprt, transport->stream_copied,
+ transport->stream_offset, transport->stream_reclen);
return;
}

dprintk("RPC: XID %08x read %Zd bytes\n",
- ntohl(transport->tcp_xid), r);
- dprintk("RPC: xprt = %p, tcp_copied = %lu, tcp_offset = %u, "
- "tcp_reclen = %u\n", xprt, transport->tcp_copied,
- transport->tcp_offset, transport->tcp_reclen);
-
- if (transport->tcp_copied == req->rq_private_buf.buflen)
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
- else if (transport->tcp_offset == transport->tcp_reclen) {
- if (transport->tcp_flags & TCP_RCV_LAST_FRAG)
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ ntohl(transport->stream_xid), r);
+ dprintk("RPC: xprt = %p, stream_copied = %lu, stream_offset = %u, "
+ "stream_reclen = %u\n", xprt, transport->stream_copied,
+ transport->stream_offset, transport->stream_reclen);
+
+ if (transport->stream_copied == req->rq_private_buf.buflen)
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
+ else if (transport->stream_offset == transport->stream_reclen) {
+ if (transport->stream_flags & STREAM_RCV_LAST_FRAG)
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
}
}

/*
* Finds the request corresponding to the RPC xid and invokes the common
- * tcp read code to read the data.
+ * read code to read the data.
*/
-static inline int xs_tcp_read_reply(struct rpc_xprt *xprt,
+static inline int xs_stream_read_reply(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
container_of(xprt, struct sock_xprt, xprt);
struct rpc_rqst *req;

- dprintk("RPC: read reply XID %08x\n", ntohl(transport->tcp_xid));
+ dprintk("RPC: read reply XID %08x\n", ntohl(transport->stream_xid));

/* Find and lock the request corresponding to this xid */
spin_lock_bh(&xprt->transport_lock);
- req = xprt_lookup_rqst(xprt, transport->tcp_xid);
+ req = xprt_lookup_rqst(xprt, transport->stream_xid);
if (!req) {
dprintk("RPC: XID %08x request not found!\n",
- ntohl(transport->tcp_xid));
+ ntohl(transport->stream_xid));
spin_unlock_bh(&xprt->transport_lock);
return -1;
}

- xs_tcp_read_common(xprt, desc, req);
+ xs_stream_read_common(xprt, desc, req);

- if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
- xprt_complete_rqst(req->rq_task, transport->tcp_copied);
+ if (!(transport->stream_flags & STREAM_RCV_COPY_DATA))
+ xprt_complete_rqst(req->rq_task, transport->stream_copied);

spin_unlock_bh(&xprt->transport_lock);
return 0;
@@ -1355,7 +1355,7 @@ static inline int xs_tcp_read_reply(struct rpc_xprt *xprt,
* If we're unable to obtain the rpc_rqst we schedule the closing of the
* connection and return -1.
*/
-static int xs_tcp_read_callback(struct rpc_xprt *xprt,
+static int xs_stream_read_callback(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
@@ -1364,7 +1364,7 @@ static int xs_tcp_read_callback(struct rpc_xprt *xprt,

/* Look up and lock the request corresponding to the given XID */
spin_lock_bh(&xprt->transport_lock);
- req = xprt_lookup_bc_request(xprt, transport->tcp_xid);
+ req = xprt_lookup_bc_request(xprt, transport->stream_xid);
if (req == NULL) {
spin_unlock_bh(&xprt->transport_lock);
printk(KERN_WARNING "Callback slot table overflowed\n");
@@ -1373,24 +1373,24 @@ static int xs_tcp_read_callback(struct rpc_xprt *xprt,
}

dprintk("RPC: read callback XID %08x\n", ntohl(req->rq_xid));
- xs_tcp_read_common(xprt, desc, req);
+ xs_stream_read_common(xprt, desc, req);

- if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
- xprt_complete_bc_request(req, transport->tcp_copied);
+ if (!(transport->stream_flags & STREAM_RCV_COPY_DATA))
+ xprt_complete_bc_request(req, transport->stream_copied);
spin_unlock_bh(&xprt->transport_lock);

return 0;
}

-static inline int _xs_tcp_read_data(struct rpc_xprt *xprt,
- struct xdr_skb_reader *desc)
+static inline int _xs_stream_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
container_of(xprt, struct sock_xprt, xprt);

- return (transport->tcp_flags & TCP_RPC_REPLY) ?
- xs_tcp_read_reply(xprt, desc) :
- xs_tcp_read_callback(xprt, desc);
+ return (transport->stream_flags & STREAM_RPC_REPLY) ?
+ xs_stream_read_reply(xprt, desc) :
+ xs_stream_read_callback(xprt, desc);
}

static int xs_tcp_bc_up(struct svc_serv *serv, struct net *net)
@@ -1409,10 +1409,10 @@ static size_t xs_tcp_bc_maxpayload(struct rpc_xprt *xprt)
return PAGE_SIZE;
}
#else
-static inline int _xs_tcp_read_data(struct rpc_xprt *xprt,
- struct xdr_skb_reader *desc)
+static inline int _xs_stream_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
{
- return xs_tcp_read_reply(xprt, desc);
+ return xs_stream_read_reply(xprt, desc);
}
#endif /* CONFIG_SUNRPC_BACKCHANNEL */

@@ -1420,38 +1420,38 @@ static inline int _xs_tcp_read_data(struct rpc_xprt *xprt,
* Read data off the transport. This can be either an RPC_CALL or an
* RPC_REPLY. Relay the processing to helper functions.
*/
-static void xs_tcp_read_data(struct rpc_xprt *xprt,
- struct xdr_skb_reader *desc)
+static void xs_stream_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
container_of(xprt, struct sock_xprt, xprt);

- if (_xs_tcp_read_data(xprt, desc) == 0)
- xs_tcp_check_fraghdr(transport);
+ if (_xs_stream_read_data(xprt, desc) == 0)
+ xs_stream_check_fraghdr(transport);
else {
/*
* The transport_lock protects the request handling.
- * There's no need to hold it to update the tcp_flags.
+ * There's no need to hold it to update the stream_flags.
*/
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
}
}

-static inline void xs_tcp_read_discard(struct sock_xprt *transport, struct xdr_skb_reader *desc)
+static inline void xs_stream_read_discard(struct sock_xprt *transport, struct xdr_skb_reader *desc)
{
size_t len;

- len = transport->tcp_reclen - transport->tcp_offset;
+ len = transport->stream_reclen - transport->stream_offset;
if (len > desc->count)
len = desc->count;
desc->count -= len;
desc->offset += len;
- transport->tcp_offset += len;
+ transport->stream_offset += len;
dprintk("RPC: discarded %Zu bytes\n", len);
- xs_tcp_check_fraghdr(transport);
+ xs_stream_check_fraghdr(transport);
}

-static int xs_tcp_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len)
+static int xs_stream_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len)
{
struct rpc_xprt *xprt = rd_desc->arg.data;
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
@@ -1461,35 +1461,35 @@ static int xs_tcp_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, uns
.count = len,
};

- dprintk("RPC: xs_tcp_data_recv started\n");
+ dprintk("RPC: %s started\n", __func__);
do {
- trace_xs_tcp_data_recv(transport);
+ trace_xs_stream_data_recv(transport);
/* Read in a new fragment marker if necessary */
/* Can we ever really expect to get completely empty fragments? */
- if (transport->tcp_flags & TCP_RCV_COPY_FRAGHDR) {
- xs_tcp_read_fraghdr(xprt, &desc);
+ if (transport->stream_flags & STREAM_RCV_COPY_FRAGHDR) {
+ xs_stream_read_fraghdr(xprt, &desc);
continue;
}
/* Read in the xid if necessary */
- if (transport->tcp_flags & TCP_RCV_COPY_XID) {
- xs_tcp_read_xid(transport, &desc);
+ if (transport->stream_flags & STREAM_RCV_COPY_XID) {
+ xs_stream_read_xid(transport, &desc);
continue;
}
/* Read in the call/reply flag */
- if (transport->tcp_flags & TCP_RCV_READ_CALLDIR) {
- xs_tcp_read_calldir(transport, &desc);
+ if (transport->stream_flags & STREAM_RCV_READ_CALLDIR) {
+ xs_stream_read_calldir(transport, &desc);
continue;
}
/* Read in the request data */
- if (transport->tcp_flags & TCP_RCV_COPY_DATA) {
- xs_tcp_read_data(xprt, &desc);
+ if (transport->stream_flags & STREAM_RCV_COPY_DATA) {
+ xs_stream_read_data(xprt, &desc);
continue;
}
/* Skip over any trailing bytes on short reads */
- xs_tcp_read_discard(transport, &desc);
+ xs_stream_read_discard(transport, &desc);
} while (desc.count);
- trace_xs_tcp_data_recv(transport);
- dprintk("RPC: xs_tcp_data_recv done\n");
+ trace_xs_stream_data_recv(transport);
+ dprintk("RPC: %s done\n", __func__);
return len - desc.count;
}

@@ -1512,7 +1512,7 @@ static void xs_tcp_data_receive(struct sock_xprt *transport)
/* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
for (;;) {
lock_sock(sk);
- read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
+ read = tcp_read_sock(sk, &rd_desc, xs_stream_data_recv);
if (read <= 0) {
clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state);
release_sock(sk);
@@ -1563,12 +1563,12 @@ static void xs_tcp_state_change(struct sock *sk)
spin_lock(&xprt->transport_lock);
if (!xprt_test_and_set_connected(xprt)) {

- /* Reset TCP record info */
- transport->tcp_offset = 0;
- transport->tcp_reclen = 0;
- transport->tcp_copied = 0;
- transport->tcp_flags =
- TCP_RCV_COPY_FRAGHDR | TCP_RCV_COPY_XID;
+ /* Reset stream record info */
+ transport->stream_offset = 0;
+ transport->stream_reclen = 0;
+ transport->stream_copied = 0;
+ transport->stream_flags =
+ STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
xprt->connect_cookie++;
clear_bit(XPRT_SOCK_CONNECTING, &transport->sock_state);
xprt_clear_connecting(xprt);
--
2.7.4


2016-10-07 10:06:26

by Stefan Hajnoczi

Subject: [PATCH v2 03/10] SUNRPC: abstract tcp_read_sock() in record fragment parser

Use a function pointer to abstract tcp_read_sock()-like functions. For
TCP this function will be tcp_read_sock(). For AF_VSOCK it will be
vsock_read_sock().

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/xprtsock.h | 5 +++++
net/sunrpc/xprtsock.c | 14 ++++++++------
2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index db4c88c..bb290b5 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -9,6 +9,8 @@

#ifdef __KERNEL__

+#include <net/tcp.h> /* for sk_read_actor_t */
+
int init_socket_xprt(void);
void cleanup_socket_xprt(void);

@@ -39,6 +41,9 @@ struct sock_xprt {
unsigned long stream_copied,
stream_flags;

+ int (*stream_read_sock)(struct sock *,
+ read_descriptor_t *,
+ sk_read_actor_t);

/*
* Connection of transports
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 70eb917..62a8ec6 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1493,7 +1493,7 @@ static int xs_stream_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
return len - desc.count;
}

-static void xs_tcp_data_receive(struct sock_xprt *transport)
+static void xs_stream_data_receive(struct sock_xprt *transport)
{
struct rpc_xprt *xprt = &transport->xprt;
struct sock *sk;
@@ -1509,10 +1509,11 @@ static void xs_tcp_data_receive(struct sock_xprt *transport)
if (sk == NULL)
goto out;

- /* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
+ /* We use rd_desc to pass struct xprt to xs_stream_data_recv */
for (;;) {
lock_sock(sk);
- read = tcp_read_sock(sk, &rd_desc, xs_stream_data_recv);
+ read = transport->stream_read_sock(sk, &rd_desc,
+ xs_stream_data_recv);
if (read <= 0) {
clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state);
release_sock(sk);
@@ -1529,11 +1530,11 @@ out:
trace_xs_tcp_data_ready(xprt, read, total);
}

-static void xs_tcp_data_receive_workfn(struct work_struct *work)
+static void xs_stream_data_receive_workfn(struct work_struct *work)
{
struct sock_xprt *transport =
container_of(work, struct sock_xprt, recv_worker);
- xs_tcp_data_receive(transport);
+ xs_stream_data_receive(transport);
}

/**
@@ -1569,6 +1570,7 @@ static void xs_tcp_state_change(struct sock *sk)
transport->stream_copied = 0;
transport->stream_flags =
STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
+ transport->stream_read_sock = tcp_read_sock;
xprt->connect_cookie++;
clear_bit(XPRT_SOCK_CONNECTING, &transport->sock_state);
xprt_clear_connecting(xprt);
@@ -2995,7 +2997,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)

xprt->max_reconnect_timeout = xprt->timeout->to_maxval;

- INIT_WORK(&transport->recv_worker, xs_tcp_data_receive_workfn);
+ INIT_WORK(&transport->recv_worker, xs_stream_data_receive_workfn);
INIT_DELAYED_WORK(&transport->connect_worker, xs_tcp_setup_socket);

switch (addr->sa_family) {
--
2.7.4


2016-10-07 10:06:37

by Stefan Hajnoczi

Subject: [PATCH v2 04/10] SUNRPC: extract xs_stream_reset_state()

Extract a function to reset the record fragment parser.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
net/sunrpc/xprtsock.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 62a8ec6..dfdce75 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1538,6 +1538,28 @@ static void xs_stream_data_receive_workfn(struct work_struct *work)
}

/**
+ * xs_stream_reset_state - reset SOCK_STREAM record parser
+ * @xprt: rpc transport whose record parser should be reset
+ * @read_sock: tcp_read_sock()-like function
+ *
+ */
+static void xs_stream_reset_state(struct rpc_xprt *xprt,
+ int (*read_sock)(struct sock *,
+ read_descriptor_t *,
+ sk_read_actor_t))
+{
+ struct sock_xprt *transport = container_of(xprt,
+ struct sock_xprt, xprt);
+
+ transport->stream_offset = 0;
+ transport->stream_reclen = 0;
+ transport->stream_copied = 0;
+ transport->stream_flags =
+ STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
+ transport->stream_read_sock = read_sock;
+}
+
+/**
* xs_tcp_state_change - callback to handle TCP socket state changes
* @sk: socket whose state has changed
*
@@ -1563,14 +1585,7 @@ static void xs_tcp_state_change(struct sock *sk)
case TCP_ESTABLISHED:
spin_lock(&xprt->transport_lock);
if (!xprt_test_and_set_connected(xprt)) {
-
- /* Reset stream record info */
- transport->stream_offset = 0;
- transport->stream_reclen = 0;
- transport->stream_copied = 0;
- transport->stream_flags =
- STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
- transport->stream_read_sock = tcp_read_sock;
+ xs_stream_reset_state(xprt, tcp_read_sock);
xprt->connect_cookie++;
clear_bit(XPRT_SOCK_CONNECTING, &transport->sock_state);
xprt_clear_connecting(xprt);
--
2.7.4


2016-10-07 10:06:42

by Stefan Hajnoczi

Subject: [PATCH v2 05/10] VSOCK: add tcp_read_sock()-like vsock_read_sock() function

The tcp_read_sock() interface dequeues skbs and gives them to the
caller's callback function for processing. This interface can avoid
data copies since the caller accesses the skb instead of using its own
receive buffer.

This patch implements vsock_read_sock() for AF_VSOCK SOCK_STREAM
sockets. The implementation is only for virtio-vsock at this time, not
for the VMware VMCI transport. It is not zero-copy yet because the
virtio-vsock receive queue does not consist of skbs.

The tcp_read_sock()-like interface is needed for AF_VSOCK sunrpc
support.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 4 ++
include/net/af_vsock.h | 5 +++
net/vmw_vsock/af_vsock.c | 16 ++++++++
net/vmw_vsock/virtio_transport.c | 1 +
net/vmw_vsock/virtio_transport_common.c | 66 +++++++++++++++++++++++++++++++++
net/vmw_vsock/vmci_transport.c | 8 ++++
7 files changed, 101 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e3b30ea..bfa8f3b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -677,6 +677,7 @@ static struct virtio_transport vhost_transport = {
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
.stream_allow = virtio_transport_stream_allow,
+ .stream_read_sock = virtio_transport_stream_read_sock,

.notify_poll_in = virtio_transport_notify_poll_in,
.notify_poll_out = virtio_transport_notify_poll_out,
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 9638bfe..3d5710d 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -5,6 +5,7 @@
#include <linux/socket.h>
#include <net/sock.h>
#include <net/af_vsock.h>
+#include <net/tcp.h> /* for sk_read_actor_t */

#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE 128
#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE (1024 * 256)
@@ -123,6 +124,9 @@ int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
bool virtio_transport_stream_allow(u32 cid, u32 port);
+int virtio_transport_stream_read_sock(struct vsock_sock *vsk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);
int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr);
bool virtio_transport_dgram_allow(u32 cid, u32 port);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f275896..f46a26c 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -19,6 +19,7 @@
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/vm_sockets.h>
+#include <net/tcp.h> /* for sk_read_actor_t */

#include "vsock_addr.h"

@@ -73,6 +74,8 @@ struct vsock_sock {
void *trans;
};

+int vsock_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);
s64 vsock_stream_has_data(struct vsock_sock *vsk);
s64 vsock_stream_has_space(struct vsock_sock *vsk);
void vsock_pending_work(struct work_struct *work);
@@ -122,6 +125,8 @@ struct vsock_transport {
u64 (*stream_rcvhiwat)(struct vsock_sock *);
bool (*stream_is_active)(struct vsock_sock *);
bool (*stream_allow)(u32 cid, u32 port);
+ int (*stream_read_sock)(struct vsock_sock *, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);

/* Notification. */
int (*notify_poll_in)(struct vsock_sock *, size_t, bool *);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 8a398b3..e35fbb4 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -705,6 +705,22 @@ static void vsock_sk_destruct(struct sock *sk)
put_cred(vsk->owner);
}

+/* See documentation for tcp_read_sock() */
+int vsock_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ struct vsock_sock *vsp = vsock_sk(sk);
+
+ if (sk->sk_type != SOCK_STREAM)
+ return -EOPNOTSUPP;
+
+ if (sk->sk_state != SS_CONNECTED && sk->sk_state != SS_DISCONNECTING)
+ return -ENOTCONN;
+
+ return transport->stream_read_sock(vsp, desc, recv_actor);
+}
+EXPORT_SYMBOL(vsock_read_sock);
+
static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
int err;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 936d7ee..4f1efc8 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -432,6 +432,7 @@ static struct virtio_transport virtio_transport = {
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
.stream_allow = virtio_transport_stream_allow,
+ .stream_read_sock = virtio_transport_stream_read_sock,

.notify_poll_in = virtio_transport_notify_poll_in,
.notify_poll_out = virtio_transport_notify_poll_out,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index a53b3a1..3bfd845 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -250,6 +250,72 @@ virtio_transport_stream_dequeue(struct vsock_sock *vsk,
EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);

int
+virtio_transport_stream_read_sock(struct vsock_sock *vsk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ struct virtio_vsock_sock *vvs;
+ int ret = 0;
+
+ vvs = vsk->trans;
+
+ spin_lock_bh(&vvs->rx_lock);
+ while (!list_empty(&vvs->rx_queue)) {
+ struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
+ size_t len;
+ int used;
+
+ pkt = list_first_entry(&vvs->rx_queue,
+ struct virtio_vsock_pkt, list);
+
+ /* sk_lock is held by caller so no one else can dequeue.
+ * Unlock rx_lock so recv_actor() can sleep.
+ */
+ spin_unlock_bh(&vvs->rx_lock);
+
+ len = pkt->len - pkt->off;
+ skb = alloc_skb(len, GFP_KERNEL);
+ if (!skb) {
+ ret = -ENOMEM;
+ goto out_nolock;
+ }
+
+ memcpy(skb_put(skb, len),
+ pkt->buf + pkt->off,
+ len);
+
+ used = recv_actor(desc, skb, 0, len);
+
+ kfree_skb(skb);
+
+ spin_lock_bh(&vvs->rx_lock);
+
+ if (used > 0) {
+ ret += used;
+ pkt->off += used;
+ if (pkt->off == pkt->len) {
+ virtio_transport_dec_rx_pkt(vvs, pkt);
+ list_del(&pkt->list);
+ virtio_transport_free_pkt(pkt);
+ }
+ }
+
+ if (used <= 0 || !desc->count)
+ break;
+ }
+ spin_unlock_bh(&vvs->rx_lock);
+
+out_nolock:
+ if (ret > 0)
+ virtio_transport_send_credit_update(vsk,
+ VIRTIO_VSOCK_TYPE_STREAM, NULL);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_read_sock);
+
+int
virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
struct msghdr *msg,
size_t len, int flags)
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 4be4fbb..d6204d9 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -654,6 +654,13 @@ static bool vmci_transport_stream_allow(u32 cid, u32 port)
return true;
}

+static int vmci_transport_stream_read_sock(struct vsock_sock *vsk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ return -EOPNOTSUPP; /* not yet implemented */
+}
+
/* This is invoked as part of a tasklet that's scheduled when the VMCI
* interrupt fires. This is run in bottom-half context but it defers most of
* its work to the packet handling work queue.
@@ -2069,6 +2076,7 @@ static const struct vsock_transport vmci_transport = {
.stream_rcvhiwat = vmci_transport_stream_rcvhiwat,
.stream_is_active = vmci_transport_stream_is_active,
.stream_allow = vmci_transport_stream_allow,
+ .stream_read_sock = vmci_transport_stream_read_sock,
.notify_poll_in = vmci_transport_notify_poll_in,
.notify_poll_out = vmci_transport_notify_poll_out,
.notify_recv_init = vmci_transport_notify_recv_init,
--
2.7.4


2016-10-07 10:06:45

by Stefan Hajnoczi

Subject: [PATCH v2 06/10] SUNRPC: add AF_VSOCK support to xprtsock.c

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/xprtsock.c | 385 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 381 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index a16070d..12048a4 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -165,6 +165,7 @@ enum xprt_transports {
XPRT_TRANSPORT_RDMA = 256,
XPRT_TRANSPORT_BC_RDMA = XPRT_TRANSPORT_RDMA | XPRT_TRANSPORT_BC,
XPRT_TRANSPORT_LOCAL = 257,
+ XPRT_TRANSPORT_VSOCK = 258,
};

struct rpc_xprt {
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index dfdce75..c61a0ed 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -46,6 +46,7 @@
#include <net/checksum.h>
#include <net/udp.h>
#include <net/tcp.h>
+#include <net/af_vsock.h>

#include <trace/events/sunrpc.h>

@@ -269,6 +270,13 @@ static void xs_format_common_peer_addresses(struct rpc_xprt *xprt)
sin6 = xs_addr_in6(xprt);
snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
break;
+ case AF_VSOCK:
+ (void)rpc_ntop(sap, buf, sizeof(buf));
+ xprt->address_strings[RPC_DISPLAY_ADDR] =
+ kstrdup(buf, GFP_KERNEL);
+ snprintf(buf, sizeof(buf), "%08x",
+ ((struct sockaddr_vm *)sap)->svm_cid);
+ break;
default:
BUG();
}
@@ -1865,21 +1873,30 @@ static int xs_bind(struct sock_xprt *transport, struct socket *sock)
nloop++;
} while (err == -EADDRINUSE && nloop != 2);

- if (myaddr.ss_family == AF_INET)
+ switch (myaddr.ss_family) {
+ case AF_INET:
dprintk("RPC: %s %pI4:%u: %s (%d)\n", __func__,
&((struct sockaddr_in *)&myaddr)->sin_addr,
port, err ? "failed" : "ok", err);
- else
+ break;
+ case AF_INET6:
dprintk("RPC: %s %pI6:%u: %s (%d)\n", __func__,
&((struct sockaddr_in6 *)&myaddr)->sin6_addr,
port, err ? "failed" : "ok", err);
+ break;
+ case AF_VSOCK:
+ dprintk("RPC: %s %u:%u: %s (%d)\n", __func__,
+ ((struct sockaddr_vm *)&myaddr)->svm_cid,
+ port, err ? "failed" : "ok", err);
+ break;
+ }
return err;
}

/*
- * We don't support autobind on AF_LOCAL sockets
+ * We don't support autobind on AF_LOCAL and AF_VSOCK sockets
*/
-static void xs_local_rpcbind(struct rpc_task *task)
+static void xs_dummy_rpcbind(struct rpc_task *task)
{
xprt_set_bound(task->tk_xprt);
}
@@ -1916,6 +1933,14 @@ static inline void xs_reclassify_socket6(struct socket *sock)
&xs_slock_key[1], "sk_lock-AF_INET6-RPC", &xs_key[1]);
}

+static inline void xs_reclassify_socket_vsock(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+
+ sock_lock_init_class_and_name(sk, "slock-AF_VSOCK-RPC",
+ &xs_slock_key[1], "sk_lock-AF_VSOCK-RPC", &xs_key[1]);
+}
+
static inline void xs_reclassify_socket(int family, struct socket *sock)
{
if (WARN_ON_ONCE(!sock_allow_reclassification(sock->sk)))
@@ -1931,6 +1956,9 @@ static inline void xs_reclassify_socket(int family, struct socket *sock)
case AF_INET6:
xs_reclassify_socket6(sock);
break;
+ case AF_VSOCK:
+ xs_reclassify_socket_vsock(sock);
+ break;
}
}
#else
@@ -2676,7 +2704,7 @@ static struct rpc_xprt_ops xs_local_ops = {
.reserve_xprt = xprt_reserve_xprt,
.release_xprt = xs_tcp_release_xprt,
.alloc_slot = xprt_alloc_slot,
- .rpcbind = xs_local_rpcbind,
+ .rpcbind = xs_dummy_rpcbind,
.set_port = xs_local_set_port,
.connect = xs_local_connect,
.buf_alloc = rpc_malloc,
@@ -2768,6 +2796,10 @@ static int xs_init_anyaddr(const int family, struct sockaddr *sap)
.sin6_family = AF_INET6,
.sin6_addr = IN6ADDR_ANY_INIT,
};
+ static const struct sockaddr_vm svm = {
+ .svm_family = AF_VSOCK,
+ .svm_cid = VMADDR_CID_ANY,
+ };

switch (family) {
case AF_LOCAL:
@@ -2778,6 +2810,9 @@ static int xs_init_anyaddr(const int family, struct sockaddr *sap)
case AF_INET6:
memcpy(sap, &sin6, sizeof(sin6));
break;
+ case AF_VSOCK:
+ memcpy(sap, &svm, sizeof(svm));
+ break;
default:
dprintk("RPC: %s: Bad address family\n", __func__);
return -EAFNOSUPPORT;
@@ -3133,6 +3168,330 @@ out_err:
return ret;
}

+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+/**
+ * xs_vsock_state_change - callback to handle vsock socket state changes
+ * @sk: socket whose state has changed
+ *
+ */
+static void xs_vsock_state_change(struct sock *sk)
+{
+ struct rpc_xprt *xprt;
+
+ read_lock_bh(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+ dprintk("RPC: %s client %p...\n", __func__, xprt);
+ dprintk("RPC: state %x conn %d dead %d zapped %d sk_shutdown %d\n",
+ sk->sk_state, xprt_connected(xprt),
+ sock_flag(sk, SOCK_DEAD),
+ sock_flag(sk, SOCK_ZAPPED),
+ sk->sk_shutdown);
+
+ trace_rpc_socket_state_change(xprt, sk->sk_socket);
+
+ switch (sk->sk_state) {
+ case SS_CONNECTING:
+ /* Do nothing */
+ break;
+
+ case SS_CONNECTED:
+ spin_lock(&xprt->transport_lock);
+ if (!xprt_test_and_set_connected(xprt)) {
+ xs_stream_reset_state(xprt, vsock_read_sock);
+ xprt->connect_cookie++;
+
+ xprt_wake_pending_tasks(xprt, -EAGAIN);
+ }
+ spin_unlock(&xprt->transport_lock);
+ break;
+
+ case SS_DISCONNECTING:
+ /* TODO do we need to distinguish between client-side and server-side shutdown? */
+ /* The client initiated a shutdown of the socket */
+ xprt->connect_cookie++;
+ xprt->reestablish_timeout = 0;
+ set_bit(XPRT_CLOSING, &xprt->state);
+ smp_mb__before_atomic();
+ clear_bit(XPRT_CONNECTED, &xprt->state);
+ clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ smp_mb__after_atomic();
+ break;
+
+ case SS_UNCONNECTED:
+ xs_sock_mark_closed(xprt);
+ break;
+ }
+
+ out:
+ read_unlock_bh(&sk->sk_callback_lock);
+}
+
+/**
+ * xs_vsock_error_report - callback to handle vsock socket state errors
+ * @sk: socket
+ *
+ * Note: we don't call sock_error() since there may be a rpc_task
+ * using the socket, and so we don't want to clear sk->sk_err.
+ */
+static void xs_vsock_error_report(struct sock *sk)
+{
+ struct rpc_xprt *xprt;
+ int err;
+
+ read_lock_bh(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+
+ err = -sk->sk_err;
+ if (err == 0)
+ goto out;
+ /* Is this a reset event? */
+ if (sk->sk_state == SS_UNCONNECTED)
+ xs_sock_mark_closed(xprt);
+ dprintk("RPC: %s client %p, error=%d...\n",
+ __func__, xprt, -err);
+ trace_rpc_socket_error(xprt, sk->sk_socket, err);
+ xprt_wake_pending_tasks(xprt, err);
+ out:
+ read_unlock_bh(&sk->sk_callback_lock);
+}
+
+/**
+ * xs_vsock_finish_connecting - initialize and connect socket
+ */
+static int xs_vsock_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
+{
+ struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ int ret = -ENOTCONN;
+
+ if (!transport->inet) {
+ struct sock *sk = sock->sk;
+
+ write_lock_bh(&sk->sk_callback_lock);
+
+ xs_save_old_callbacks(transport, sk);
+
+ sk->sk_user_data = xprt;
+ sk->sk_data_ready = xs_data_ready;
+ sk->sk_state_change = xs_vsock_state_change;
+ sk->sk_write_space = xs_tcp_write_space;
+ sk->sk_error_report = xs_vsock_error_report;
+ sk->sk_allocation = GFP_ATOMIC;
+
+ xprt_clear_connected(xprt);
+
+ /* Reset to new socket */
+ transport->sock = sock;
+ transport->inet = sk;
+
+ write_unlock_bh(&sk->sk_callback_lock);
+ }
+
+ if (!xprt_bound(xprt))
+ goto out;
+
+ xs_set_memalloc(xprt);
+
+ /* Tell the socket layer to start connecting... */
+ xprt->stat.connect_count++;
+ xprt->stat.connect_start = jiffies;
+ ret = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK);
+ switch (ret) {
+ case 0:
+ xs_set_srcport(transport, sock);
+ case -EINPROGRESS:
+ /* SYN_SENT! */
+ if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
+ xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+ }
+out:
+ return ret;
+}
+
+/**
+ * xs_vsock_setup_socket - create a vsock socket and connect to a remote endpoint
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_vsock_setup_socket(struct work_struct *work)
+{
+ struct sock_xprt *transport =
+ container_of(work, struct sock_xprt, connect_worker.work);
+ struct socket *sock = transport->sock;
+ struct rpc_xprt *xprt = &transport->xprt;
+ int status = -EIO;
+
+ if (!sock) {
+ sock = xs_create_sock(xprt, transport,
+ xs_addr(xprt)->sa_family, SOCK_STREAM,
+ 0, true);
+ if (IS_ERR(sock)) {
+ status = PTR_ERR(sock);
+ goto out;
+ }
+ }
+
+ dprintk("RPC: worker connecting xprt %p via %s to "
+ "%s (port %s)\n", xprt,
+ xprt->address_strings[RPC_DISPLAY_PROTO],
+ xprt->address_strings[RPC_DISPLAY_ADDR],
+ xprt->address_strings[RPC_DISPLAY_PORT]);
+
+ status = xs_vsock_finish_connecting(xprt, sock);
+ trace_rpc_socket_connect(xprt, sock, status);
+ dprintk("RPC: %p connect status %d connected %d sock state %d\n",
+ xprt, -status, xprt_connected(xprt),
+ sock->sk->sk_state);
+ switch (status) {
+ default:
+ printk("%s: connect returned unhandled error %d\n",
+ __func__, status);
+ case -EADDRNOTAVAIL:
+ /* We're probably in TIME_WAIT. Get rid of existing socket,
+ * and retry
+ */
+ xs_tcp_force_close(xprt);
+ break;
+ case 0:
+ case -EINPROGRESS:
+ case -EALREADY:
+ xprt_unlock_connect(xprt, transport);
+ xprt_clear_connecting(xprt);
+ return;
+ case -EINVAL:
+ /* Happens, for instance, if the user specified a link
+ * local IPv6 address without a scope-id.
+ */
+ case -ECONNREFUSED:
+ case -ECONNRESET:
+ case -ENETUNREACH:
+ case -EADDRINUSE:
+ case -ENOBUFS:
+ /* retry with existing socket, after a delay */
+ xs_tcp_force_close(xprt);
+ goto out;
+ }
+ status = -EAGAIN;
+out:
+ xprt_unlock_connect(xprt, transport);
+ xprt_clear_connecting(xprt);
+ xprt_wake_pending_tasks(xprt, status);
+}
+
+/**
+ * xs_vsock_print_stats - display vsock socket-specific stats
+ * @xprt: rpc_xprt struct containing statistics
+ * @seq: output file
+ *
+ */
+static void xs_vsock_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
+{
+ struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ long idle_time = 0;
+
+ if (xprt_connected(xprt))
+ idle_time = (long)(jiffies - xprt->last_used) / HZ;
+
+ seq_printf(seq, "\txprt:\tvsock %u %lu %lu %lu %ld %lu %lu %lu "
+ "%llu %llu %lu %llu %llu\n",
+ transport->srcport,
+ xprt->stat.bind_count,
+ xprt->stat.connect_count,
+ xprt->stat.connect_time,
+ idle_time,
+ xprt->stat.sends,
+ xprt->stat.recvs,
+ xprt->stat.bad_xids,
+ xprt->stat.req_u,
+ xprt->stat.bklog_u,
+ xprt->stat.max_slots,
+ xprt->stat.sending_u,
+ xprt->stat.pending_u);
+}
+
+static struct rpc_xprt_ops xs_vsock_ops = {
+ .reserve_xprt = xprt_reserve_xprt,
+ .release_xprt = xs_tcp_release_xprt,
+ .alloc_slot = xprt_lock_and_alloc_slot,
+ .rpcbind = xs_dummy_rpcbind,
+ .set_port = xs_set_port,
+ .connect = xs_connect,
+ .buf_alloc = rpc_malloc,
+ .buf_free = rpc_free,
+ .send_request = xs_tcp_send_request,
+ .set_retrans_timeout = xprt_set_retrans_timeout_def,
+ .close = xs_tcp_shutdown,
+ .destroy = xs_destroy,
+ .print_stats = xs_vsock_print_stats,
+};
+
+static const struct rpc_timeout xs_vsock_default_timeout = {
+ .to_initval = 60 * HZ,
+ .to_maxval = 60 * HZ,
+ .to_retries = 2,
+};
+
+/**
+ * xs_setup_vsock - Set up transport to use a vsock socket
+ * @args: rpc transport creation arguments
+ *
+ */
+static struct rpc_xprt *xs_setup_vsock(struct xprt_create *args)
+{
+ struct sockaddr_vm *addr = (struct sockaddr_vm *)args->dstaddr;
+ struct sock_xprt *transport;
+ struct rpc_xprt *xprt;
+ struct rpc_xprt *ret;
+
+ xprt = xs_setup_xprt(args, xprt_tcp_slot_table_entries,
+ xprt_max_tcp_slot_table_entries);
+ if (IS_ERR(xprt))
+ return xprt;
+ transport = container_of(xprt, struct sock_xprt, xprt);
+
+ xprt->prot = 0;
+ xprt->tsh_size = sizeof(rpc_fraghdr) / sizeof(u32);
+ xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
+
+ xprt->bind_timeout = XS_BIND_TO;
+ xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+ xprt->idle_timeout = XS_IDLE_DISC_TO;
+
+ xprt->ops = &xs_vsock_ops;
+ xprt->timeout = &xs_vsock_default_timeout;
+
+ INIT_WORK(&transport->recv_worker, xs_stream_data_receive_workfn);
+ INIT_DELAYED_WORK(&transport->connect_worker, xs_vsock_setup_socket);
+
+ switch (addr->svm_family) {
+ case AF_VSOCK:
+ if (addr->svm_port == 0) {
+ dprintk("RPC: autobind not supported with AF_VSOCK\n");
+ ret = ERR_PTR(-EINVAL);
+ goto out_err;
+ }
+ xprt_set_bound(xprt);
+ xs_format_peer_addresses(xprt, "vsock", "vsock" /* TODO register official netid? */);
+ break;
+ default:
+ ret = ERR_PTR(-EAFNOSUPPORT);
+ goto out_err;
+ }
+
+ dprintk("RPC: set up xprt to %s (port %s) via AF_VSOCK\n",
+ xprt->address_strings[RPC_DISPLAY_ADDR],
+ xprt->address_strings[RPC_DISPLAY_PORT]);
+
+ if (try_module_get(THIS_MODULE))
+ return xprt;
+ ret = ERR_PTR(-EINVAL);
+out_err:
+ xs_xprt_free(xprt);
+ return ret;
+}
+#endif
+
static struct xprt_class xs_local_transport = {
.list = LIST_HEAD_INIT(xs_local_transport.list),
.name = "named UNIX socket",
@@ -3165,6 +3524,16 @@ static struct xprt_class xs_bc_tcp_transport = {
.setup = xs_setup_bc_tcp,
};

+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+static struct xprt_class xs_vsock_transport = {
+ .list = LIST_HEAD_INIT(xs_vsock_transport.list),
+ .name = "vsock",
+ .owner = THIS_MODULE,
+ .ident = XPRT_TRANSPORT_VSOCK,
+ .setup = xs_setup_vsock,
+};
+#endif
+
/**
* init_socket_xprt - set up xprtsock's sysctls, register with RPC client
*
@@ -3180,6 +3549,9 @@ int init_socket_xprt(void)
xprt_register_transport(&xs_udp_transport);
xprt_register_transport(&xs_tcp_transport);
xprt_register_transport(&xs_bc_tcp_transport);
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+ xprt_register_transport(&xs_vsock_transport);
+#endif

return 0;
}
@@ -3201,6 +3573,9 @@ void cleanup_socket_xprt(void)
xprt_unregister_transport(&xs_udp_transport);
xprt_unregister_transport(&xs_tcp_transport);
xprt_unregister_transport(&xs_bc_tcp_transport);
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+ xprt_unregister_transport(&xs_vsock_transport);
+#endif
}

static int param_set_uint_minmax(const char *val,
--
2.7.4


2016-10-07 10:06:47

by Stefan Hajnoczi

Subject: [PATCH v2 07/10] SUNRPC: drop unnecessary svc_bc_tcp_create() helper

svc_bc_tcp_create() is a helper function that simply calls
svc_bc_create_socket() with an added IPPROTO_TCP argument.

svc_bc_create_socket() then checks that the protocol argument is indeed
IPPROTO_TCP.

This isn't necessary since svc_bc_tcp_create() is the only
svc_bc_create_socket() caller. The next patch adds a second caller for
AF_VSOCK where IPPROTO_TCP will not be used.

Scrap this scheme and just call svc_bc_create_socket() directly.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
net/sunrpc/svcsock.c | 21 +++------------------
1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 57625f6..7c14028 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -70,7 +70,7 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *, int,
struct net *, struct sockaddr *,
int, int);
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
-static struct svc_xprt *svc_bc_create_socket(struct svc_serv *, int,
+static struct svc_xprt *svc_bc_create_socket(struct svc_serv *,
struct net *, struct sockaddr *,
int, int);
static void svc_bc_sock_free(struct svc_xprt *xprt);
@@ -1180,25 +1180,17 @@ static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
}

#if defined(CONFIG_SUNRPC_BACKCHANNEL)
-static struct svc_xprt *svc_bc_create_socket(struct svc_serv *, int,
+static struct svc_xprt *svc_bc_create_socket(struct svc_serv *,
struct net *, struct sockaddr *,
int, int);
static void svc_bc_sock_free(struct svc_xprt *xprt);

-static struct svc_xprt *svc_bc_tcp_create(struct svc_serv *serv,
- struct net *net,
- struct sockaddr *sa, int salen,
- int flags)
-{
- return svc_bc_create_socket(serv, IPPROTO_TCP, net, sa, salen, flags);
-}
-
static void svc_bc_tcp_sock_detach(struct svc_xprt *xprt)
{
}

static struct svc_xprt_ops svc_tcp_bc_ops = {
- .xpo_create = svc_bc_tcp_create,
+ .xpo_create = svc_bc_create_socket,
.xpo_detach = svc_bc_tcp_sock_detach,
.xpo_free = svc_bc_sock_free,
.xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
@@ -1581,7 +1573,6 @@ static void svc_sock_free(struct svc_xprt *xprt)
* Create a back channel svc_xprt which shares the fore channel socket.
*/
static struct svc_xprt *svc_bc_create_socket(struct svc_serv *serv,
- int protocol,
struct net *net,
struct sockaddr *sin, int len,
int flags)
@@ -1589,12 +1580,6 @@ static struct svc_xprt *svc_bc_create_socket(struct svc_serv *serv,
struct svc_sock *svsk;
struct svc_xprt *xprt;

- if (protocol != IPPROTO_TCP) {
- printk(KERN_WARNING "svc: only TCP sockets"
- " supported on shared back channel\n");
- return ERR_PTR(-EINVAL);
- }
-
svsk = kzalloc(sizeof(*svsk), GFP_KERNEL);
if (!svsk)
return ERR_PTR(-ENOMEM);
--
2.7.4


2016-10-07 10:06:49

by Stefan Hajnoczi

Subject: [PATCH v2 08/10] SUNRPC: add AF_VSOCK support to svc_xprt.c

Allow creation of AF_VSOCK service xprts. This is needed for the
"vsock-bc" backchannel.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/svc_xprt.h | 12 ++++++++++++
net/sunrpc/svc_xprt.c | 18 ++++++++++++++++++
2 files changed, 30 insertions(+)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index ab02a45..01834f2 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -8,6 +8,7 @@
#define SUNRPC_SVC_XPRT_H

#include <linux/sunrpc/svc.h>
+#include <linux/vm_sockets.h>

struct module;

@@ -153,12 +154,15 @@ static inline unsigned short svc_addr_port(const struct sockaddr *sa)
{
const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sa;
+ const struct sockaddr_vm *svm = (const struct sockaddr_vm *)sa;

switch (sa->sa_family) {
case AF_INET:
return ntohs(sin->sin_port);
case AF_INET6:
return ntohs(sin6->sin6_port);
+ case AF_VSOCK:
+ return svm->svm_port;
}

return 0;
@@ -171,6 +175,8 @@ static inline size_t svc_addr_len(const struct sockaddr *sa)
return sizeof(struct sockaddr_in);
case AF_INET6:
return sizeof(struct sockaddr_in6);
+ case AF_VSOCK:
+ return sizeof(struct sockaddr_vm);
}
BUG();
}
@@ -190,6 +196,7 @@ static inline char *__svc_print_addr(const struct sockaddr *addr,
{
const struct sockaddr_in *sin = (const struct sockaddr_in *)addr;
const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)addr;
+ const struct sockaddr_vm *svm = (const struct sockaddr_vm *)addr;

switch (addr->sa_family) {
case AF_INET:
@@ -203,6 +210,11 @@ static inline char *__svc_print_addr(const struct sockaddr *addr,
ntohs(sin6->sin6_port));
break;

+ case AF_VSOCK:
+ snprintf(buf, len, "%u, port=%u",
+ svm->svm_cid, svm->svm_port);
+ break;
+
default:
snprintf(buf, len, "unknown address type: %d", addr->sa_family);
break;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c3f6523..d929bc7 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -10,6 +10,7 @@
#include <linux/kthread.h>
#include <linux/slab.h>
#include <net/sock.h>
+#include <net/af_vsock.h>
#include <linux/sunrpc/addr.h>
#include <linux/sunrpc/stats.h>
#include <linux/sunrpc/svc_xprt.h>
@@ -195,6 +196,13 @@ static struct svc_xprt *__svc_xpo_create(struct svc_xprt_class *xcl,
.sin6_port = htons(port),
};
#endif
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+ struct sockaddr_vm svm = {
+ .svm_family = AF_VSOCK,
+ .svm_cid = VMADDR_CID_ANY,
+ .svm_port = port,
+ };
+#endif
struct sockaddr *sap;
size_t len;

@@ -209,6 +217,12 @@ static struct svc_xprt *__svc_xpo_create(struct svc_xprt_class *xcl,
len = sizeof(sin6);
break;
#endif
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+ case AF_VSOCK:
+ sap = (struct sockaddr *)&svm;
+ len = sizeof(svm);
+ break;
+#endif
default:
return ERR_PTR(-EAFNOSUPPORT);
}
@@ -595,6 +609,10 @@ int svc_port_is_privileged(struct sockaddr *sin)
case AF_INET6:
return ntohs(((struct sockaddr_in6 *)sin)->sin6_port)
< PROT_SOCK;
+ case AF_VSOCK:
+ return ((struct sockaddr_vm *)sin)->svm_port <=
+ LAST_RESERVED_PORT;
+
default:
return 0;
}
--
2.7.4


2016-10-07 10:06:51

by Stefan Hajnoczi

[permalink] [raw]
Subject: [PATCH v2 09/10] SUNRPC: add AF_VSOCK backchannel support

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
net/sunrpc/svcsock.c | 27 +++++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 25 +++++++++++++++++++++++++
2 files changed, 52 insertions(+)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 7c14028..857c38e 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1204,14 +1204,41 @@ static struct svc_xprt_class svc_tcp_bc_class = {
.xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
};

+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+static void svc_bc_vsock_sock_detach(struct svc_xprt *xprt)
+{
+}
+
+static struct svc_xprt_ops svc_vsock_bc_ops = {
+ .xpo_create = svc_bc_create_socket,
+ .xpo_detach = svc_bc_vsock_sock_detach,
+ .xpo_free = svc_bc_sock_free,
+ .xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
+ .xpo_secure_port = svc_sock_secure_port,
+};
+
+static struct svc_xprt_class svc_vsock_bc_class = {
+ .xcl_name = "vsock-bc",
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_vsock_bc_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD,
+};
+#endif /* CONFIG_SUNRPC_XPRT_VSOCK */
+
static void svc_init_bc_xprt_sock(void)
{
svc_reg_xprt_class(&svc_tcp_bc_class);
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+ svc_reg_xprt_class(&svc_vsock_bc_class);
+#endif
}

static void svc_cleanup_bc_xprt_sock(void)
{
svc_unreg_xprt_class(&svc_tcp_bc_class);
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+ svc_unreg_xprt_class(&svc_vsock_bc_class);
+#endif
}
#else /* CONFIG_SUNRPC_BACKCHANNEL */
static void svc_init_bc_xprt_sock(void)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c61a0ed..27df0e2 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1416,6 +1416,24 @@ static size_t xs_tcp_bc_maxpayload(struct rpc_xprt *xprt)
{
return PAGE_SIZE;
}
+
+#ifdef CONFIG_SUNRPC_XPRT_VSOCK
+static int xs_vsock_bc_up(struct svc_serv *serv, struct net *net)
+{
+ int ret;
+
+ ret = svc_create_xprt(serv, "vsock-bc", net, AF_VSOCK, 0,
+ SVC_SOCK_ANONYMOUS);
+ if (ret < 0)
+ return ret;
+ return 0;
+}
+
+static size_t xs_vsock_bc_maxpayload(struct rpc_xprt *xprt)
+{
+ return PAGE_SIZE;
+}
+#endif /* CONFIG_SUNRPC_XPRT_VSOCK */
#else
static inline int _xs_stream_read_data(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc)
@@ -3424,6 +3442,13 @@ static struct rpc_xprt_ops xs_vsock_ops = {
.close = xs_tcp_shutdown,
.destroy = xs_destroy,
.print_stats = xs_vsock_print_stats,
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+ .bc_setup = xprt_setup_bc,
+ .bc_up = xs_vsock_bc_up,
+ .bc_maxpayload = xs_vsock_bc_maxpayload,
+ .bc_free_rqst = xprt_free_bc_rqst,
+ .bc_destroy = xprt_destroy_bc,
+#endif
};

static const struct rpc_timeout xs_vsock_default_timeout = {
--
2.7.4


2016-10-07 10:06:53

by Stefan Hajnoczi

[permalink] [raw]
Subject: [PATCH v2 10/10] NFS: add AF_VSOCK support to NFS client

This patch adds AF_VSOCK to the NFS client. Mounts can now use the
"vsock" proto option and pass "vsock:<cid>" address strings, which are
interpreted by sunrpc for xprt creation.
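The way sunrpc tells these strings apart (per the rpc_pton() change in
patch 01: a "vsock:" prefix is checked first, otherwise a ':' indicates
IPv6 and anything else IPv4) can be sketched as follows. This is an
illustrative Python sketch, not code from the series:

```python
def classify_rpc_address(buf: str) -> str:
    # Mirror rpc_pton()'s dispatch: "vsock:<cid>" strings are checked
    # first, before the generic IPv6-vs-IPv4 heuristic based on ':'.
    if buf.startswith("vsock:"):
        return "vsock"
    if ":" in buf:
        return "ipv6"
    return "ipv4"
```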

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/nfs/client.c | 2 ++
fs/nfs/super.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 1e10678..71c5b33 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -35,6 +35,7 @@
#include <linux/vfs.h>
#include <linux/inet.h>
#include <linux/in6.h>
+#include <linux/vm_sockets.h>
#include <linux/slab.h>
#include <linux/idr.h>
#include <net/ipv6.h>
@@ -434,6 +435,7 @@ void nfs_init_timeout_values(struct rpc_timeout *to, int proto,
switch (proto) {
case XPRT_TRANSPORT_TCP:
case XPRT_TRANSPORT_RDMA:
+ case XPRT_TRANSPORT_VSOCK:
if (retrans == NFS_UNSPEC_RETRANS)
to->to_retries = NFS_DEF_TCP_RETRANS;
if (timeo == NFS_UNSPEC_TIMEO || to->to_retries == 0)
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index d396013..01e0589 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -191,7 +191,7 @@ static const match_table_t nfs_mount_option_tokens = {

enum {
Opt_xprt_udp, Opt_xprt_udp6, Opt_xprt_tcp, Opt_xprt_tcp6, Opt_xprt_rdma,
- Opt_xprt_rdma6,
+ Opt_xprt_rdma6, Opt_xprt_vsock,

Opt_xprt_err
};
@@ -203,6 +203,7 @@ static const match_table_t nfs_xprt_protocol_tokens = {
{ Opt_xprt_tcp6, "tcp6" },
{ Opt_xprt_rdma, "rdma" },
{ Opt_xprt_rdma6, "rdma6" },
+ { Opt_xprt_vsock, "vsock" },

{ Opt_xprt_err, NULL }
};
@@ -971,6 +972,8 @@ static int nfs_verify_server_address(struct sockaddr *addr)
struct in6_addr *sa = &((struct sockaddr_in6 *)addr)->sin6_addr;
return !ipv6_addr_any(sa);
}
+ case AF_VSOCK:
+ return 1;
}

dfprintk(MOUNT, "NFS: Invalid IP address specified\n");
@@ -1000,6 +1003,7 @@ static void nfs_validate_transport_protocol(struct nfs_parsed_mount_data *mnt)
case XPRT_TRANSPORT_UDP:
case XPRT_TRANSPORT_TCP:
case XPRT_TRANSPORT_RDMA:
+ case XPRT_TRANSPORT_VSOCK:
break;
default:
mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
@@ -1481,6 +1485,11 @@ static int nfs_parse_mount_options(char *raw,
mnt->nfs_server.protocol = XPRT_TRANSPORT_RDMA;
xprt_load_transport(string);
break;
+ case Opt_xprt_vsock:
+ protofamily = AF_VSOCK;
+ mnt->flags &= ~NFS_MOUNT_TCP;
+ mnt->nfs_server.protocol = XPRT_TRANSPORT_VSOCK;
+ break;
default:
dfprintk(MOUNT, "NFS: unrecognized "
"transport protocol\n");
--
2.7.4


2016-10-07 15:15:28

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

Hi Stefan-

> On Oct 7, 2016, at 6:01 AM, Stefan Hajnoczi <[email protected]> wrote:
>
> AF_VSOCK addresses are a Context ID (CID) and port number tuple. The
> CID is a unique address, similar to an IP address on a local subnet.
>
> Extend the addr.h functions to handle AF_VSOCK addresses.

I'm wondering if there's a specification for how to construct
the universal address form of an AF_VSOCK address. This would
be needed for populating an fs_locations response, or for
updating the NFS server's local rpcbind service.

A traditional NFS server employs IP-address based access
control. How does that work with the new address family? Do
you expect changes to mountd or exportfs?

Is there a standard that defines the "vsock" netid? A new
netid requires at least an IANA action. Is there a document
that describes how RPC works with a VSOCK transport?

This work appears to define two separate things: a new address
family, and a new transport type. Wouldn't it be cleaner to
dispense with the "proto=vsock" piece, and just support TCP
over AF_VSOCK (just as it works for AF_INET and AF_INET6) ?

At Connectathon, we discussed what happens when a guest is
live-migrated to another host with a vsock-enabled NFSD.
Essentially, the server at the known-local address would
change identities and its content could be completely
different. For instance, the file handles would all change,
including the file handle of the export's root directory.
Clients don't tolerate that especially well.

Can't a Docker-based or kvm-based guest simply mount one of
the host's local file systems directly? What would be the
value of inserting NFS into that picture?


> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> v2:
> * Replace CONFIG_VSOCKETS with CONFIG_SUNRPC_XPRT_VSOCK to prevent
> build failures when SUNRPC=y and VSOCKETS=m. Built-in code cannot
> link against code in a module.

> ---
> include/linux/sunrpc/addr.h | 44 ++++++++++++++++++++++++++++++++++
> net/sunrpc/Kconfig | 10 ++++++++
> net/sunrpc/addr.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 111 insertions(+)
>
> diff --git a/include/linux/sunrpc/addr.h b/include/linux/sunrpc/addr.h
> index 5c9c6cd..c4169bc 100644
> --- a/include/linux/sunrpc/addr.h
> +++ b/include/linux/sunrpc/addr.h
> @@ -10,6 +10,7 @@
> #include <linux/socket.h>
> #include <linux/in.h>
> #include <linux/in6.h>
> +#include <linux/vm_sockets.h>
> #include <net/ipv6.h>
>
> size_t rpc_ntop(const struct sockaddr *, char *, const size_t);
> @@ -26,6 +27,8 @@ static inline unsigned short rpc_get_port(const struct sockaddr *sap)
> return ntohs(((struct sockaddr_in *)sap)->sin_port);
> case AF_INET6:
> return ntohs(((struct sockaddr_in6 *)sap)->sin6_port);
> + case AF_VSOCK:
> + return ((struct sockaddr_vm *)sap)->svm_port;
> }
> return 0;
> }
> @@ -40,6 +43,9 @@ static inline void rpc_set_port(struct sockaddr *sap,
> case AF_INET6:
> ((struct sockaddr_in6 *)sap)->sin6_port = htons(port);
> break;
> + case AF_VSOCK:
> + ((struct sockaddr_vm *)sap)->svm_port = port;
> + break;
> }
> }
>
> @@ -106,6 +112,40 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
> }
> #endif /* !(IS_ENABLED(CONFIG_IPV6) */
>
> +#if IS_ENABLED(CONFIG_VSOCKETS)
> +static inline bool rpc_cmp_vsock_addr(const struct sockaddr *sap1,
> + const struct sockaddr *sap2)
> +{
> + const struct sockaddr_vm *svm1 = (const struct sockaddr_vm *)sap1;
> + const struct sockaddr_vm *svm2 = (const struct sockaddr_vm *)sap2;
> +
> + return svm1->svm_cid == svm2->svm_cid;
> +}
> +
> +static inline bool __rpc_copy_vsock_addr(struct sockaddr *dst,
> + const struct sockaddr *src)
> +{
> + const struct sockaddr_vm *ssvm = (const struct sockaddr_vm *)src;
> + struct sockaddr_vm *dsvm = (struct sockaddr_vm *)dst;
> +
> + dsvm->svm_family = ssvm->svm_family;
> + dsvm->svm_cid = ssvm->svm_cid;
> + return true;
> +}
> +#else /* !(IS_ENABLED(CONFIG_VSOCKETS) */
> +static inline bool rpc_cmp_vsock_addr(const struct sockaddr *sap1,
> + const struct sockaddr *sap2)
> +{
> + return false;
> +}
> +
> +static inline bool __rpc_copy_vsock_addr(struct sockaddr *dst,
> + const struct sockaddr *src)
> +{
> + return false;
> +}
> +#endif /* !(IS_ENABLED(CONFIG_VSOCKETS) */
> +
> /**
> * rpc_cmp_addr - compare the address portion of two sockaddrs.
> * @sap1: first sockaddr
> @@ -125,6 +165,8 @@ static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
> return rpc_cmp_addr4(sap1, sap2);
> case AF_INET6:
> return rpc_cmp_addr6(sap1, sap2);
> + case AF_VSOCK:
> + return rpc_cmp_vsock_addr(sap1, sap2);
> }
> }
> return false;
> @@ -161,6 +203,8 @@ static inline bool rpc_copy_addr(struct sockaddr *dst,
> return __rpc_copy_addr4(dst, src);
> case AF_INET6:
> return __rpc_copy_addr6(dst, src);
> + case AF_VSOCK:
> + return __rpc_copy_vsock_addr(dst, src);
> }
> return false;
> }
> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
> index 04ce2c0..d18fc1a 100644
> --- a/net/sunrpc/Kconfig
> +++ b/net/sunrpc/Kconfig
> @@ -61,3 +61,13 @@ config SUNRPC_XPRT_RDMA
>
> If unsure, or you know there is no RDMA capability on your
> hardware platform, say N.
> +
> +config SUNRPC_XPRT_VSOCK
> + bool "RPC-over-AF_VSOCK transport"
> + depends on SUNRPC && VSOCKETS && !(SUNRPC=y && VSOCKETS=m)
> + default SUNRPC && VSOCKETS
> + help
> + This option allows the NFS client and server to use the AF_VSOCK
> + transport to communicate between virtual machines and the host.
> +
> + If unsure, say Y.
> diff --git a/net/sunrpc/addr.c b/net/sunrpc/addr.c
> index 2e0a6f9..f4dd962 100644
> --- a/net/sunrpc/addr.c
> +++ b/net/sunrpc/addr.c
> @@ -16,11 +16,14 @@
> * RFC 4291, Section 2.2 for details on IPv6 presentation formats.
> */
>
> + /* TODO register netid and uaddr with IANA? (See RFC 5665 5.1/5.2) */
> +
> #include <net/ipv6.h>
> #include <linux/sunrpc/addr.h>
> #include <linux/sunrpc/msg_prot.h>
> #include <linux/slab.h>
> #include <linux/export.h>
> +#include <linux/vm_sockets.h>
>
> #if IS_ENABLED(CONFIG_IPV6)
>
> @@ -108,6 +111,26 @@ static size_t rpc_ntop6(const struct sockaddr *sap,
>
> #endif /* !IS_ENABLED(CONFIG_IPV6) */
>
> +#ifdef CONFIG_SUNRPC_XPRT_VSOCK
> +
> +static size_t rpc_ntop_vsock(const struct sockaddr *sap,
> + char *buf, const size_t buflen)
> +{
> + const struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
> +
> + return snprintf(buf, buflen, "%u", svm->svm_cid);
> +}
> +
> +#else /* !CONFIG_SUNRPC_XPRT_VSOCK */
> +
> +static size_t rpc_ntop_vsock(const struct sockaddr *sap,
> + char *buf, const size_t buflen)
> +{
> + return 0;
> +}
> +
> +#endif /* !CONFIG_SUNRPC_XPRT_VSOCK */
> +
> static int rpc_ntop4(const struct sockaddr *sap,
> char *buf, const size_t buflen)
> {
> @@ -132,6 +155,8 @@ size_t rpc_ntop(const struct sockaddr *sap, char *buf, const size_t buflen)
> return rpc_ntop4(sap, buf, buflen);
> case AF_INET6:
> return rpc_ntop6(sap, buf, buflen);
> + case AF_VSOCK:
> + return rpc_ntop_vsock(sap, buf, buflen);
> }
>
> return 0;
> @@ -229,6 +254,34 @@ static size_t rpc_pton6(struct net *net, const char *buf, const size_t buflen,
> }
> #endif
>
> +#ifdef CONFIG_SUNRPC_XPRT_VSOCK
> +static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
> + struct sockaddr *sap, const size_t salen)
> +{
> + const size_t prefix_len = strlen("vsock:");
> + struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
> + unsigned int cid;
> +
> + if (strncmp(buf, "vsock:", prefix_len) != 0 ||
> + salen < sizeof(struct sockaddr_vm))
> + return 0;
> +
> + if (kstrtouint(buf + prefix_len, 10, &cid) != 0)
> + return 0;
> +
> + memset(svm, 0, sizeof(struct sockaddr_vm));
> + svm->svm_family = AF_VSOCK;
> + svm->svm_cid = cid;
> + return sizeof(struct sockaddr_vm);
> +}
> +#else
> +static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
> + struct sockaddr *sap, const size_t salen)
> +{
> + return 0;
> +}
> +#endif
> +
> /**
> * rpc_pton - Construct a sockaddr in @sap
> * @net: applicable network namespace
> @@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
> {
> unsigned int i;
>
> + /* TODO is there a nicer way to distinguish vsock addresses? */
> + if (strncmp(buf, "vsock:", 6) == 0)
> + return rpc_pton_vsock(buf, buflen, sap, salen);
> +
> for (i = 0; i < buflen; i++)
> if (buf[i] == ':')
> return rpc_pton6(net, buf, buflen, sap, salen);
> --
> 2.7.4
>

--
Chuck Lever




2016-10-08 00:42:19

by Cedric Blancher

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] NFS: add AF_VSOCK support to NFS client

So basically you're creating a new (Red Hat) Linux-only wormhole which
bypasses all network security between VM host and guest and needs
extra work&thought&tool support (wireshark, valgrind, ...) to handle,
trace, debug, monitor and secure?

Ced

On 7 October 2016 at 12:01, Stefan Hajnoczi <[email protected]> wrote:
> This patch series enables AF_VSOCK address family support in the NFS client.
> You can also get the commits from the vsock-nfs branch at
> https://github.com/stefanha/linux.git.
>
> The AF_VSOCK address family provides dgram and stream socket communication
> between virtual machines and hypervisors. VMware VMCI and virtio (for KVM)
> transports are available, see net/vmw_vsock.
>
> The goal of this work is sharing files between virtual machines and
> hypervisors. AF_VSOCK is well-suited to this because it requires no
> configuration inside the virtual machine, making it simple to manage and
> reliable.
>
> Why NFS over AF_VSOCK?
> ----------------------
> It is unusual to add a new NFS transport, only TCP, RDMA, and UDP are currently
> supported. Here is the rationale for adding AF_VSOCK.
>
> Sharing files with a virtual machine can be configured manually:
> 1. Add a dedicated network card to the virtual machine. It will be used for
> NFS traffic.
> 2. Configure a local subnet and assign IP addresses to the virtual machine and
> hypervisor
> 3. Configure an NFS export on the hypervisor and start the NFS server
> 4. Mount the export inside the virtual machine
>
> Automating these steps poses a problem: modifying network configuration inside
> the virtual machine is invasive. It's hard to add a network interface to an
> arbitrary running system in an automated fashion, considering the network
> management tools, firewall rules, IP address usage, etc.
>
> Furthermore, the user may disrupt file sharing by accident when they add
> firewall rules, restart networking, etc because the NFS network interface is
> visible alongside the network interfaces managed by the user.
>
> AF_VSOCK is a zero-configuration network transport that avoids these problems.
> Adding it to a virtual machine is non-invasive. It also avoids accidental
> misconfiguration by the user. This is why "guest agents" and other services in
> various hypervisors (KVM, Xen, VMware, VirtualBox) do not use regular network
> interfaces.
>
> This is why AF_VSOCK is appropriate for providing shared files as a hypervisor
> service.
>
> The approach in this series
> ---------------------------
> AF_VSOCK stream sockets can be used for NFSv4.1 much in the same way as TCP.
> RFC 1831 record fragments divide messages since SOCK_STREAM semantics are
> present. The backchannel shares the connection just like the default TCP
> configuration.
>
> Addresses are <Context ID, Port Number> pairs. These patches use "vsock:<cid>"
> string representation to distinguish AF_VSOCK addresses from IPv4 and IPv6
> numeric addresses.
>
> The following example mounts /export from the hypervisor (CID 2) inside the
> virtual machine (CID 3):
>
> # /sbin/mount.nfs 2:/export /mnt -o clientaddr=3,proto=vsock
>
> Please see the nfs-utils patch series I have just sent to
> [email protected] for the necessary patches.
>
> Status
> ------
> The virtio-vsock transport was merged in Linux 4.8 and the vhost-vsock-pci
> device is available in QEMU git master. This means the underlying AF_VSOCK
> transport for KVM is now available upstream.
>
> I have begun work on nfsd support in the kernel and nfs-utils. This is not
> complete yet and will be sent as separate patch series.
>
> Stefan Hajnoczi (10):
> SUNRPC: add AF_VSOCK support to addr.[ch]
> SUNRPC: rename "TCP" record parser to "stream" parser
> SUNRPC: abstract tcp_read_sock() in record fragment parser
> SUNRPC: extract xs_stream_reset_state()
> VSOCK: add tcp_read_sock()-like vsock_read_sock() function
> SUNRPC: add AF_VSOCK support to xprtsock.c
> SUNRPC: drop unnecessary svc_bc_tcp_create() helper
> SUNRPC: add AF_VSOCK support to svc_xprt.c
> SUNRPC: add AF_VSOCK backchannel support
> NFS: add AF_VSOCK support to NFS client
>
> drivers/vhost/vsock.c | 1 +
> fs/nfs/client.c | 2 +
> fs/nfs/super.c | 11 +-
> include/linux/sunrpc/addr.h | 44 ++
> include/linux/sunrpc/svc_xprt.h | 12 +
> include/linux/sunrpc/xprt.h | 1 +
> include/linux/sunrpc/xprtsock.h | 36 +-
> include/linux/virtio_vsock.h | 4 +
> include/net/af_vsock.h | 5 +
> include/trace/events/sunrpc.h | 28 +-
> net/sunrpc/Kconfig | 10 +
> net/sunrpc/addr.c | 57 +++
> net/sunrpc/svc_xprt.c | 18 +
> net/sunrpc/svcsock.c | 48 ++-
> net/sunrpc/xprtsock.c | 703 +++++++++++++++++++++++++-------
> net/vmw_vsock/af_vsock.c | 16 +
> net/vmw_vsock/virtio_transport.c | 1 +
> net/vmw_vsock/virtio_transport_common.c | 66 +++
> net/vmw_vsock/vmci_transport.c | 8 +
> 19 files changed, 880 insertions(+), 191 deletions(-)
>
> --
> 2.7.4
>



--
Cedric Blancher <[email protected]>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

2016-10-20 14:36:05

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] NFS: add AF_VSOCK support to NFS client

On Sat, Oct 08, 2016 at 02:42:17AM +0200, Cedric Blancher wrote:
> So basically you're creating a new (Red Hat) Linux-only wormhole which
> bypasses all network security between VM host and guest and needs
> extra work&thought&tool support (wireshark, valgrind, ...) to handle,
> trace, debug, monitor and secure?

vsock is not Linux-only and not Red Hat-only. There are two
paravirtualized hardware interfaces (VMware VMCI and KVM's
virtio-vsock). Drivers for other operating systems exist and can be
written for OSes that are not yet supported. The virtio-vsock spec is
public.

Regarding bypassing network security, this is a non-routable
guest<->host protocol. It is very locked down by design.

You can simply not use the device if you prefer to go inside the guest
and configure a traditional NFS TCP/IP setup instead. As mentioned in
the cover letter, that is not feasible for cloud providers and other
scenarios where reaching inside the guest isn't allowed.



2016-10-21 13:04:11

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

On Fri, Oct 07, 2016 at 11:15:20AM -0400, Chuck Lever wrote:
> > On Oct 7, 2016, at 6:01 AM, Stefan Hajnoczi <[email protected]> wrote:
> >
> > AF_VSOCK addresses are a Context ID (CID) and port number tuple. The
> > CID is a unique address, similar to an IP address on a local subnet.
> >
> > Extend the addr.h functions to handle AF_VSOCK addresses.

Thanks for your reply. A lot of these areas are covered in the
presentation I gave at Connectathon 2016. Here is the link in case
you're interested:
http://vmsplice.net/~stefan/stefanha-connectathon-2016.pdf

Replies to your questions below:

> I'm wondering if there's a specification for how to construct
> the universal address form of an AF_VSOCK address. This would
> be needed for populating an fs_locations response, or for
> updating the NFS server's local rpcbind service.

The uaddr format I'm proposing is "vsock:cid.port". Both cid and port
are unsigned 32-bit integers. The netid I'm proposing is "vsock".
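As a sketch only (assuming the "vsock:" prefix is literally part of the
uaddr, as written above), encoding and decoding the proposed form might
look like:

```python
def vsock_uaddr(cid: int, port: int) -> str:
    # Proposed universal address form "vsock:cid.port"; cid and port
    # are both unsigned 32-bit integers.
    for v in (cid, port):
        if not 0 <= v <= 0xFFFFFFFF:
            raise ValueError("cid/port must fit in 32 bits")
    return f"vsock:{cid}.{port}"

def parse_vsock_uaddr(uaddr: str) -> tuple[int, int]:
    # Inverse of vsock_uaddr(); rejects strings without the prefix.
    if not uaddr.startswith("vsock:"):
        raise ValueError("not a vsock uaddr")
    cid_s, _, port_s = uaddr[len("vsock:"):].partition(".")
    return int(cid_s, 10), int(port_s, 10)
```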

> A traditional NFS server employs IP-address based access
> control. How does that work with the new address family? Do
> you expect changes to mountd or exportfs?

Yes, the /etc/exports syntax I'm proposing is:

/srv/vm001 vsock:5(rw)

This allows CID 5 to access /srv/vm001. The CID is equivalent to an IP
address.

This patch series only addresses the NFS client side but I will be
sending nfsd and nfs-utils rpc.mountd patches once I've completed the
work.

The way it works so far is that /proc/net/rpc/auth.unix.ip is extended
to support not just IP but also vsock addresses. So the cache is
separated by network address family (IP or vsock).

> Is there a standard that defines the "vsock" netid? A new
> netid requires at least an IANA action. Is there a document
> that describes how RPC works with a VSOCK transport?

I haven't submitted a request to IANA yet. The RPC framing is the same
as for TCP (it uses the same Record Marking to delimit message
boundaries in the stream).
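Since the framing is TCP's, each message on the stream is delimited by
RFC 5531 record marking: a 4-byte big-endian header per fragment, whose
high bit flags the last fragment and whose low 31 bits give the
fragment length. A minimal Python sketch of that framing (illustration
only, not code from this series):

```python
import struct

LAST_FRAGMENT = 0x80000000  # high bit of the header marks the final fragment

def frame_record(payload: bytes) -> bytes:
    # Send one RPC message as a single record fragment.
    return struct.pack(">I", LAST_FRAGMENT | len(payload)) + payload

def read_record(data: bytes) -> tuple[bytes, bytes]:
    # Reassemble one record from a stream; returns (record, unconsumed bytes).
    record = bytearray()
    while True:
        (header,) = struct.unpack(">I", data[:4])
        size = header & 0x7FFFFFFF
        record += data[4:4 + size]
        data = data[4 + size:]
        if header & LAST_FRAGMENT:
            return bytes(record), data
```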

> This work appears to define two separate things: a new address
> family, and a new transport type. Wouldn't it be cleaner to
> dispense with the "proto=vsock" piece, and just support TCP
> over AF_VSOCK (just as it works for AF_INET and AF_INET6) ?

Can you explain how this would simplify things? I don't think much of
the code is transport-specific (the stream parsing is already shared
with TCP). Most of the code is to add the new address family. AF_VSOCK
already offers TCP-like semantics natively so no extra protocol is used
on top.

> At Connectathon, we discussed what happens when a guest is
> live-migrated to another host with a vsock-enabled NFSD.
> Essentially, the server at the known-local address would
> change identities and its content could be completely
> different. For instance, the file handles would all change,
> including the file handle of the export's root directory.
> Clients don't tolerate that especially well.

This issue remains. I looked into checkpoint-resume style TCP_REPAIR to
allow existing connections to persist across migration but I hope a
simpler approach can be taken.

Let's forget about AF_VSOCK: the problem is that an NFS client loses
connectivity to the old server and must connect to the new server. We
want to keep all state (open files, etc). Are configurations like that
possible with Linux nfsd?

> Can't a Docker-based or kvm-based guest simply mount one of
> the host's local file systems directly? What would be the
> value of inserting NFS into that picture?

The host cannot access a file system currently mounted by the guest and
vice versa. NFS allows sharing of a file system between the host and
one or more guests.



2016-10-21 14:22:23

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]


> On Oct 21, 2016, at 9:04 AM, Stefan Hajnoczi <[email protected]> wrote:
>
> On Fri, Oct 07, 2016 at 11:15:20AM -0400, Chuck Lever wrote:
>>> On Oct 7, 2016, at 6:01 AM, Stefan Hajnoczi <[email protected]> wrote:
>>>
>>> AF_VSOCK addresses are a Context ID (CID) and port number tuple. The
>>> CID is a unique address, similar to an IP address on a local subnet.
>>>
>>> Extend the addr.h functions to handle AF_VSOCK addresses.
>
> Thanks for your reply. A lot of these areas are covered in the
> presentation I gave at Connectathon 2016. Here is the link in case
> you're interested:
> http://vmsplice.net/~stefan/stefanha-connectathon-2016.pdf
>
> Replies to your questions below:
>
>> I'm wondering if there's a specification for how to construct
>> the universal address form of an AF_VSOCK address. This would
>> be needed for populating an fs_locations response, or for
>> updating the NFS server's local rpcbind service.
>
> The uaddr format I'm proposing is "vsock:cid.port". Both cid and port
> are unsigned 32-bit integers. The netid I'm proposing is "vsock".
>
>> A traditional NFS server employs IP-address based access
>> control. How does that work with the new address family? Do
>> you expect changes to mountd or exportfs?
>
> Yes, the /etc/exports syntax I'm proposing is:
>
> /srv/vm001 vsock:5(rw)
>
> This allows CID 5 to access /srv/vm001. The CID is equivalent to an IP
> address.
>
> This patch series only addresses the NFS client side but I will be
> sending nfsd and nfs-utils rpc.mountd patches once I've completed the
> work.
>
> The way it works so far is that /proc/net/rpc/auth.unix.ip is extended
> to support not just IP but also vsock addresses. So the cache is
> separated by network address family (IP or vsock).
>
>> Is there a standard that defines the "vsock" netid? A new
>> netid requires at least an IANA action. Is there a document
>> that describes how RPC works with a VSOCK transport?
>
> I haven't submitted a request to IANA yet. The RPC framing is the same
> as for TCP (it uses the same Record Marking to delimit message
> boundaries in the stream).

>> This work appears to define two separate things: a new address
>> family, and a new transport type. Wouldn't it be cleaner to
>> dispense with the "proto=vsock" piece, and just support TCP
>> over AF_VSOCK (just as it works for AF_INET and AF_INET6) ?
>
> Can you explain how this would simplify things? I don't think much of
> the code is transport-specific (the stream parsing is already shared
> with TCP). Most of the code is to add the new address family. AF_VSOCK
> already offers TCP-like semantics natively so no extra protocol is used
> on top.

If this really is just TCP on a new address family, then "tcpv"
is more in line with previous work, and you can get away with
just an IANA action for a new netid, since RPC-over-TCP is
already specified.


>> At Connectathon, we discussed what happens when a guest is
>> live-migrated to another host with a vsock-enabled NFSD.
>> Essentially, the server at the known-local address would
>> change identities and its content could be completely
>> different. For instance, the file handles would all change,
>> including the file handle of the export's root directory.
>> Clients don't tolerate that especially well.
>
> This issue remains. I looked into checkpoint-resume style TCP_REPAIR to
> allow existing connections to persist across migration but I hope a
> simpler approach can be taken.
>
> Let's forget about AF_VSOCK: the problem is that an NFS client loses
> connectivity to the old server and must connect to the new server. We
> want to keep all state (open files, etc). Are configurations like that
> possible with Linux nfsd?

You have two problems:

- OPEN and LOCK state would appear to vanish on the server. To recover
this state you would need an NFS server restart and grace period on the
destination host to allow the client to use reclaiming OPENs.

- The FSID and filehandles would be different.

You could mandate fixed well-known filehandles and FSIDs, just as you
are doing with the vsock addresses.

Or, implement NFSv4 migration in the Linux NFS server. Migrate the data
and the VM at the same time, then the filehandles and state can come
along for the ride, and no grace period is needed.


--
Chuck Lever




2016-10-27 01:05:37

by Cedric Blancher

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] NFS: add AF_VSOCK support to NFS client

On 20 October 2016 at 16:36, Stefan Hajnoczi <[email protected]> wrote:
> On Sat, Oct 08, 2016 at 02:42:17AM +0200, Cedric Blancher wrote:
>> So basically you're creating a new (Red Hat) Linux-only wormhole which
>> bypasses all network security between VM host and guest and needs
>> extra work&thought&tool support (wireshark, valgrind, ...) to handle,
>> trace, debug, monitor and secure?
>
> vsock is not Linux-only and not Red Hat-only.

This is clearly Red Hat only. Debian and Ubuntu folks already have
rejected this out of security concerns, so why are you pressing this?
Where is support for other operating systems, like Windows, FreeBSD or
Solaris/Illumos?

Ced
--
Cedric Blancher <[email protected]>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

2016-11-30 10:21:14

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] NFS: add AF_VSOCK support to NFS client

On Thu, Oct 27, 2016 at 03:05:35AM +0200, Cedric Blancher wrote:
> On 20 October 2016 at 16:36, Stefan Hajnoczi <[email protected]> wrote:
> > On Sat, Oct 08, 2016 at 02:42:17AM +0200, Cedric Blancher wrote:
> >> So basically you're creating a new (Red Hat) Linux-only wormhole which
> >> bypasses all network security between VM host and guest and needs
> >> extra work&thought&tool support (wireshark, valgrind, ...) to handle,
> >> trace, debug, monitor and secure?
> >
> > vsock is not Linux-only and not Red Hat-only.
>
> This is clearly Red Hat only. Debian and Ubuntu folks already have
> rejected this out of security concerns, so why are you pressing this?

Are you aware that Debian ships the vsock.ko and
vmw_vsock_vmci_transport.ko kernel modules?
https://packages.debian.org/jessie/amd64/linux-image-3.16.0-4-amd64/filelist

Do you have a URL regarding virtio-vsock in Debian and Ubuntu? There
was no discussion upstream in QEMU or Linux that I can recall.

Stefan



2017-05-18 14:04:27

by Jeff Layton

Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

On Fri, 2016-10-07 at 11:01 +0100, Stefan Hajnoczi wrote:
> AF_VSOCK addresses are a Context ID (CID) and port number tuple. The
> CID is a unique address, similar to an IP address on a local subnet.
>
> Extend the addr.h functions to handle AF_VSOCK addresses.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> v2:
> * Replace CONFIG_VSOCKETS with CONFIG_SUNRPC_XPRT_VSOCK to prevent
> build failures when SUNRPC=y and VSOCKETS=m. Built-in code cannot
> link against code in a module.
> ---
> include/linux/sunrpc/addr.h | 44 ++++++++++++++++++++++++++++++++++
> net/sunrpc/Kconfig | 10 ++++++++
> net/sunrpc/addr.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 111 insertions(+)
>
> diff --git a/include/linux/sunrpc/addr.h b/include/linux/sunrpc/addr.h
> index 5c9c6cd..c4169bc 100644
> --- a/include/linux/sunrpc/addr.h
> +++ b/include/linux/sunrpc/addr.h
> @@ -10,6 +10,7 @@
> #include <linux/socket.h>
> #include <linux/in.h>
> #include <linux/in6.h>
> +#include <linux/vm_sockets.h>
> #include <net/ipv6.h>
>
> size_t rpc_ntop(const struct sockaddr *, char *, const size_t);
> @@ -26,6 +27,8 @@ static inline unsigned short rpc_get_port(const struct sockaddr *sap)
> return ntohs(((struct sockaddr_in *)sap)->sin_port);
> case AF_INET6:
> return ntohs(((struct sockaddr_in6 *)sap)->sin6_port);
> + case AF_VSOCK:
> + return ((struct sockaddr_vm *)sap)->svm_port;
> }
> return 0;
> }
> @@ -40,6 +43,9 @@ static inline void rpc_set_port(struct sockaddr *sap,
> case AF_INET6:
> ((struct sockaddr_in6 *)sap)->sin6_port = htons(port);
> break;
> + case AF_VSOCK:
> + ((struct sockaddr_vm *)sap)->svm_port = port;
> + break;
> }
> }
>
> @@ -106,6 +112,40 @@ static inline bool __rpc_copy_addr6(struct sockaddr *dst,
> }
> #endif /* !(IS_ENABLED(CONFIG_IPV6) */
>
> +#if IS_ENABLED(CONFIG_VSOCKETS)
> +static inline bool rpc_cmp_vsock_addr(const struct sockaddr *sap1,
> + const struct sockaddr *sap2)
> +{
> + const struct sockaddr_vm *svm1 = (const struct sockaddr_vm *)sap1;
> + const struct sockaddr_vm *svm2 = (const struct sockaddr_vm *)sap2;
> +
> + return svm1->svm_cid == svm2->svm_cid;
> +}
> +
> +static inline bool __rpc_copy_vsock_addr(struct sockaddr *dst,
> + const struct sockaddr *src)
> +{
> + const struct sockaddr_vm *ssvm = (const struct sockaddr_vm *)src;
> + struct sockaddr_vm *dsvm = (struct sockaddr_vm *)dst;
> +
> + dsvm->svm_family = ssvm->svm_family;
> + dsvm->svm_cid = ssvm->svm_cid;
> + return true;
> +}
> +#else /* !(IS_ENABLED(CONFIG_VSOCKETS) */
> +static inline bool rpc_cmp_vsock_addr(const struct sockaddr *sap1,
> + const struct sockaddr *sap2)
> +{
> + return false;
> +}
> +
> +static inline bool __rpc_copy_vsock_addr(struct sockaddr *dst,
> + const struct sockaddr *src)
> +{
> + return false;
> +}
> +#endif /* !(IS_ENABLED(CONFIG_VSOCKETS) */
> +
> /**
> * rpc_cmp_addr - compare the address portion of two sockaddrs.
> * @sap1: first sockaddr
> @@ -125,6 +165,8 @@ static inline bool rpc_cmp_addr(const struct sockaddr *sap1,
> return rpc_cmp_addr4(sap1, sap2);
> case AF_INET6:
> return rpc_cmp_addr6(sap1, sap2);
> + case AF_VSOCK:
> + return rpc_cmp_vsock_addr(sap1, sap2);
> }
> }
> return false;
> @@ -161,6 +203,8 @@ static inline bool rpc_copy_addr(struct sockaddr *dst,
> return __rpc_copy_addr4(dst, src);
> case AF_INET6:
> return __rpc_copy_addr6(dst, src);
> + case AF_VSOCK:
> + return __rpc_copy_vsock_addr(dst, src);
> }
> return false;
> }
> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
> index 04ce2c0..d18fc1a 100644
> --- a/net/sunrpc/Kconfig
> +++ b/net/sunrpc/Kconfig
> @@ -61,3 +61,13 @@ config SUNRPC_XPRT_RDMA
>
> If unsure, or you know there is no RDMA capability on your
> hardware platform, say N.
> +
> +config SUNRPC_XPRT_VSOCK
> + bool "RPC-over-AF_VSOCK transport"
> + depends on SUNRPC && VSOCKETS && !(SUNRPC=y && VSOCKETS=m)
> + default SUNRPC && VSOCKETS
> + help
> + This option allows the NFS client and server to use the AF_VSOCK
> + transport to communicate between virtual machines and the host.
> +
> + If unsure, say Y.
> diff --git a/net/sunrpc/addr.c b/net/sunrpc/addr.c
> index 2e0a6f9..f4dd962 100644
> --- a/net/sunrpc/addr.c
> +++ b/net/sunrpc/addr.c
> @@ -16,11 +16,14 @@
> * RFC 4291, Section 2.2 for details on IPv6 presentation formats.
> */
>
> + /* TODO register netid and uaddr with IANA? (See RFC 5665 5.1/5.2) */
> +
> #include <net/ipv6.h>
> #include <linux/sunrpc/addr.h>
> #include <linux/sunrpc/msg_prot.h>
> #include <linux/slab.h>
> #include <linux/export.h>
> +#include <linux/vm_sockets.h>
>
> #if IS_ENABLED(CONFIG_IPV6)
>
> @@ -108,6 +111,26 @@ static size_t rpc_ntop6(const struct sockaddr *sap,
>
> #endif /* !IS_ENABLED(CONFIG_IPV6) */
>
> +#ifdef CONFIG_SUNRPC_XPRT_VSOCK
> +
> +static size_t rpc_ntop_vsock(const struct sockaddr *sap,
> + char *buf, const size_t buflen)
> +{
> + const struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
> +
> + return snprintf(buf, buflen, "%u", svm->svm_cid);
> +}
> +
> +#else /* !CONFIG_SUNRPC_XPRT_VSOCK */
> +
> +static size_t rpc_ntop_vsock(const struct sockaddr *sap,
> + char *buf, const size_t buflen)
> +{
> + return 0;
> +}
> +
> +#endif /* !CONFIG_SUNRPC_XPRT_VSOCK */
> +
> static int rpc_ntop4(const struct sockaddr *sap,
> char *buf, const size_t buflen)
> {
> @@ -132,6 +155,8 @@ size_t rpc_ntop(const struct sockaddr *sap, char *buf, const size_t buflen)
> return rpc_ntop4(sap, buf, buflen);
> case AF_INET6:
> return rpc_ntop6(sap, buf, buflen);
> + case AF_VSOCK:
> + return rpc_ntop_vsock(sap, buf, buflen);
> }
>
> return 0;
> @@ -229,6 +254,34 @@ static size_t rpc_pton6(struct net *net, const char *buf, const size_t buflen,
> }
> #endif
>
> +#ifdef CONFIG_SUNRPC_XPRT_VSOCK
> +static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
> + struct sockaddr *sap, const size_t salen)
> +{
> + const size_t prefix_len = strlen("vsock:");
> + struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
> + unsigned int cid;
> +
> + if (strncmp(buf, "vsock:", prefix_len) != 0 ||
> + salen < sizeof(struct sockaddr_vm))
> + return 0;
> +
> + if (kstrtouint(buf + prefix_len, 10, &cid) != 0)
> + return 0;
> +
> + memset(svm, 0, sizeof(struct sockaddr_vm));
> + svm->svm_family = AF_VSOCK;
> + svm->svm_cid = cid;
> + return sizeof(struct sockaddr_vm);
> +}
> +#else
> +static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
> + struct sockaddr *sap, const size_t salen)
> +{
> + return 0;
> +}
> +#endif
> +
> /**
> * rpc_pton - Construct a sockaddr in @sap
> * @net: applicable network namespace
> @@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
> {
> unsigned int i;
>
> + /* TODO is there a nicer way to distinguish vsock addresses? */
> + if (strncmp(buf, "vsock:", 6) == 0)
> + return rpc_pton_vsock(buf, buflen, sap, salen);
> +

Ick, what if I have a host on the network named "vsock"? I think you'll
need to come up with a different way to do this.

> for (i = 0; i < buflen; i++)
> if (buf[i] == ':')
> return rpc_pton6(net, buf, buflen, sap, salen);

--
Jeff Layton <[email protected]>

2017-05-22 12:21:13

by Stefan Hajnoczi

Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

On Thu, May 18, 2017 at 10:04:24AM -0400, Jeff Layton wrote:
> On Fri, 2016-10-07 at 11:01 +0100, Stefan Hajnoczi wrote:
> > @@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
> > {
> > unsigned int i;
> >
> > + /* TODO is there a nicer way to distinguish vsock addresses? */
> > + if (strncmp(buf, "vsock:", 6) == 0)
> > + return rpc_pton_vsock(buf, buflen, sap, salen);
> > +
>
> Ick, what if I have a host on the network named "vsock"? I think you'll
> need to come up with a different way to do this.

There is no collision. This function doesn't do name resolution and no
valid IPv4/IPv6 address starts with "vsock:".
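[Editor's note: in userspace terms, the disambiguation amounts to a literal
prefix match before any address-family parsing. A minimal sketch follows; the
helper name parse_vsock_cid is illustrative (the kernel code in the patch is
rpc_pton_vsock()), and the strtoul-based validation stands in for the kernel's
kstrtouint():]

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Userspace sketch of the "vsock:<CID>" literal check performed by
 * rpc_pton() in the patch. Because rpc_pton() only ever sees address
 * literals, never hostnames, a "vsock:" prefix cannot collide with
 * IPv4 dotted-quad or IPv6 colon-hex presentation forms. */
bool parse_vsock_cid(const char *buf, unsigned int *cid)
{
    const size_t prefix_len = strlen("vsock:");
    unsigned long val;
    char *end;

    if (strncmp(buf, "vsock:", prefix_len) != 0)
        return false;

    /* Reject empty, non-numeric, or trailing-garbage CIDs. */
    errno = 0;
    val = strtoul(buf + prefix_len, &end, 10);
    if (errno != 0 || end == buf + prefix_len || *end != '\0')
        return false;

    *cid = (unsigned int)val;
    return true;
}
```

An IPv6 literal such as "::1" fails the prefix test immediately, which is why
the collision Jeff worried about cannot occur at this layer.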

I am open to suggestions for a cleaner way of doing it though :).

Stefan



2017-05-22 12:54:59

by Jeff Layton

Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

On Mon, 2017-05-22 at 13:21 +0100, Stefan Hajnoczi wrote:
> On Thu, May 18, 2017 at 10:04:24AM -0400, Jeff Layton wrote:
> > On Fri, 2016-10-07 at 11:01 +0100, Stefan Hajnoczi wrote:
> > > @@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
> > > {
> > > unsigned int i;
> > >
> > > + /* TODO is there a nicer way to distinguish vsock addresses? */
> > > + if (strncmp(buf, "vsock:", 6) == 0)
> > > + return rpc_pton_vsock(buf, buflen, sap, salen);
> > > +
> >
> > Ick, what if I have a host on the network named "vsock"? I think you'll
> > need to come up with a different way to do this.
>
> There is no collision. This function doesn't do name resolution and no
> valid IPv4/IPv6 address starts with "vsock:".
>

Doh! Of course... :)

> I am open to suggestions for a cleaner way of doing it though :).

Does lsof recognize vsock sockets? How does it format them?
--
Jeff Layton <[email protected]>

2017-05-23 13:11:07

by Stefan Hajnoczi

Subject: Re: [PATCH v2 01/10] SUNRPC: add AF_VSOCK support to addr.[ch]

On Mon, May 22, 2017 at 08:54:56AM -0400, Jeff Layton wrote:
> On Mon, 2017-05-22 at 13:21 +0100, Stefan Hajnoczi wrote:
> > On Thu, May 18, 2017 at 10:04:24AM -0400, Jeff Layton wrote:
> > > On Fri, 2016-10-07 at 11:01 +0100, Stefan Hajnoczi wrote:
> > > > @@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
> > > > {
> > > > unsigned int i;
> > > >
> > > > + /* TODO is there a nicer way to distinguish vsock addresses? */
> > > > + if (strncmp(buf, "vsock:", 6) == 0)
> > > > + return rpc_pton_vsock(buf, buflen, sap, salen);
> > > > +
> > >
> > > Ick, what if I have a host on the network named "vsock"? I think you'll
> > > need to come up with a different way to do this.
> >
> > There is no collision. This function doesn't do name resolution and no
> > valid IPv4/IPv6 address starts with "vsock:".
> >
>
> Doh! Of course... :)
>
> > I am open to suggestions for a cleaner way of doing it though :).
>
> Does lsof recognize vsock sockets? How does it format them?

lsof only prints a generic socket representation:

COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nc-vsock 20775 stefanha 3u sock 0,9 0t0 1518648 protocol: AF_VSOCK

Depending on a program's command-line syntax, addresses are usually
written as CID:PORT, or vsock:CID:PORT if the program must differentiate
between address types from the string itself.
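[Editor's note: the vsock:CID:PORT convention described above can be sketched
as a small parser. The helper name is illustrative, not from any of the tools
mentioned; real programs (QEMU, systemd) each have their own parsers:]

```c
#include <stdbool.h>
#include <stdio.h>

/* Sketch of the "vsock:CID:PORT" address-string convention. The %n
 * conversion records how many input characters were consumed, so any
 * trailing garbage after the port number can be rejected. */
bool parse_vsock_cid_port(const char *str, unsigned int *cid,
                          unsigned int *port)
{
    int consumed = 0;

    if (sscanf(str, "vsock:%u:%u%n", cid, port, &consumed) != 2)
        return false;
    return str[consumed] == '\0';
}
```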

QEMU, qemu-guest-agent, and systemd have syntax for specifying AF_VSOCK
sockets. For example:
https://github.com/systemd/systemd/blob/master/src/test/test-socket-util.c#L98

If I have time to submit lsof patches, I'll propose the following syntax
(a combination of how AF_UNIX and AF_INET TCP sockets are formatted):

COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nc-vsock 20775 stefanha 3u vsock 1520136 0t0 1520136 local=2:1234 state=LISTEN type=STREAM
nc-vsock 20775 stefanha 4u vsock 1520138 0t0 1520138 local=2:1234 remote=3:51213 state=CONNECTED type=STREAM

Stefan

