2015-06-04 16:45:56

by Stefan Hajnoczi

Subject: [RFC 00/10] NFS: add AF_VSOCK support to NFS client

This patch series enables AF_VSOCK address family support in the NFS client.
Please use the https://github.com/stefanha/linux.git vsock-nfs branch, which
contains the dependencies for this series.

The AF_VSOCK address family provides datagram and stream socket communication
between virtual machines and hypervisors. A VMware VMCI transport is currently
available in-tree (see net/vmw_vsock) and I have posted virtio-vsock patches
for use with QEMU/KVM: http://thread.gmane.org/gmane.linux.network/365205

The goal of this work is sharing files between virtual machines and
hypervisors. AF_VSOCK is well-suited to this because it requires no
configuration inside the virtual machine, making it simple to manage and
reliable.

Why NFS over AF_VSOCK?
----------------------
It is unusual to add a new NFS transport; only TCP, RDMA, and UDP are currently
supported. Here is the rationale for adding AF_VSOCK.

Sharing files with a virtual machine can be configured manually:
1. Add a dedicated network card to the virtual machine. It will be used for
NFS traffic.
2. Configure a local subnet and assign IP addresses to the virtual machine
and hypervisor.
3. Configure an NFS export on the hypervisor and start the NFS server.
4. Mount the export inside the virtual machine.

Automating these steps poses a problem: modifying network configuration inside
the virtual machine is invasive. It is hard to add a network interface to an
arbitrary running system in an automated fashion, given the variety of network
management tools, firewall rules, IP address usage, etc.

Furthermore, the user may disrupt file sharing by accident when they add
firewall rules, restart networking, etc., because the NFS network interface is
visible alongside the network interfaces managed by the user.

AF_VSOCK is a zero-configuration network transport that avoids these problems.
Adding it to a virtual machine is non-invasive. It also avoids accidental
misconfiguration by the user. This is why "guest agents" and other services in
various hypervisors (KVM, Xen, VMware, VirtualBox) do not use regular network
interfaces.

For the same reasons, AF_VSOCK is appropriate for providing shared files as a
hypervisor service.

The approach in this series
---------------------------
AF_VSOCK stream sockets can be used for NFSv4.1 in much the same way as TCP.
Since SOCK_STREAM semantics are present, messages are delimited with RFC 1831
record fragments. The backchannel shares the connection, just like the default
TCP configuration.
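
For reference, here is a minimal userspace sketch of RFC 1831 record
marking (the macro names mirror the kernel's, but the snippet is
illustrative and not taken from these patches):

  #include <stdint.h>
  #include <arpa/inet.h>

  #define RPC_LAST_STREAM_FRAGMENT (1U << 31)
  #define RPC_FRAGMENT_SIZE_MASK   (~RPC_LAST_STREAM_FRAGMENT)

  struct frag_info {
          uint32_t len;   /* fragment payload length in bytes */
          int last;       /* non-zero for the final fragment */
  };

  /* Decode a 4-byte fragment header received in network byte order. */
  static struct frag_info parse_frag_hdr(uint32_t raw)
  {
          uint32_t hdr = ntohl(raw);
          struct frag_info fi = {
                  .len  = hdr & RPC_FRAGMENT_SIZE_MASK,
                  .last = !!(hdr & RPC_LAST_STREAM_FRAGMENT),
          };
          return fi;
  }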

Addresses are <Context ID, Port Number> pairs. These patches use the
"vsock:<cid>" string representation to distinguish AF_VSOCK addresses from
IPv4 and IPv6 numeric addresses.
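
The following userspace sketch shows how such a string maps onto
struct sockaddr_vm (fields per <linux/vm_sockets.h>; the helper name is
made up for illustration, and the port is supplied separately, e.g. the
NFS default 2049):

  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>   /* struct sockaddr_vm, AF_VSOCK */

  static int vsock_str_to_addr(const char *str, unsigned int port,
                               struct sockaddr_vm *svm)
  {
          const char *prefix = "vsock:";

          if (strncmp(str, prefix, strlen(prefix)) != 0)
                  return -1;

          memset(svm, 0, sizeof(*svm));
          svm->svm_family = AF_VSOCK;
          svm->svm_cid = strtoul(str + strlen(prefix), NULL, 10);
          svm->svm_port = port;   /* host byte order, unlike sin_port */
          return 0;
  }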

The patches cover the following areas:

Patch 1 - support struct sockaddr_vm in sunrpc addr.h

Patches 2-4 - make sunrpc TCP record fragment parser reusable for any stream
socket

Patch 5 - add tcp_read_sock()-like interface to AF_VSOCK sockets

Patch 6 - extend sunrpc xprtsock.c for AF_VSOCK RPC clients

Patches 7-9 - AF_VSOCK backchannel support

Patch 10 - add AF_VSOCK support to NFS client

The following example mounts /export from the hypervisor (CID 2) inside the
virtual machine (CID 3):

# /sbin/mount.nfs 2:/export /mnt -o clientaddr=3,proto=vsock

Status
------
I am looking for feedback on this approach. There are TODOs remaining in the code.

Hopefully the way I add AF_VSOCK support to sunrpc is reasonable and something
that can be standardized (a netid assigned and the uaddr string format decided).
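
For comparison, RFC 5665 registers netids such as "tcp" with uaddrs of
the form h1.h2.h3.h4.p1.p2. A hypothetical vsock equivalent could be a
"vsock" netid with a "<cid>.<port-hi>.<port-lo>" uaddr, e.g. "2.8.1"
for CID 2, port 2049 (8 * 256 + 1 = 2049). That format is purely
illustrative; deciding it is part of the standardization work mentioned
above.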

See below for the nfs-utils patch. It can be cleaned up once glibc's
getnameinfo()/getaddrinfo() support AF_VSOCK.

The vsock_read_sock() implementation is simplistic. This is less an NFS/SUNRPC
issue and more a vsock issue, but perhaps virtio_transport.c should use skbs
for its receive queue instead of a custom packet struct. That would eliminate
the extra memory allocation and copying in vsock_read_sock().
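
For context, the read_sock interface consumed by the generic stream
parser (the function pointer introduced in patch 4) has the same shape
as tcp_read_sock(). Below is a skeleton of what an implementation must
do; it is not the actual vsock_read_sock() from patch 5:

  /* Feed queued receive data to the record parser's actor callback. */
  int vsock_read_sock(struct sock *sk, read_descriptor_t *desc,
                      sk_read_actor_t recv_actor)
  {
          int copied = 0;

          /* For each chunk of queued receive data:
           *  1. wrap it in an skb (virtio_transport.c currently
           *     allocates and copies here, hence the note above),
           *  2. call recv_actor(desc, skb, offset, len) and advance
           *     by the number of bytes the actor consumed,
           *  3. stop when desc->count reaches zero or the actor
           *     consumes less than it was offered.
           */
          return copied;
  }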

The next step is tackling the NFS server. In the meantime, I have tested the
patches using the nc-vsock netcat-like utility that is available in my Linux
kernel repo below.

Repositories
------------
* Linux kernel: https://github.com/stefanha/linux.git vsock-nfs
* QEMU virtio-vsock device: https://github.com/stefanha/qemu.git vsock
* nfs-utils vsock: https://github.com/stefanha/nfs-utils.git vsock

Stefan Hajnoczi (10):
SUNRPC: add AF_VSOCK support to addr.h
SUNRPC: rename "TCP" record parser to "stream" parser
SUNRPC: abstract tcp_read_sock() in record fragment parser
SUNRPC: extract xs_stream_reset_state()
VSOCK: add tcp_read_sock()-like vsock_read_sock() function
SUNRPC: add AF_VSOCK support to xprtsock.c
SUNRPC: restrict backchannel svc IPPROTO_TCP check to IP
SUNRPC: add vsock-bc backchannel
SUNRPC: add AF_VSOCK support to svc_xprt.c
NFS: add AF_VSOCK support to NFS client

drivers/vhost/vsock.c | 1 +
fs/nfs/callback.c | 7 +-
fs/nfs/client.c | 16 +
fs/nfs/super.c | 10 +
include/linux/sunrpc/addr.h | 6 +
include/linux/sunrpc/svc_xprt.h | 12 +
include/linux/sunrpc/xprt.h | 1 +
include/linux/sunrpc/xprtsock.h | 37 +-
include/linux/virtio_vsock.h | 4 +
include/net/af_vsock.h | 5 +
include/trace/events/sunrpc.h | 30 +-
net/sunrpc/addr.c | 57 +++
net/sunrpc/svc.c | 13 +-
net/sunrpc/svc_xprt.c | 13 +
net/sunrpc/svcsock.c | 48 ++-
net/sunrpc/xprtsock.c | 693 +++++++++++++++++++++++++-------
net/vmw_vsock/af_vsock.c | 15 +
net/vmw_vsock/virtio_transport.c | 1 +
net/vmw_vsock/virtio_transport_common.c | 55 +++
net/vmw_vsock/vmci_transport.c | 8 +
20 files changed, 825 insertions(+), 207 deletions(-)

--
2.4.2



2015-06-04 16:45:58

by Stefan Hajnoczi

Subject: [RFC 01/10] SUNRPC: add AF_VSOCK support to addr.h

AF_VSOCK addresses are a Context ID (CID) and port number tuple. The
CID is a unique address, similar to an IP address on a local subnet.

Extend the addr.h functions to handle AF_VSOCK addresses.
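
With this patch applied, the helpers round-trip a vsock address as
follows (illustrative usage; "net" is the caller's network namespace
and the buffer size is arbitrary):

  struct sockaddr_vm svm;
  char buf[32];

  rpc_pton(net, "vsock:2", 7, (struct sockaddr *)&svm, sizeof(svm));
  /* svm.svm_family == AF_VSOCK, svm.svm_cid == 2 */

  rpc_set_port((struct sockaddr *)&svm, 2049);
  /* svm.svm_port == 2049, stored in host byte order */

  rpc_ntop((struct sockaddr *)&svm, buf, sizeof(buf));
  /* buf == "2": only the CID appears in the presentation string */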

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/addr.h | 6 +++++
net/sunrpc/addr.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 63 insertions(+)

diff --git a/include/linux/sunrpc/addr.h b/include/linux/sunrpc/addr.h
index 07d8e53..b6530de 100644
--- a/include/linux/sunrpc/addr.h
+++ b/include/linux/sunrpc/addr.h
@@ -10,6 +10,7 @@
#include <linux/socket.h>
#include <linux/in.h>
#include <linux/in6.h>
+#include <linux/vm_sockets.h>
#include <net/ipv6.h>

size_t rpc_ntop(const struct sockaddr *, char *, const size_t);
@@ -26,6 +27,8 @@ static inline unsigned short rpc_get_port(const struct sockaddr *sap)
return ntohs(((struct sockaddr_in *)sap)->sin_port);
case AF_INET6:
return ntohs(((struct sockaddr_in6 *)sap)->sin6_port);
+ case AF_VSOCK:
+ return ((struct sockaddr_vm *)sap)->svm_port;
}
return 0;
}
@@ -40,6 +43,9 @@ static inline void rpc_set_port(struct sockaddr *sap,
case AF_INET6:
((struct sockaddr_in6 *)sap)->sin6_port = htons(port);
break;
+ case AF_VSOCK:
+ ((struct sockaddr_vm *)sap)->svm_port = port;
+ break;
}
}

diff --git a/net/sunrpc/addr.c b/net/sunrpc/addr.c
index 2e0a6f9..fe6eb31 100644
--- a/net/sunrpc/addr.c
+++ b/net/sunrpc/addr.c
@@ -16,11 +16,14 @@
* RFC 4291, Section 2.2 for details on IPv6 presentation formats.
*/

+ /* TODO register netid and uaddr with IANA? (See RFC 5665 5.1/5.2) */
+
#include <net/ipv6.h>
#include <linux/sunrpc/addr.h>
#include <linux/sunrpc/msg_prot.h>
#include <linux/slab.h>
#include <linux/export.h>
+#include <linux/vm_sockets.h>

#if IS_ENABLED(CONFIG_IPV6)

@@ -108,6 +111,26 @@ static size_t rpc_ntop6(const struct sockaddr *sap,

#endif /* !IS_ENABLED(CONFIG_IPV6) */

+#if IS_ENABLED(CONFIG_VSOCKETS)
+
+static size_t rpc_ntop_vsock(const struct sockaddr *sap,
+ char *buf, const size_t buflen)
+{
+ const struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
+
+ return snprintf(buf, buflen, "%u", svm->svm_cid);
+}
+
+#else /* !IS_ENABLED(CONFIG_VSOCKETS) */
+
+static size_t rpc_ntop_vsock(const struct sockaddr *sap,
+ char *buf, const size_t buflen)
+{
+ return 0;
+}
+
+#endif /* !IS_ENABLED(CONFIG_VSOCKETS) */
+
static int rpc_ntop4(const struct sockaddr *sap,
char *buf, const size_t buflen)
{
@@ -132,6 +155,8 @@ size_t rpc_ntop(const struct sockaddr *sap, char *buf, const size_t buflen)
return rpc_ntop4(sap, buf, buflen);
case AF_INET6:
return rpc_ntop6(sap, buf, buflen);
+ case AF_VSOCK:
+ return rpc_ntop_vsock(sap, buf, buflen);
}

return 0;
@@ -229,6 +254,34 @@ static size_t rpc_pton6(struct net *net, const char *buf, const size_t buflen,
}
#endif

+#if IS_ENABLED(CONFIG_VSOCKETS)
+static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
+ struct sockaddr *sap, const size_t salen)
+{
+ const size_t prefix_len = strlen("vsock:");
+ struct sockaddr_vm *svm = (struct sockaddr_vm *)sap;
+ unsigned int cid;
+
+ if (strncmp(buf, "vsock:", prefix_len) != 0 ||
+ salen < sizeof(struct sockaddr_vm))
+ return 0;
+
+ if (kstrtouint(buf + prefix_len, 10, &cid) != 0)
+ return 0;
+
+ memset(svm, 0, sizeof(struct sockaddr_vm));
+ svm->svm_family = AF_VSOCK;
+ svm->svm_cid = cid;
+ return sizeof(struct sockaddr_vm);
+}
+#else
+static size_t rpc_pton_vsock(const char *buf, const size_t buflen,
+ struct sockaddr *sap, const size_t salen)
+{
+ return 0;
+}
+#endif
+
/**
* rpc_pton - Construct a sockaddr in @sap
* @net: applicable network namespace
@@ -249,6 +302,10 @@ size_t rpc_pton(struct net *net, const char *buf, const size_t buflen,
{
unsigned int i;

+ /* TODO is there a nicer way to distinguish vsock addresses? */
+ if (strncmp(buf, "vsock:", 6) == 0)
+ return rpc_pton_vsock(buf, buflen, sap, salen);
+
for (i = 0; i < buflen; i++)
if (buf[i] == ':')
return rpc_pton6(net, buf, buflen, sap, salen);
--
2.4.2


2015-06-04 16:46:01

by Stefan Hajnoczi

Subject: [RFC 02/10] SUNRPC: rename "TCP" record parser to "stream" parser

The TCP record parser is really an RFC 1831 record fragment parser.
There is nothing TCP protocol-specific about parsing record fragments.
The parser can be reused for any SOCK_STREAM socket.

This patch renames functions and fields but xs_stream_data_ready() still
calls tcp_read_sock(). This is addressed in the next patch.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/xprtsock.h | 31 ++---
include/trace/events/sunrpc.h | 30 ++---
net/sunrpc/xprtsock.c | 276 ++++++++++++++++++++--------------------
3 files changed, 169 insertions(+), 168 deletions(-)

diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index 7591788..e3de7bf 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -27,17 +27,18 @@ struct sock_xprt {
struct sock * inet;

/*
- * State of TCP reply receive
+ * State of SOCK_STREAM reply receive
*/
- __be32 tcp_fraghdr,
- tcp_xid,
- tcp_calldir;
+ __be32 stream_fraghdr,
+ stream_xid,
+ stream_calldir;

- u32 tcp_offset,
- tcp_reclen;
+ u32 stream_offset,
+ stream_reclen;
+
+ unsigned long stream_copied,
+ stream_flags;

- unsigned long tcp_copied,
- tcp_flags;

/*
* Connection of transports
@@ -64,17 +65,17 @@ struct sock_xprt {
/*
* TCP receive state flags
*/
-#define TCP_RCV_LAST_FRAG (1UL << 0)
-#define TCP_RCV_COPY_FRAGHDR (1UL << 1)
-#define TCP_RCV_COPY_XID (1UL << 2)
-#define TCP_RCV_COPY_DATA (1UL << 3)
-#define TCP_RCV_READ_CALLDIR (1UL << 4)
-#define TCP_RCV_COPY_CALLDIR (1UL << 5)
+#define STREAM_RCV_LAST_FRAG (1UL << 0)
+#define STREAM_RCV_COPY_FRAGHDR (1UL << 1)
+#define STREAM_RCV_COPY_XID (1UL << 2)
+#define STREAM_RCV_COPY_DATA (1UL << 3)
+#define STREAM_RCV_READ_CALLDIR (1UL << 4)
+#define STREAM_RCV_COPY_CALLDIR (1UL << 5)

/*
* TCP RPC flags
*/
-#define TCP_RPC_REPLY (1UL << 6)
+#define STREAM_RPC_REPLY (1UL << 6)

#endif /* __KERNEL__ */

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index fd1a02c..a17a533 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -371,7 +371,7 @@ DEFINE_EVENT(rpc_xprt_event, xprt_complete_rqst,
TP_PROTO(struct rpc_xprt *xprt, __be32 xid, int status),
TP_ARGS(xprt, xid, status));

-TRACE_EVENT(xs_tcp_data_ready,
+TRACE_EVENT(xs_stream_data_ready,
TP_PROTO(struct rpc_xprt *xprt, int err, unsigned int total),

TP_ARGS(xprt, err, total),
@@ -400,15 +400,15 @@ TRACE_EVENT(xs_tcp_data_ready,

#define rpc_show_sock_xprt_flags(flags) \
__print_flags(flags, "|", \
- { TCP_RCV_LAST_FRAG, "TCP_RCV_LAST_FRAG" }, \
- { TCP_RCV_COPY_FRAGHDR, "TCP_RCV_COPY_FRAGHDR" }, \
- { TCP_RCV_COPY_XID, "TCP_RCV_COPY_XID" }, \
- { TCP_RCV_COPY_DATA, "TCP_RCV_COPY_DATA" }, \
- { TCP_RCV_READ_CALLDIR, "TCP_RCV_READ_CALLDIR" }, \
- { TCP_RCV_COPY_CALLDIR, "TCP_RCV_COPY_CALLDIR" }, \
- { TCP_RPC_REPLY, "TCP_RPC_REPLY" })
-
-TRACE_EVENT(xs_tcp_data_recv,
+ { STREAM_RCV_LAST_FRAG, "STREAM_RCV_LAST_FRAG" }, \
+ { STREAM_RCV_COPY_FRAGHDR, "STREAM_RCV_COPY_FRAGHDR" }, \
+ { STREAM_RCV_COPY_XID, "STREAM_RCV_COPY_XID" }, \
+ { STREAM_RCV_COPY_DATA, "STREAM_RCV_COPY_DATA" }, \
+ { STREAM_RCV_READ_CALLDIR, "STREAM_RCV_READ_CALLDIR" }, \
+ { STREAM_RCV_COPY_CALLDIR, "STREAM_RCV_COPY_CALLDIR" }, \
+ { STREAM_RPC_REPLY, "STREAM_RPC_REPLY" })
+
+TRACE_EVENT(xs_stream_data_recv,
TP_PROTO(struct sock_xprt *xs),

TP_ARGS(xs),
@@ -426,11 +426,11 @@ TRACE_EVENT(xs_tcp_data_recv,
TP_fast_assign(
__assign_str(addr, xs->xprt.address_strings[RPC_DISPLAY_ADDR]);
__assign_str(port, xs->xprt.address_strings[RPC_DISPLAY_PORT]);
- __entry->xid = xs->tcp_xid;
- __entry->flags = xs->tcp_flags;
- __entry->copied = xs->tcp_copied;
- __entry->reclen = xs->tcp_reclen;
- __entry->offset = xs->tcp_offset;
+ __entry->xid = xs->stream_xid;
+ __entry->flags = xs->stream_flags;
+ __entry->copied = xs->stream_copied;
+ __entry->reclen = xs->stream_reclen;
+ __entry->offset = xs->stream_offset;
),

TP_printk("peer=[%s]:%s xid=0x%x flags=%s copied=%lu reclen=%u offset=%lu",
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a8d6c6e..c84d45e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1036,119 +1036,119 @@ static void xs_tcp_force_close(struct rpc_xprt *xprt)
xprt_force_disconnect(xprt);
}

-static inline void xs_tcp_read_fraghdr(struct rpc_xprt *xprt, struct xdr_skb_reader *desc)
+static inline void xs_stream_read_fraghdr(struct rpc_xprt *xprt, struct xdr_skb_reader *desc)
{
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
size_t len, used;
char *p;

- p = ((char *) &transport->tcp_fraghdr) + transport->tcp_offset;
- len = sizeof(transport->tcp_fraghdr) - transport->tcp_offset;
+ p = ((char *) &transport->stream_fraghdr) + transport->stream_offset;
+ len = sizeof(transport->stream_fraghdr) - transport->stream_offset;
used = xdr_skb_read_bits(desc, p, len);
- transport->tcp_offset += used;
+ transport->stream_offset += used;
if (used != len)
return;

- transport->tcp_reclen = ntohl(transport->tcp_fraghdr);
- if (transport->tcp_reclen & RPC_LAST_STREAM_FRAGMENT)
- transport->tcp_flags |= TCP_RCV_LAST_FRAG;
+ transport->stream_reclen = ntohl(transport->stream_fraghdr);
+ if (transport->stream_reclen & RPC_LAST_STREAM_FRAGMENT)
+ transport->stream_flags |= STREAM_RCV_LAST_FRAG;
else
- transport->tcp_flags &= ~TCP_RCV_LAST_FRAG;
- transport->tcp_reclen &= RPC_FRAGMENT_SIZE_MASK;
+ transport->stream_flags &= ~STREAM_RCV_LAST_FRAG;
+ transport->stream_reclen &= RPC_FRAGMENT_SIZE_MASK;

- transport->tcp_flags &= ~TCP_RCV_COPY_FRAGHDR;
- transport->tcp_offset = 0;
+ transport->stream_flags &= ~STREAM_RCV_COPY_FRAGHDR;
+ transport->stream_offset = 0;

/* Sanity check of the record length */
- if (unlikely(transport->tcp_reclen < 8)) {
- dprintk("RPC: invalid TCP record fragment length\n");
+ if (unlikely(transport->stream_reclen < 8)) {
+ dprintk("RPC: invalid record fragment length\n");
xs_tcp_force_close(xprt);
return;
}
- dprintk("RPC: reading TCP record fragment of length %d\n",
- transport->tcp_reclen);
+ dprintk("RPC: reading record fragment of length %d\n",
+ transport->stream_reclen);
}

-static void xs_tcp_check_fraghdr(struct sock_xprt *transport)
+static void xs_stream_check_fraghdr(struct sock_xprt *transport)
{
- if (transport->tcp_offset == transport->tcp_reclen) {
- transport->tcp_flags |= TCP_RCV_COPY_FRAGHDR;
- transport->tcp_offset = 0;
- if (transport->tcp_flags & TCP_RCV_LAST_FRAG) {
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
- transport->tcp_flags |= TCP_RCV_COPY_XID;
- transport->tcp_copied = 0;
+ if (transport->stream_offset == transport->stream_reclen) {
+ transport->stream_flags |= STREAM_RCV_COPY_FRAGHDR;
+ transport->stream_offset = 0;
+ if (transport->stream_flags & STREAM_RCV_LAST_FRAG) {
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
+ transport->stream_flags |= STREAM_RCV_COPY_XID;
+ transport->stream_copied = 0;
}
}
}

-static inline void xs_tcp_read_xid(struct sock_xprt *transport, struct xdr_skb_reader *desc)
+static inline void xs_stream_read_xid(struct sock_xprt *transport, struct xdr_skb_reader *desc)
{
size_t len, used;
char *p;

- len = sizeof(transport->tcp_xid) - transport->tcp_offset;
+ len = sizeof(transport->stream_xid) - transport->stream_offset;
dprintk("RPC: reading XID (%Zu bytes)\n", len);
- p = ((char *) &transport->tcp_xid) + transport->tcp_offset;
+ p = ((char *) &transport->stream_xid) + transport->stream_offset;
used = xdr_skb_read_bits(desc, p, len);
- transport->tcp_offset += used;
+ transport->stream_offset += used;
if (used != len)
return;
- transport->tcp_flags &= ~TCP_RCV_COPY_XID;
- transport->tcp_flags |= TCP_RCV_READ_CALLDIR;
- transport->tcp_copied = 4;
+ transport->stream_flags &= ~STREAM_RCV_COPY_XID;
+ transport->stream_flags |= STREAM_RCV_READ_CALLDIR;
+ transport->stream_copied = 4;
dprintk("RPC: reading %s XID %08x\n",
- (transport->tcp_flags & TCP_RPC_REPLY) ? "reply for"
+ (transport->stream_flags & STREAM_RPC_REPLY) ? "reply for"
: "request with",
- ntohl(transport->tcp_xid));
- xs_tcp_check_fraghdr(transport);
+ ntohl(transport->stream_xid));
+ xs_stream_check_fraghdr(transport);
}

-static inline void xs_tcp_read_calldir(struct sock_xprt *transport,
- struct xdr_skb_reader *desc)
+static inline void xs_stream_read_calldir(struct sock_xprt *transport,
+ struct xdr_skb_reader *desc)
{
size_t len, used;
u32 offset;
char *p;

/*
- * We want transport->tcp_offset to be 8 at the end of this routine
+ * We want transport->stream_offset to be 8 at the end of this routine
* (4 bytes for the xid and 4 bytes for the call/reply flag).
* When this function is called for the first time,
- * transport->tcp_offset is 4 (after having already read the xid).
+ * transport->stream_offset is 4 (after having already read the xid).
*/
- offset = transport->tcp_offset - sizeof(transport->tcp_xid);
- len = sizeof(transport->tcp_calldir) - offset;
+ offset = transport->stream_offset - sizeof(transport->stream_xid);
+ len = sizeof(transport->stream_calldir) - offset;
dprintk("RPC: reading CALL/REPLY flag (%Zu bytes)\n", len);
- p = ((char *) &transport->tcp_calldir) + offset;
+ p = ((char *) &transport->stream_calldir) + offset;
used = xdr_skb_read_bits(desc, p, len);
- transport->tcp_offset += used;
+ transport->stream_offset += used;
if (used != len)
return;
- transport->tcp_flags &= ~TCP_RCV_READ_CALLDIR;
+ transport->stream_flags &= ~STREAM_RCV_READ_CALLDIR;
/*
* We don't yet have the XDR buffer, so we will write the calldir
* out after we get the buffer from the 'struct rpc_rqst'
*/
- switch (ntohl(transport->tcp_calldir)) {
+ switch (ntohl(transport->stream_calldir)) {
case RPC_REPLY:
- transport->tcp_flags |= TCP_RCV_COPY_CALLDIR;
- transport->tcp_flags |= TCP_RCV_COPY_DATA;
- transport->tcp_flags |= TCP_RPC_REPLY;
+ transport->stream_flags |= STREAM_RCV_COPY_CALLDIR;
+ transport->stream_flags |= STREAM_RCV_COPY_DATA;
+ transport->stream_flags |= STREAM_RPC_REPLY;
break;
case RPC_CALL:
- transport->tcp_flags |= TCP_RCV_COPY_CALLDIR;
- transport->tcp_flags |= TCP_RCV_COPY_DATA;
- transport->tcp_flags &= ~TCP_RPC_REPLY;
+ transport->stream_flags |= STREAM_RCV_COPY_CALLDIR;
+ transport->stream_flags |= STREAM_RCV_COPY_DATA;
+ transport->stream_flags &= ~STREAM_RPC_REPLY;
break;
default:
dprintk("RPC: invalid request message type\n");
xs_tcp_force_close(&transport->xprt);
}
- xs_tcp_check_fraghdr(transport);
+ xs_stream_check_fraghdr(transport);
}

-static inline void xs_tcp_read_common(struct rpc_xprt *xprt,
+static inline void xs_stream_read_common(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc,
struct rpc_rqst *req)
{
@@ -1160,97 +1160,97 @@ static inline void xs_tcp_read_common(struct rpc_xprt *xprt,

rcvbuf = &req->rq_private_buf;

- if (transport->tcp_flags & TCP_RCV_COPY_CALLDIR) {
+ if (transport->stream_flags & STREAM_RCV_COPY_CALLDIR) {
/*
* Save the RPC direction in the XDR buffer
*/
- memcpy(rcvbuf->head[0].iov_base + transport->tcp_copied,
- &transport->tcp_calldir,
- sizeof(transport->tcp_calldir));
- transport->tcp_copied += sizeof(transport->tcp_calldir);
- transport->tcp_flags &= ~TCP_RCV_COPY_CALLDIR;
+ memcpy(rcvbuf->head[0].iov_base + transport->stream_copied,
+ &transport->stream_calldir,
+ sizeof(transport->stream_calldir));
+ transport->stream_copied += sizeof(transport->stream_calldir);
+ transport->stream_flags &= ~STREAM_RCV_COPY_CALLDIR;
}

len = desc->count;
- if (len > transport->tcp_reclen - transport->tcp_offset) {
+ if (len > transport->stream_reclen - transport->stream_offset) {
struct xdr_skb_reader my_desc;

- len = transport->tcp_reclen - transport->tcp_offset;
+ len = transport->stream_reclen - transport->stream_offset;
memcpy(&my_desc, desc, sizeof(my_desc));
my_desc.count = len;
- r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->stream_copied,
&my_desc, xdr_skb_read_bits);
desc->count -= r;
desc->offset += r;
} else
- r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->stream_copied,
desc, xdr_skb_read_bits);

if (r > 0) {
- transport->tcp_copied += r;
- transport->tcp_offset += r;
+ transport->stream_copied += r;
+ transport->stream_offset += r;
}
if (r != len) {
/* Error when copying to the receive buffer,
* usually because we weren't able to allocate
* additional buffer pages. All we can do now
- * is turn off TCP_RCV_COPY_DATA, so the request
+ * is turn off STREAM_RCV_COPY_DATA, so the request
* will not receive any additional updates,
* and time out.
* Any remaining data from this record will
* be discarded.
*/
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
dprintk("RPC: XID %08x truncated request\n",
- ntohl(transport->tcp_xid));
- dprintk("RPC: xprt = %p, tcp_copied = %lu, "
- "tcp_offset = %u, tcp_reclen = %u\n",
- xprt, transport->tcp_copied,
- transport->tcp_offset, transport->tcp_reclen);
+ ntohl(transport->stream_xid));
+ dprintk("RPC: xprt = %p, stream_copied = %lu, "
+ "stream_offset = %u, stream_reclen = %u\n",
+ xprt, transport->stream_copied,
+ transport->stream_offset, transport->stream_reclen);
return;
}

dprintk("RPC: XID %08x read %Zd bytes\n",
- ntohl(transport->tcp_xid), r);
- dprintk("RPC: xprt = %p, tcp_copied = %lu, tcp_offset = %u, "
- "tcp_reclen = %u\n", xprt, transport->tcp_copied,
- transport->tcp_offset, transport->tcp_reclen);
-
- if (transport->tcp_copied == req->rq_private_buf.buflen)
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
- else if (transport->tcp_offset == transport->tcp_reclen) {
- if (transport->tcp_flags & TCP_RCV_LAST_FRAG)
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ ntohl(transport->stream_xid), r);
+ dprintk("RPC: xprt = %p, stream_copied = %lu, stream_offset = %u, "
+ "stream_reclen = %u\n", xprt, transport->stream_copied,
+ transport->stream_offset, transport->stream_reclen);
+
+ if (transport->stream_copied == req->rq_private_buf.buflen)
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
+ else if (transport->stream_offset == transport->stream_reclen) {
+ if (transport->stream_flags & STREAM_RCV_LAST_FRAG)
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
}
}

/*
* Finds the request corresponding to the RPC xid and invokes the common
- * tcp read code to read the data.
+ * read code to read the data.
*/
-static inline int xs_tcp_read_reply(struct rpc_xprt *xprt,
+static inline int xs_stream_read_reply(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
container_of(xprt, struct sock_xprt, xprt);
struct rpc_rqst *req;

- dprintk("RPC: read reply XID %08x\n", ntohl(transport->tcp_xid));
+ dprintk("RPC: read reply XID %08x\n", ntohl(transport->stream_xid));

/* Find and lock the request corresponding to this xid */
spin_lock(&xprt->transport_lock);
- req = xprt_lookup_rqst(xprt, transport->tcp_xid);
+ req = xprt_lookup_rqst(xprt, transport->stream_xid);
if (!req) {
dprintk("RPC: XID %08x request not found!\n",
- ntohl(transport->tcp_xid));
+ ntohl(transport->stream_xid));
spin_unlock(&xprt->transport_lock);
return -1;
}

- xs_tcp_read_common(xprt, desc, req);
+ xs_stream_read_common(xprt, desc, req);

- if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
- xprt_complete_rqst(req->rq_task, transport->tcp_copied);
+ if (!(transport->stream_flags & STREAM_RCV_COPY_DATA))
+ xprt_complete_rqst(req->rq_task, transport->stream_copied);

spin_unlock(&xprt->transport_lock);
return 0;
@@ -1264,7 +1264,7 @@ static inline int xs_tcp_read_reply(struct rpc_xprt *xprt,
* If we're unable to obtain the rpc_rqst we schedule the closing of the
* connection and return -1.
*/
-static int xs_tcp_read_callback(struct rpc_xprt *xprt,
+static int xs_stream_read_callback(struct rpc_xprt *xprt,
struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
@@ -1273,7 +1273,7 @@ static int xs_tcp_read_callback(struct rpc_xprt *xprt,

/* Look up and lock the request corresponding to the given XID */
spin_lock(&xprt->transport_lock);
- req = xprt_lookup_bc_request(xprt, transport->tcp_xid);
+ req = xprt_lookup_bc_request(xprt, transport->stream_xid);
if (req == NULL) {
spin_unlock(&xprt->transport_lock);
printk(KERN_WARNING "Callback slot table overflowed\n");
@@ -1282,30 +1282,30 @@ static int xs_tcp_read_callback(struct rpc_xprt *xprt,
}

dprintk("RPC: read callback XID %08x\n", ntohl(req->rq_xid));
- xs_tcp_read_common(xprt, desc, req);
+ xs_stream_read_common(xprt, desc, req);

- if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
- xprt_complete_bc_request(req, transport->tcp_copied);
+ if (!(transport->stream_flags & STREAM_RCV_COPY_DATA))
+ xprt_complete_bc_request(req, transport->stream_copied);
spin_unlock(&xprt->transport_lock);

return 0;
}

-static inline int _xs_tcp_read_data(struct rpc_xprt *xprt,
- struct xdr_skb_reader *desc)
+static inline int _xs_stream_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
container_of(xprt, struct sock_xprt, xprt);

- return (transport->tcp_flags & TCP_RPC_REPLY) ?
- xs_tcp_read_reply(xprt, desc) :
- xs_tcp_read_callback(xprt, desc);
+ return (transport->stream_flags & STREAM_RPC_REPLY) ?
+ xs_stream_read_reply(xprt, desc) :
+ xs_stream_read_callback(xprt, desc);
}
#else
-static inline int _xs_tcp_read_data(struct rpc_xprt *xprt,
- struct xdr_skb_reader *desc)
+static inline int _xs_stream_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
{
- return xs_tcp_read_reply(xprt, desc);
+ return xs_stream_read_reply(xprt, desc);
}
#endif /* CONFIG_SUNRPC_BACKCHANNEL */

@@ -1313,38 +1313,38 @@ static inline int _xs_tcp_read_data(struct rpc_xprt *xprt,
* Read data off the transport. This can be either an RPC_CALL or an
* RPC_REPLY. Relay the processing to helper functions.
*/
-static void xs_tcp_read_data(struct rpc_xprt *xprt,
- struct xdr_skb_reader *desc)
+static void xs_stream_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
{
struct sock_xprt *transport =
container_of(xprt, struct sock_xprt, xprt);

- if (_xs_tcp_read_data(xprt, desc) == 0)
- xs_tcp_check_fraghdr(transport);
+ if (_xs_stream_read_data(xprt, desc) == 0)
+ xs_stream_check_fraghdr(transport);
else {
/*
* The transport_lock protects the request handling.
- * There's no need to hold it to update the tcp_flags.
+ * There's no need to hold it to update the stream_flags.
*/
- transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ transport->stream_flags &= ~STREAM_RCV_COPY_DATA;
}
}

-static inline void xs_tcp_read_discard(struct sock_xprt *transport, struct xdr_skb_reader *desc)
+static inline void xs_stream_read_discard(struct sock_xprt *transport, struct xdr_skb_reader *desc)
{
size_t len;

- len = transport->tcp_reclen - transport->tcp_offset;
+ len = transport->stream_reclen - transport->stream_offset;
if (len > desc->count)
len = desc->count;
desc->count -= len;
desc->offset += len;
- transport->tcp_offset += len;
+ transport->stream_offset += len;
dprintk("RPC: discarded %Zu bytes\n", len);
- xs_tcp_check_fraghdr(transport);
+ xs_stream_check_fraghdr(transport);
}

-static int xs_tcp_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len)
+static int xs_stream_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len)
{
struct rpc_xprt *xprt = rd_desc->arg.data;
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
@@ -1354,52 +1354,52 @@ static int xs_tcp_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, uns
.count = len,
};

- dprintk("RPC: xs_tcp_data_recv started\n");
+ dprintk("RPC: %s started\n", __func__);
do {
- trace_xs_tcp_data_recv(transport);
+ trace_xs_stream_data_recv(transport);
/* Read in a new fragment marker if necessary */
/* Can we ever really expect to get completely empty fragments? */
- if (transport->tcp_flags & TCP_RCV_COPY_FRAGHDR) {
- xs_tcp_read_fraghdr(xprt, &desc);
+ if (transport->stream_flags & STREAM_RCV_COPY_FRAGHDR) {
+ xs_stream_read_fraghdr(xprt, &desc);
continue;
}
/* Read in the xid if necessary */
- if (transport->tcp_flags & TCP_RCV_COPY_XID) {
- xs_tcp_read_xid(transport, &desc);
+ if (transport->stream_flags & STREAM_RCV_COPY_XID) {
+ xs_stream_read_xid(transport, &desc);
continue;
}
/* Read in the call/reply flag */
- if (transport->tcp_flags & TCP_RCV_READ_CALLDIR) {
- xs_tcp_read_calldir(transport, &desc);
+ if (transport->stream_flags & STREAM_RCV_READ_CALLDIR) {
+ xs_stream_read_calldir(transport, &desc);
continue;
}
/* Read in the request data */
- if (transport->tcp_flags & TCP_RCV_COPY_DATA) {
- xs_tcp_read_data(xprt, &desc);
+ if (transport->stream_flags & STREAM_RCV_COPY_DATA) {
+ xs_stream_read_data(xprt, &desc);
continue;
}
/* Skip over any trailing bytes on short reads */
- xs_tcp_read_discard(transport, &desc);
+ xs_stream_read_discard(transport, &desc);
} while (desc.count);
- trace_xs_tcp_data_recv(transport);
- dprintk("RPC: xs_tcp_data_recv done\n");
+ trace_xs_stream_data_recv(transport);
+ dprintk("RPC: %s done\n", __func__);
return len - desc.count;
}

/**
- * xs_tcp_data_ready - "data ready" callback for TCP sockets
+ * xs_stream_data_ready - "data ready" callback for SOCK_STREAM sockets
* @sk: socket with data to read
* @bytes: how much data to read
*
*/
-static void xs_tcp_data_ready(struct sock *sk)
+static void xs_stream_data_ready(struct sock *sk)
{
struct rpc_xprt *xprt;
read_descriptor_t rd_desc;
int read;
unsigned long total = 0;

- dprintk("RPC: xs_tcp_data_ready...\n");
+ dprintk("RPC: %s...\n", __func__);

read_lock_bh(&sk->sk_callback_lock);
if (!(xprt = xprt_from_sock(sk))) {
@@ -1412,16 +1412,16 @@ static void xs_tcp_data_ready(struct sock *sk)
if (xprt->reestablish_timeout)
xprt->reestablish_timeout = 0;

- /* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
+ /* We use rd_desc to pass struct xprt to xs_stream_data_recv */
rd_desc.arg.data = xprt;
do {
rd_desc.count = 65536;
- read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
+ read = tcp_read_sock(sk, &rd_desc, xs_stream_data_recv);
if (read > 0)
total += read;
} while (read > 0);
out:
- trace_xs_tcp_data_ready(xprt, read, total);
+ trace_xs_stream_data_ready(xprt, read, total);
read_unlock_bh(&sk->sk_callback_lock);
}

@@ -1452,12 +1452,12 @@ static void xs_tcp_state_change(struct sock *sk)
struct sock_xprt *transport = container_of(xprt,
struct sock_xprt, xprt);

- /* Reset TCP record info */
- transport->tcp_offset = 0;
- transport->tcp_reclen = 0;
- transport->tcp_copied = 0;
- transport->tcp_flags =
- TCP_RCV_COPY_FRAGHDR | TCP_RCV_COPY_XID;
+ /* Reset stream record info */
+ transport->stream_offset = 0;
+ transport->stream_reclen = 0;
+ transport->stream_copied = 0;
+ transport->stream_flags =
+ STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
xprt->connect_cookie++;

xprt_wake_pending_tasks(xprt, -EAGAIN);
@@ -2081,7 +2081,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
xs_save_old_callbacks(transport, sk);

sk->sk_user_data = xprt;
- sk->sk_data_ready = xs_tcp_data_ready;
+ sk->sk_data_ready = xs_stream_data_ready;
sk->sk_state_change = xs_tcp_state_change;
sk->sk_write_space = xs_tcp_write_space;
sk->sk_error_report = xs_error_report;
--
2.4.2


2015-06-04 16:46:10

by Stefan Hajnoczi

Subject: [RFC 06/10] SUNRPC: add AF_VSOCK support to xprtsock.c

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/xprtsock.c | 392 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 380 insertions(+), 13 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 8b93ef5..055a350 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -151,6 +151,7 @@ enum xprt_transports {
XPRT_TRANSPORT_BC_TCP = IPPROTO_TCP | XPRT_TRANSPORT_BC,
XPRT_TRANSPORT_RDMA = 256,
XPRT_TRANSPORT_LOCAL = 257,
+ XPRT_TRANSPORT_VSOCK = 258,
};

struct rpc_xprt {
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 9fa63f7..0e2c6e8 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -46,6 +46,7 @@
#include <net/checksum.h>
#include <net/udp.h>
#include <net/tcp.h>
+#include <net/af_vsock.h>

#include <trace/events/sunrpc.h>

@@ -270,6 +271,13 @@ static void xs_format_common_peer_addresses(struct rpc_xprt *xprt)
sin6 = xs_addr_in6(xprt);
snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
break;
+ case AF_VSOCK:
+ (void)rpc_ntop(sap, buf, sizeof(buf));
+ xprt->address_strings[RPC_DISPLAY_ADDR] =
+ kstrdup(buf, GFP_KERNEL);
+ snprintf(buf, sizeof(buf), "%08x",
+ ((struct sockaddr_vm *)sap)->svm_cid);
+ break;
default:
BUG();
}
@@ -1747,21 +1755,30 @@ static int xs_bind(struct sock_xprt *transport, struct socket *sock)
nloop++;
} while (err == -EADDRINUSE && nloop != 2);

- if (myaddr.ss_family == AF_INET)
+ switch (myaddr.ss_family) {
+ case AF_INET:
dprintk("RPC: %s %pI4:%u: %s (%d)\n", __func__,
&((struct sockaddr_in *)&myaddr)->sin_addr,
port, err ? "failed" : "ok", err);
- else
+ break;
+ case AF_INET6:
dprintk("RPC: %s %pI6:%u: %s (%d)\n", __func__,
&((struct sockaddr_in6 *)&myaddr)->sin6_addr,
port, err ? "failed" : "ok", err);
+ break;
+ case AF_VSOCK:
+ dprintk("RPC: %s %u:%u: %s (%d)\n", __func__,
+ ((struct sockaddr_vm *)&myaddr)->svm_cid,
+ port, err ? "failed" : "ok", err);
+ break;
+ }
return err;
}

/*
- * We don't support autobind on AF_LOCAL sockets
+ * We don't support autobind on AF_LOCAL and AF_VSOCK sockets
*/
-static void xs_local_rpcbind(struct rpc_task *task)
+static void xs_dummy_rpcbind(struct rpc_task *task)
{
rcu_read_lock();
xprt_set_bound(rcu_dereference(task->tk_client->cl_xprt));
@@ -1800,6 +1817,14 @@ static inline void xs_reclassify_socket6(struct socket *sock)
&xs_slock_key[1], "sk_lock-AF_INET6-RPC", &xs_key[1]);
}

+static inline void xs_reclassify_socket_vsock(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+
+ sock_lock_init_class_and_name(sk, "slock-AF_VSOCK-RPC",
+ &xs_slock_key[1], "sk_lock-AF_VSOCK-RPC", &xs_key[1]);
+}
+
static inline void xs_reclassify_socket(int family, struct socket *sock)
{
WARN_ON_ONCE(sock_owned_by_user(sock->sk));
@@ -1816,6 +1841,9 @@ static inline void xs_reclassify_socket(int family, struct socket *sock)
case AF_INET6:
xs_reclassify_socket6(sock);
break;
+ case AF_VSOCK:
+ xs_reclassify_socket_vsock(sock);
+ break;
}
}
#else
@@ -1823,14 +1851,6 @@ static inline void xs_reclassify_socketu(struct socket *sock)
{
}

-static inline void xs_reclassify_socket4(struct socket *sock)
-{
-}
-
-static inline void xs_reclassify_socket6(struct socket *sock)
-{
-}
-
static inline void xs_reclassify_socket(int family, struct socket *sock)
{
}
@@ -2467,7 +2487,7 @@ static struct rpc_xprt_ops xs_local_ops = {
.reserve_xprt = xprt_reserve_xprt,
.release_xprt = xs_tcp_release_xprt,
.alloc_slot = xprt_alloc_slot,
- .rpcbind = xs_local_rpcbind,
+ .rpcbind = xs_dummy_rpcbind,
.set_port = xs_local_set_port,
.connect = xs_local_connect,
.buf_alloc = rpc_malloc,
@@ -2541,6 +2561,10 @@ static int xs_init_anyaddr(const int family, struct sockaddr *sap)
.sin6_family = AF_INET6,
.sin6_addr = IN6ADDR_ANY_INIT,
};
+ static const struct sockaddr_vm svm = {
+ .svm_family = AF_VSOCK,
+ .svm_cid = VMADDR_CID_ANY,
+ };

switch (family) {
case AF_LOCAL:
@@ -2551,6 +2575,9 @@ static int xs_init_anyaddr(const int family, struct sockaddr *sap)
case AF_INET6:
memcpy(sap, &sin6, sizeof(sin6));
break;
+ case AF_VSOCK:
+ memcpy(sap, &svm, sizeof(svm));
+ break;
default:
dprintk("RPC: %s: Bad address family\n", __func__);
return -EAFNOSUPPORT;
@@ -2903,6 +2930,329 @@ out_err:
return ret;
}

+#if IS_ENABLED(CONFIG_VSOCKETS)
+/**
+ * xs_vsock_state_change - callback to handle vsock socket state changes
+ * @sk: socket whose state has changed
+ *
+ */
+static void xs_vsock_state_change(struct sock *sk)
+{
+ struct rpc_xprt *xprt;
+
+ read_lock_bh(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+ dprintk("RPC: %s client %p...\n", __func__, xprt);
+ dprintk("RPC: state %x conn %d dead %d zapped %d sk_shutdown %d\n",
+ sk->sk_state, xprt_connected(xprt),
+ sock_flag(sk, SOCK_DEAD),
+ sock_flag(sk, SOCK_ZAPPED),
+ sk->sk_shutdown);
+
+ trace_rpc_socket_state_change(xprt, sk->sk_socket);
+
+ switch (sk->sk_state) {
+ case SS_CONNECTING:
+ /* Do nothing */
+ break;
+
+ case SS_CONNECTED:
+ spin_lock(&xprt->transport_lock);
+ if (!xprt_test_and_set_connected(xprt)) {
+ xs_stream_reset_state(xprt, vsock_read_sock);
+ xprt->connect_cookie++;
+
+ xprt_wake_pending_tasks(xprt, -EAGAIN);
+ }
+ spin_unlock(&xprt->transport_lock);
+ break;
+
+ case SS_DISCONNECTING:
+ /* TODO do we need to distinguish between various shutdown (client-side/server-side)? */
+ /* The client initiated a shutdown of the socket */
+ xprt->connect_cookie++;
+ xprt->reestablish_timeout = 0;
+ set_bit(XPRT_CLOSING, &xprt->state);
+ smp_mb__before_atomic();
+ clear_bit(XPRT_CONNECTED, &xprt->state);
+ clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ smp_mb__after_atomic();
+ break;
+
+ case SS_UNCONNECTED:
+ xs_sock_mark_closed(xprt);
+ break;
+ }
+
+ out:
+ read_unlock_bh(&sk->sk_callback_lock);
+}
+
+/**
+ * xs_vsock_error_report - callback to handle vsock socket state errors
+ * @sk: socket
+ *
+ * Note: we don't call sock_error() since there may be a rpc_task
+ * using the socket, and so we don't want to clear sk->sk_err.
+ */
+static void xs_vsock_error_report(struct sock *sk)
+{
+ struct rpc_xprt *xprt;
+ int err;
+
+ read_lock_bh(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+
+ err = -sk->sk_err;
+ if (err == 0)
+ goto out;
+ /* Is this a reset event? */
+ if (sk->sk_state == SS_UNCONNECTED)
+ xs_sock_mark_closed(xprt);
+ dprintk("RPC: %s client %p, error=%d...\n",
+ __func__, xprt, -err);
+ trace_rpc_socket_error(xprt, sk->sk_socket, err);
+ xprt_wake_pending_tasks(xprt, err);
+ out:
+ read_unlock_bh(&sk->sk_callback_lock);
+}
+
+/**
+ * xs_vsock_finish_connecting - initialize and connect socket
+ */
+static int xs_vsock_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
+{
+ struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ int ret = -ENOTCONN;
+
+ if (!transport->inet) {
+ struct sock *sk = sock->sk;
+
+ write_lock_bh(&sk->sk_callback_lock);
+
+ xs_save_old_callbacks(transport, sk);
+
+ sk->sk_user_data = xprt;
+ sk->sk_data_ready = xs_stream_data_ready;
+ sk->sk_state_change = xs_vsock_state_change;
+ sk->sk_write_space = xs_tcp_write_space;
+ sk->sk_error_report = xs_vsock_error_report;
+ sk->sk_allocation = GFP_ATOMIC;
+
+ xprt_clear_connected(xprt);
+
+ /* Reset to new socket */
+ transport->sock = sock;
+ transport->inet = sk;
+
+ write_unlock_bh(&sk->sk_callback_lock);
+ }
+
+ if (!xprt_bound(xprt))
+ goto out;
+
+ xs_set_memalloc(xprt);
+
+ /* Tell the socket layer to start connecting... */
+ xprt->stat.connect_count++;
+ xprt->stat.connect_start = jiffies;
+ ret = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK);
+ switch (ret) {
+ case 0:
+ xs_set_srcport(transport, sock);
+ case -EINPROGRESS:
+ /* SYN_SENT! */
+ if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
+ xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+ }
+out:
+ return ret;
+}
+
+/**
+ * xs_vsock_setup_socket - create a vsock socket and connect to a remote endpoint
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_vsock_setup_socket(struct work_struct *work)
+{
+ struct sock_xprt *transport =
+ container_of(work, struct sock_xprt, connect_worker.work);
+ struct socket *sock = transport->sock;
+ struct rpc_xprt *xprt = &transport->xprt;
+ int status = -EIO;
+
+ if (!sock) {
+ sock = xs_create_sock(xprt, transport,
+ xs_addr(xprt)->sa_family, SOCK_STREAM,
+ 0, true);
+ if (IS_ERR(sock)) {
+ status = PTR_ERR(sock);
+ goto out;
+ }
+ }
+
+ dprintk("RPC: worker connecting xprt %p via %s to "
+ "%s (port %s)\n", xprt,
+ xprt->address_strings[RPC_DISPLAY_PROTO],
+ xprt->address_strings[RPC_DISPLAY_ADDR],
+ xprt->address_strings[RPC_DISPLAY_PORT]);
+
+ status = xs_vsock_finish_connecting(xprt, sock);
+ trace_rpc_socket_connect(xprt, sock, status);
+ dprintk("RPC: %p connect status %d connected %d sock state %d\n",
+ xprt, -status, xprt_connected(xprt),
+ sock->sk->sk_state);
+ switch (status) {
+ default:
+ printk("%s: connect returned unhandled error %d\n",
+ __func__, status);
+ case -EADDRNOTAVAIL:
+ /* We're probably in TIME_WAIT. Get rid of existing socket,
+ * and retry
+ */
+ xs_tcp_force_close(xprt);
+ break;
+ case 0:
+ case -EINPROGRESS:
+ case -EALREADY:
+ xprt_unlock_connect(xprt, transport);
+ xprt_clear_connecting(xprt);
+ return;
+ case -EINVAL:
+ /* Happens, for instance, if the user specified a link
+ * local IPv6 address without a scope-id.
+ */
+ case -ECONNREFUSED:
+ case -ECONNRESET:
+ case -ENETUNREACH:
+ case -EADDRINUSE:
+ case -ENOBUFS:
+ /* retry with existing socket, after a delay */
+ xs_tcp_force_close(xprt);
+ goto out;
+ }
+ status = -EAGAIN;
+out:
+ xprt_unlock_connect(xprt, transport);
+ xprt_clear_connecting(xprt);
+ xprt_wake_pending_tasks(xprt, status);
+}
+
+/**
+ * xs_vsock_print_stats - display vsock socket-specific stats
+ * @xprt: rpc_xprt struct containing statistics
+ * @seq: output file
+ *
+ */
+static void xs_vsock_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
+{
+ struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ long idle_time = 0;
+
+ if (xprt_connected(xprt))
+ idle_time = (long)(jiffies - xprt->last_used) / HZ;
+
+ seq_printf(seq, "\txprt:\tvsock %u %lu %lu %lu %ld %lu %lu %lu "
+ "%llu %llu %lu %llu %llu\n",
+ transport->srcport,
+ xprt->stat.bind_count,
+ xprt->stat.connect_count,
+ xprt->stat.connect_time,
+ idle_time,
+ xprt->stat.sends,
+ xprt->stat.recvs,
+ xprt->stat.bad_xids,
+ xprt->stat.req_u,
+ xprt->stat.bklog_u,
+ xprt->stat.max_slots,
+ xprt->stat.sending_u,
+ xprt->stat.pending_u);
+}
+
+static struct rpc_xprt_ops xs_vsock_ops = {
+ .reserve_xprt = xprt_reserve_xprt,
+ .release_xprt = xs_tcp_release_xprt,
+ .alloc_slot = xprt_lock_and_alloc_slot,
+ .rpcbind = xs_dummy_rpcbind,
+ .set_port = xs_set_port,
+ .connect = xs_connect,
+ .buf_alloc = rpc_malloc,
+ .buf_free = rpc_free,
+ .send_request = xs_tcp_send_request,
+ .set_retrans_timeout = xprt_set_retrans_timeout_def,
+ .close = xs_tcp_shutdown,
+ .destroy = xs_destroy,
+ .print_stats = xs_vsock_print_stats,
+};
+
+static const struct rpc_timeout xs_vsock_default_timeout = {
+ .to_initval = 60 * HZ,
+ .to_maxval = 60 * HZ,
+ .to_retries = 2,
+};
+
+/**
+ * xs_setup_vsock - Set up transport to use a vsock socket
+ * @args: rpc transport creation arguments
+ *
+ */
+static struct rpc_xprt *xs_setup_vsock(struct xprt_create *args)
+{
+ struct sockaddr_vm *addr = (struct sockaddr_vm *)args->dstaddr;
+ struct sock_xprt *transport;
+ struct rpc_xprt *xprt;
+ struct rpc_xprt *ret;
+
+ xprt = xs_setup_xprt(args, xprt_tcp_slot_table_entries,
+ xprt_max_tcp_slot_table_entries);
+ if (IS_ERR(xprt))
+ return xprt;
+ transport = container_of(xprt, struct sock_xprt, xprt);
+
+ xprt->prot = 0;
+ xprt->tsh_size = sizeof(rpc_fraghdr) / sizeof(u32);
+ xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
+
+ xprt->bind_timeout = XS_BIND_TO;
+ xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+ xprt->idle_timeout = XS_IDLE_DISC_TO;
+
+ xprt->ops = &xs_vsock_ops;
+ xprt->timeout = &xs_vsock_default_timeout;
+
+ switch (addr->svm_family) {
+ case AF_VSOCK:
+ if (addr->svm_port == 0) {
+ dprintk("RPC: autobind not supported with AF_VSOCK\n");
+ ret = ERR_PTR(-EINVAL);
+ goto out_err;
+ }
+ xprt_set_bound(xprt);
+ INIT_DELAYED_WORK(&transport->connect_worker,
+ xs_vsock_setup_socket);
+ xs_format_peer_addresses(xprt, "vsock", "vsock" /* TODO register official netid? */);
+ break;
+ default:
+ ret = ERR_PTR(-EAFNOSUPPORT);
+ goto out_err;
+ }
+
+ dprintk("RPC: set up xprt to %s (port %s) via AF_VSOCK\n",
+ xprt->address_strings[RPC_DISPLAY_ADDR],
+ xprt->address_strings[RPC_DISPLAY_PORT]);
+
+ if (try_module_get(THIS_MODULE))
+ return xprt;
+ ret = ERR_PTR(-EINVAL);
+out_err:
+ xs_xprt_free(xprt);
+ return ret;
+}
+#endif
+
static struct xprt_class xs_local_transport = {
.list = LIST_HEAD_INIT(xs_local_transport.list),
.name = "named UNIX socket",
@@ -2935,6 +3285,16 @@ static struct xprt_class xs_bc_tcp_transport = {
.setup = xs_setup_bc_tcp,
};

+#if IS_ENABLED(CONFIG_VSOCKETS)
+static struct xprt_class xs_vsock_transport = {
+ .list = LIST_HEAD_INIT(xs_vsock_transport.list),
+ .name = "vsock",
+ .owner = THIS_MODULE,
+ .ident = XPRT_TRANSPORT_VSOCK,
+ .setup = xs_setup_vsock,
+};
+#endif
+
/**
* init_socket_xprt - set up xprtsock's sysctls, register with RPC client
*
@@ -2950,6 +3310,9 @@ int init_socket_xprt(void)
xprt_register_transport(&xs_udp_transport);
xprt_register_transport(&xs_tcp_transport);
xprt_register_transport(&xs_bc_tcp_transport);
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ xprt_register_transport(&xs_vsock_transport);
+#endif

return 0;
}
@@ -2971,6 +3334,9 @@ void cleanup_socket_xprt(void)
xprt_unregister_transport(&xs_udp_transport);
xprt_unregister_transport(&xs_tcp_transport);
xprt_unregister_transport(&xs_bc_tcp_transport);
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ xprt_unregister_transport(&xs_vsock_transport);
+#endif
}

static int param_set_uint_minmax(const char *val,
--
2.4.2


2015-06-04 16:46:12

by Stefan Hajnoczi

Subject: [RFC 07/10] SUNRPC: restrict backchannel svc IPPROTO_TCP check to IP

The IPPROTO_TCP check only applies to AF_INET and AF_INET6.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
net/sunrpc/svc.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 78974e4..1c9c5bc 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1365,9 +1365,16 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
/* reset result send buffer "put" position */
resv->iov_len = 0;

- if (rqstp->rq_prot != IPPROTO_TCP) {
- printk(KERN_ERR "No support for Non-TCP transports!\n");
- BUG();
+ switch (((struct sockaddr *)&rqstp->rq_addr)->sa_family) {
+ case AF_INET:
+ case AF_INET6:
+ if (rqstp->rq_prot != IPPROTO_TCP) {
+ printk(KERN_ERR "No support for Non-TCP transports!\n");
+ BUG();
+ }
+ break;
+ default:
+ break;
}

/*
--
2.4.2


2015-06-04 16:46:14

by Stefan Hajnoczi

Subject: [RFC 08/10] SUNRPC: add vsock-bc backchannel

Note that this is currently a hack that unconditionally replaces
"tcp-bc" with "vsock-bc". A way is still needed to select the
appropriate backchannel name based on the address family; one possible
direction is sketched below.
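
A sketch only (the helper name is invented, and this is not part of
the patch):

  static const char *nfs41_bc_xprt_name(sa_family_t family)
  {
          switch (family) {
          case AF_INET:
          case AF_INET6:
                  return "tcp-bc";
          case AF_VSOCK:
                  return "vsock-bc";
          default:
                  return NULL;
          }
  }

nfs41_callback_up_net() would then pass the returned class name and a
matching protocol family to svc_create_xprt().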

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/nfs/callback.c | 7 +++++--
net/sunrpc/svcsock.c | 48 ++++++++++++++++++++++++++++++------------------
2 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 8d129bb..ce4cae1 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -106,8 +106,11 @@ static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
* fore channel connection.
* Returns the input port (0) and sets the svc_serv bc_xprt on success
*/
- return svc_create_xprt(serv, "tcp-bc", net, PF_INET, 0,
- SVC_SOCK_ANONYMOUS);
+/* return svc_create_xprt(serv, "tcp-bc", net, PF_INET, 0,
+ SVC_SOCK_ANONYMOUS); */
+ /* TODO check address family and choose appropriately */
+ return svc_create_xprt(serv, "vsock-bc", net, AF_VSOCK, 0,
+ SVC_SOCK_ANONYMOUS);
}

/*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0c81202..c6ba593 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -71,7 +71,7 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *, int,
struct net *, struct sockaddr *,
int, int);
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
-static struct svc_xprt *svc_bc_create_socket(struct svc_serv *, int,
+static struct svc_xprt *svc_bc_create_socket(struct svc_serv *,
struct net *, struct sockaddr *,
int, int);
static void svc_bc_sock_free(struct svc_xprt *xprt);
@@ -1232,25 +1232,17 @@ static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
}

#if defined(CONFIG_SUNRPC_BACKCHANNEL)
-static struct svc_xprt *svc_bc_create_socket(struct svc_serv *, int,
+static struct svc_xprt *svc_bc_create_socket(struct svc_serv *,
struct net *, struct sockaddr *,
int, int);
static void svc_bc_sock_free(struct svc_xprt *xprt);

-static struct svc_xprt *svc_bc_tcp_create(struct svc_serv *serv,
- struct net *net,
- struct sockaddr *sa, int salen,
- int flags)
-{
- return svc_bc_create_socket(serv, IPPROTO_TCP, net, sa, salen, flags);
-}
-
static void svc_bc_tcp_sock_detach(struct svc_xprt *xprt)
{
}

static struct svc_xprt_ops svc_tcp_bc_ops = {
- .xpo_create = svc_bc_tcp_create,
+ .xpo_create = svc_bc_create_socket,
.xpo_detach = svc_bc_tcp_sock_detach,
.xpo_free = svc_bc_sock_free,
.xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
@@ -1264,14 +1256,41 @@ static struct svc_xprt_class svc_tcp_bc_class = {
.xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
};

+#if IS_ENABLED(CONFIG_VSOCKETS)
+static void svc_bc_vsock_sock_detach(struct svc_xprt *xprt)
+{
+}
+
+static struct svc_xprt_ops svc_vsock_bc_ops = {
+ .xpo_create = svc_bc_create_socket,
+ .xpo_detach = svc_bc_vsock_sock_detach,
+ .xpo_free = svc_bc_sock_free,
+ .xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
+ .xpo_secure_port = svc_sock_secure_port,
+};
+
+static struct svc_xprt_class svc_vsock_bc_class = {
+ .xcl_name = "vsock-bc",
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_vsock_bc_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD,
+};
+#endif /* IS_ENABLED(CONFIG_VSOCKETS) */
+
static void svc_init_bc_xprt_sock(void)
{
svc_reg_xprt_class(&svc_tcp_bc_class);
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ svc_reg_xprt_class(&svc_vsock_bc_class);
+#endif
}

static void svc_cleanup_bc_xprt_sock(void)
{
svc_unreg_xprt_class(&svc_tcp_bc_class);
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ svc_unreg_xprt_class(&svc_vsock_bc_class);
+#endif
}
#else /* CONFIG_SUNRPC_BACKCHANNEL */
static void svc_init_bc_xprt_sock(void)
@@ -1635,7 +1654,6 @@ static void svc_sock_free(struct svc_xprt *xprt)
* Create a back channel svc_xprt which shares the fore channel socket.
*/
static struct svc_xprt *svc_bc_create_socket(struct svc_serv *serv,
- int protocol,
struct net *net,
struct sockaddr *sin, int len,
int flags)
@@ -1643,12 +1661,6 @@ static struct svc_xprt *svc_bc_create_socket(struct svc_serv *serv,
struct svc_sock *svsk;
struct svc_xprt *xprt;

- if (protocol != IPPROTO_TCP) {
- printk(KERN_WARNING "svc: only TCP sockets"
- " supported on shared back channel\n");
- return ERR_PTR(-EINVAL);
- }
-
svsk = kzalloc(sizeof(*svsk), GFP_KERNEL);
if (!svsk)
return ERR_PTR(-ENOMEM);
--
2.4.2


2015-06-04 16:46:17

by Stefan Hajnoczi

Subject: [RFC 09/10] SUNRPC: add AF_VSOCK support to svc_xprt.c

Allow creation of AF_VSOCK service xprts. This is needed for the
"vsock-bc" backchannel.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/svc_xprt.h | 12 ++++++++++++
net/sunrpc/svc_xprt.c | 13 +++++++++++++
2 files changed, 25 insertions(+)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 79f6f8f..f0cb5a8 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -8,6 +8,7 @@
#define SUNRPC_SVC_XPRT_H

#include <linux/sunrpc/svc.h>
+#include <linux/vm_sockets.h>

struct module;

@@ -150,12 +151,15 @@ static inline unsigned short svc_addr_port(const struct sockaddr *sa)
{
const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sa;
+ const struct sockaddr_vm *svm = (const struct sockaddr_vm *)sa;

switch (sa->sa_family) {
case AF_INET:
return ntohs(sin->sin_port);
case AF_INET6:
return ntohs(sin6->sin6_port);
+ case AF_VSOCK:
+ return svm->svm_port;
}

return 0;
@@ -168,6 +172,8 @@ static inline size_t svc_addr_len(const struct sockaddr *sa)
return sizeof(struct sockaddr_in);
case AF_INET6:
return sizeof(struct sockaddr_in6);
+ case AF_VSOCK:
+ return sizeof(struct sockaddr_vm);
}
BUG();
}
@@ -187,6 +193,7 @@ static inline char *__svc_print_addr(const struct sockaddr *addr,
{
const struct sockaddr_in *sin = (const struct sockaddr_in *)addr;
const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)addr;
+ const struct sockaddr_vm *svm = (const struct sockaddr_vm *)addr;

switch (addr->sa_family) {
case AF_INET:
@@ -200,6 +207,11 @@ static inline char *__svc_print_addr(const struct sockaddr *addr,
ntohs(sin6->sin6_port));
break;

+ case AF_VSOCK:
+ snprintf(buf, len, "%u, port=%u",
+ svm->svm_cid, svm->svm_port);
+ break;
+
default:
snprintf(buf, len, "unknown address type: %d", addr->sa_family);
break;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 163ac45..9e011d1 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -188,6 +188,13 @@ static struct svc_xprt *__svc_xpo_create(struct svc_xprt_class *xcl,
.sin6_port = htons(port),
};
#endif
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ struct sockaddr_vm svm = {
+ .svm_family = AF_VSOCK,
+ .svm_cid = VMADDR_CID_ANY,
+ .svm_port = VMADDR_PORT_ANY,
+ };
+#endif
struct sockaddr *sap;
size_t len;

@@ -202,6 +209,12 @@ static struct svc_xprt *__svc_xpo_create(struct svc_xprt_class *xcl,
len = sizeof(sin6);
break;
#endif
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ case AF_VSOCK:
+ sap = (struct sockaddr *)&svm;
+ len = sizeof(svm);
+ break;
+#endif
default:
return ERR_PTR(-EAFNOSUPPORT);
}
--
2.4.2


2015-06-04 16:46:19

by Stefan Hajnoczi

Subject: [RFC 10/10] NFS: add AF_VSOCK support to NFS client

This patch adds AF_VSOCK to the NFS client. Mounts can now use the
"vsock" proto option and pass "vsock:<cid>" address strings, which
sunrpc interprets when creating the xprt.
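
Example, matching the cover letter (hypervisor CID 2, guest CID 3):

  # /sbin/mount.nfs 2:/export /mnt -o clientaddr=3,proto=vsock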

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/nfs/client.c | 16 ++++++++++++++++
fs/nfs/super.c | 10 ++++++++++
2 files changed, 26 insertions(+)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 892aeff..ff282f1 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -34,6 +34,7 @@
#include <linux/vfs.h>
#include <linux/inet.h>
#include <linux/in6.h>
+#include <linux/vm_sockets.h>
#include <linux/slab.h>
#include <linux/idr.h>
#include <net/ipv6.h>
@@ -354,6 +355,16 @@ static int nfs_sockaddr_cmp_ip4(const struct sockaddr *sa1,
(sin1->sin_port == sin2->sin_port);
}

+static int nfs_sockaddr_cmp_vsock(const struct sockaddr *sa1,
+ const struct sockaddr *sa2)
+{
+ const struct sockaddr_vm *svm1 = (const struct sockaddr_vm *)sa1;
+ const struct sockaddr_vm *svm2 = (const struct sockaddr_vm *)sa2;
+
+ return svm1->svm_cid == svm2->svm_cid &&
+ svm1->svm_port == svm2->svm_port;
+}
+
#if defined(CONFIG_NFS_V4_1)
/*
* Test if two socket addresses represent the same actual socket,
@@ -370,6 +381,8 @@ int nfs_sockaddr_match_ipaddr(const struct sockaddr *sa1,
return nfs_sockaddr_match_ipaddr4(sa1, sa2);
case AF_INET6:
return nfs_sockaddr_match_ipaddr6(sa1, sa2);
+ default:
+ BUG();
}
return 0;
}
@@ -391,6 +404,8 @@ static int nfs_sockaddr_cmp(const struct sockaddr *sa1,
return nfs_sockaddr_cmp_ip4(sa1, sa2);
case AF_INET6:
return nfs_sockaddr_cmp_ip6(sa1, sa2);
+ case AF_VSOCK:
+ return nfs_sockaddr_cmp_vsock(sa1, sa2);
}
return 0;
}
@@ -545,6 +560,7 @@ void nfs_init_timeout_values(struct rpc_timeout *to, int proto,
switch (proto) {
case XPRT_TRANSPORT_TCP:
case XPRT_TRANSPORT_RDMA:
+ case XPRT_TRANSPORT_VSOCK:
if (to->to_retries == 0)
to->to_retries = NFS_DEF_TCP_RETRANS;
if (to->to_initval == 0)
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index f175b83..564ed41 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -191,6 +191,7 @@ static const match_table_t nfs_mount_option_tokens = {

enum {
Opt_xprt_udp, Opt_xprt_udp6, Opt_xprt_tcp, Opt_xprt_tcp6, Opt_xprt_rdma,
+ Opt_xprt_vsock,

Opt_xprt_err
};
@@ -201,6 +202,7 @@ static const match_table_t nfs_xprt_protocol_tokens = {
{ Opt_xprt_tcp, "tcp" },
{ Opt_xprt_tcp6, "tcp6" },
{ Opt_xprt_rdma, "rdma" },
+ { Opt_xprt_vsock, "vsock" },

{ Opt_xprt_err, NULL }
};
@@ -964,6 +966,8 @@ static int nfs_verify_server_address(struct sockaddr *addr)
struct in6_addr *sa = &((struct sockaddr_in6 *)addr)->sin6_addr;
return !ipv6_addr_any(sa);
}
+ case AF_VSOCK:
+ return 1;
}

dfprintk(MOUNT, "NFS: Invalid IP address specified\n");
@@ -993,6 +997,7 @@ static void nfs_validate_transport_protocol(struct nfs_parsed_mount_data *mnt)
case XPRT_TRANSPORT_UDP:
case XPRT_TRANSPORT_TCP:
case XPRT_TRANSPORT_RDMA:
+ case XPRT_TRANSPORT_VSOCK:
break;
default:
mnt->nfs_server.protocol = XPRT_TRANSPORT_TCP;
@@ -1459,6 +1464,11 @@ static int nfs_parse_mount_options(char *raw,
mnt->nfs_server.protocol = XPRT_TRANSPORT_RDMA;
xprt_load_transport(string);
break;
+ case Opt_xprt_vsock:
+ protofamily = AF_VSOCK;
+ mnt->flags &= ~NFS_MOUNT_TCP;
+ mnt->nfs_server.protocol = XPRT_TRANSPORT_VSOCK;
+ break;
default:
dfprintk(MOUNT, "NFS: unrecognized "
"transport protocol\n");
--
2.4.2


2015-06-04 16:46:05

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 04/10] SUNRPC: extract xs_stream_reset_state()

Extract a function to reset the record fragment parser.

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
net/sunrpc/xprtsock.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 06fde0e..9fa63f7 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1431,6 +1431,28 @@ out:
}

/**
+ * xs_stream_reset_state - reset SOCK_STREAM record parser
+ * @xprt: rpc_xprt owning the socket transport to reset
+ * @read_sock: tcp_read_sock()-like function
+ *
+ */
+static void xs_stream_reset_state(struct rpc_xprt *xprt,
+ int (*read_sock)(struct sock *,
+ read_descriptor_t *,
+ sk_read_actor_t))
+{
+ struct sock_xprt *transport = container_of(xprt,
+ struct sock_xprt, xprt);
+
+ transport->stream_offset = 0;
+ transport->stream_reclen = 0;
+ transport->stream_copied = 0;
+ transport->stream_flags =
+ STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
+ transport->stream_read_sock = read_sock;
+}
+
+/**
* xs_tcp_state_change - callback to handle TCP socket state changes
* @sk: socket whose state has changed
*
@@ -1454,16 +1476,7 @@ static void xs_tcp_state_change(struct sock *sk)
case TCP_ESTABLISHED:
spin_lock(&xprt->transport_lock);
if (!xprt_test_and_set_connected(xprt)) {
- struct sock_xprt *transport = container_of(xprt,
- struct sock_xprt, xprt);
-
- /* Reset stream record info */
- transport->stream_offset = 0;
- transport->stream_reclen = 0;
- transport->stream_copied = 0;
- transport->stream_flags =
- STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
- transport->stream_read_sock = tcp_read_sock;
+ xs_stream_reset_state(xprt, tcp_read_sock);
xprt->connect_cookie++;

xprt_wake_pending_tasks(xprt, -EAGAIN);
--
2.4.2


2015-06-04 16:46:03

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 03/10] SUNRPC: abstract tcp_read_sock() in record fragment parser

Use a function pointer to abstract tcp_read_sock()-like functions. For
TCP this function will be tcp_read_sock(). For AF_VSOCK it will be
vsock_read_sock().

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/sunrpc/xprtsock.h | 6 ++++++
net/sunrpc/xprtsock.c | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index e3de7bf..3b4cd4c 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -9,6 +9,9 @@

#ifdef __KERNEL__

+/* TODO why does this header have no includes? */
+#include <net/tcp.h> /* for sk_read_actor_t */
+
int init_socket_xprt(void);
void cleanup_socket_xprt(void);

@@ -39,6 +42,9 @@ struct sock_xprt {
unsigned long stream_copied,
stream_flags;

+ int (*stream_read_sock)(struct sock *,
+ read_descriptor_t *,
+ sk_read_actor_t);

/*
* Connection of transports
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c84d45e..06fde0e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1395,6 +1395,7 @@ static int xs_stream_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
static void xs_stream_data_ready(struct sock *sk)
{
struct rpc_xprt *xprt;
+ struct sock_xprt *transport;
read_descriptor_t rd_desc;
int read;
unsigned long total = 0;
@@ -1406,6 +1407,9 @@ static void xs_stream_data_ready(struct sock *sk)
read = 0;
goto out;
}
+
+ transport = container_of(xprt, struct sock_xprt, xprt);
+
/* Any data means we had a useful conversation, so
* we don't need to delay the next reconnect
*/
@@ -1416,7 +1420,8 @@ static void xs_stream_data_ready(struct sock *sk)
rd_desc.arg.data = xprt;
do {
rd_desc.count = 65536;
- read = tcp_read_sock(sk, &rd_desc, xs_stream_data_recv);
+ read = transport->stream_read_sock(sk, &rd_desc,
+ xs_stream_data_recv);
if (read > 0)
total += read;
} while (read > 0);
@@ -1458,6 +1463,7 @@ static void xs_tcp_state_change(struct sock *sk)
transport->stream_copied = 0;
transport->stream_flags =
STREAM_RCV_COPY_FRAGHDR | STREAM_RCV_COPY_XID;
+ transport->stream_read_sock = tcp_read_sock;
xprt->connect_cookie++;

xprt_wake_pending_tasks(xprt, -EAGAIN);
--
2.4.2


2015-06-04 16:46:07

by Stefan Hajnoczi

[permalink] [raw]
Subject: [RFC 05/10] VSOCK: add tcp_read_sock()-like vsock_read_sock() function

The tcp_read_sock() interface dequeues skbs and gives them to the
caller's callback function for processing. This interface can avoid
data copies since the caller accesses the skb instead of using its own
receive buffer.

This patch implements vsock_read_sock() for AF_VSOCK SOCK_STREAM
sockets. The implementation is only for virtio-vsock at this time, not
for the VMware VMCI transport. It is not zero-copy yet because the
virtio-vsock receive queue does not consist of skbs.

The tcp_read_sock()-like interface is needed for AF_VSOCK sunrpc
support.
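
As a rough illustration of the interface (a hypothetical caller, not
code from this series; drain_and_count() and count_actor() are made-up
names), a read actor is handed each skb and returns how many bytes it
consumed:

    /* Sketch of a vsock_read_sock() caller.  The signatures match the
     * interface added in this patch; the helpers are hypothetical.
     */
    #include <net/tcp.h>        /* sk_read_actor_t, read_descriptor_t */
    #include <net/af_vsock.h>   /* vsock_read_sock() */

    static int count_actor(read_descriptor_t *desc, struct sk_buff *skb,
                           unsigned int offset, size_t len)
    {
            size_t *total = desc->arg.data;

            *total += len;
            return len;         /* we consumed everything we were shown */
    }

    /* Drains the socket's receive queue, returning the byte count. */
    static size_t drain_and_count(struct sock *sk)
    {
            size_t total = 0;
            read_descriptor_t rd_desc = {
                    .count = 65536,         /* upper bound on bytes read */
                    .arg.data = &total,
            };

            vsock_read_sock(sk, &rd_desc, count_actor);
            return total;
    }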

Signed-off-by: Stefan Hajnoczi <[email protected]>
---
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 4 +++
include/net/af_vsock.h | 5 +++
net/vmw_vsock/af_vsock.c | 15 +++++++++
net/vmw_vsock/virtio_transport.c | 1 +
net/vmw_vsock/virtio_transport_common.c | 55 +++++++++++++++++++++++++++++++++
net/vmw_vsock/vmci_transport.c | 8 +++++
7 files changed, 89 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 7a6f669..d715863 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -556,6 +556,7 @@ static struct vsock_transport vhost_transport = {
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
.stream_allow = virtio_transport_stream_allow,
+ .stream_read_sock = virtio_transport_stream_read_sock,

.notify_poll_in = virtio_transport_notify_poll_in,
.notify_poll_out = virtio_transport_notify_poll_out,
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 01d84a5..a8af8f0 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -37,6 +37,7 @@
#include <uapi/linux/virtio_vsock.h>
#include <linux/socket.h>
#include <net/sock.h>
+#include <net/tcp.h> /* for sk_read_actor_t */

#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE 128
#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE (1024 * 256)
@@ -176,6 +177,9 @@ int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
bool virtio_transport_stream_allow(u32 cid, u32 port);
+int virtio_transport_stream_read_sock(struct vsock_sock *vsk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);
int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr);
bool virtio_transport_dgram_allow(u32 cid, u32 port);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index bc9055c..2fb7ea3 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -19,6 +19,7 @@
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/vm_sockets.h>
+#include <net/tcp.h> /* for sk_read_actor_t */

#include "vsock_addr.h"

@@ -69,6 +70,8 @@ struct vsock_sock {
void *trans;
};

+int vsock_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);
s64 vsock_stream_has_data(struct vsock_sock *vsk);
s64 vsock_stream_has_space(struct vsock_sock *vsk);
void vsock_pending_work(struct work_struct *work);
@@ -118,6 +121,8 @@ struct vsock_transport {
u64 (*stream_rcvhiwat)(struct vsock_sock *);
bool (*stream_is_active)(struct vsock_sock *);
bool (*stream_allow)(u32 cid, u32 port);
+ int (*stream_read_sock)(struct vsock_sock *, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);

/* Notification. */
int (*notify_poll_in)(struct vsock_sock *, size_t, bool *);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 0b3c498..61b412c 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -764,6 +764,21 @@ static void vsock_sk_destruct(struct sock *sk)
put_cred(vsk->owner);
}

+int vsock_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ struct vsock_sock *vsp = vsock_sk(sk);
+
+ if (sk->sk_type != SOCK_STREAM)
+ return -EOPNOTSUPP;
+
+ if (sk->sk_state != SS_CONNECTED && sk->sk_state != SS_DISCONNECTING)
+ return -ENOTCONN;
+
+ return transport->stream_read_sock(vsp, desc, recv_actor);
+}
+EXPORT_SYMBOL(vsock_read_sock);
+
static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
int err;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 5c35b31..365e8a6 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -315,6 +315,7 @@ static struct vsock_transport virtio_transport = {
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
.stream_allow = virtio_transport_stream_allow,
+ .stream_read_sock = virtio_transport_stream_read_sock,

.notify_poll_in = virtio_transport_notify_poll_in,
.notify_poll_out = virtio_transport_notify_poll_out,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 1153d29..28122e2 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -320,6 +320,61 @@ virtio_transport_stream_dequeue(struct vsock_sock *vsk,
}
EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);

+int
+virtio_transport_stream_read_sock(struct vsock_sock *vsk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ struct virtio_transport *trans;
+ int ret = 0;
+
+ trans = vsk->trans;
+
+ mutex_lock(&trans->rx_lock);
+ while (trans->rx_bytes) {
+ struct virtio_vsock_pkt *pkt;
+ struct sk_buff *skb;
+ size_t len;
+ int used;
+
+ pkt = list_first_entry(&trans->rx_queue,
+ struct virtio_vsock_pkt, list);
+
+ len = pkt->len - pkt->off;
+ skb = alloc_skb(len, GFP_KERNEL);
+ if (!skb)
+ break;
+
+ memcpy(skb_put(skb, len),
+ pkt->buf + pkt->off,
+ len);
+
+ used = recv_actor(desc, skb, 0, len);
+
+ kfree_skb(skb);
+
+ if (used > 0) {
+ ret += used;
+ pkt->off += used;
+ if (pkt->off == pkt->len) {
+ virtio_transport_dec_rx_pkt(pkt);
+ list_del(&pkt->list);
+ virtio_transport_free_pkt(pkt);
+ }
+ }
+
+ if (used <= 0 || !desc->count)
+ break;
+ }
+ mutex_unlock(&trans->rx_lock);
+
+ if (ret > 0)
+ virtio_transport_send_credit_update(vsk, SOCK_STREAM, NULL);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_read_sock);
+
struct dgram_skb {
struct list_head list;
struct sk_buff *skb;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index c294da0..d329564 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -654,6 +654,13 @@ static bool vmci_transport_stream_allow(u32 cid, u32 port)
return true;
}

+static int vmci_transport_stream_read_sock(struct vsock_sock *vsk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+ return -EOPNOTSUPP; /* not yet implemented */
+}
+
/* This is invoked as part of a tasklet that's scheduled when the VMCI
* interrupt fires. This is run in bottom-half context but it defers most of
* its work to the packet handling work queue.
@@ -2083,6 +2090,7 @@ static struct vsock_transport vmci_transport = {
.stream_rcvhiwat = vmci_transport_stream_rcvhiwat,
.stream_is_active = vmci_transport_stream_is_active,
.stream_allow = vmci_transport_stream_allow,
+ .stream_read_sock = vmci_transport_stream_read_sock,
.notify_poll_in = vmci_transport_notify_poll_in,
.notify_poll_out = vmci_transport_notify_poll_out,
.notify_recv_init = vmci_transport_notify_recv_init,
--
2.4.2


2015-06-08 21:02:54

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC 00/10] NFS: add AF_VSOCK support to NFS client

On Thu, Jun 04, 2015 at 05:45:43PM +0100, Stefan Hajnoczi wrote:
> This patch series enables AF_VSOCK address family support in the NFS client.
> Please use the https://github.com/stefanha/linux.git vsock-nfs branch, which
> contains the dependencies for this series.
>
> The AF_VSOCK address family provides dgram and stream socket communication
> between virtual machines and hypervisors. A VMware VMCI transport is currently
> available in-tree (see net/vmw_vsock) and I have posted virtio-vsock patches
> for use with QEMU/KVM: http://thread.gmane.org/gmane.linux.network/365205
>
> The goal of this work is sharing files between virtual machines and
> hypervisors. AF_VSOCK is well-suited to this because it requires no
> configuration inside the virtual machine, making it simple to manage and
> reliable.
>
> Why NFS over AF_VSOCK?
> ----------------------
> It is unusual to add a new NFS transport, only TCP, RDMA, and UDP are currently
> supported. Here is the rationale for adding AF_VSOCK.
>
> Sharing files with a virtual machine can be configured manually:
> 1. Add a dedicated network card to the virtual machine. It will be used for
> NFS traffic.
> 2. Configure a local subnet and assign IP addresses to the virtual machine and
> hypervisor
> 3. Configure an NFS export on the hypervisor and start the NFS server
> 4. Mount the export inside the virtual machine
>
> Automating these steps poses a problem: modifying network configuration inside
> the virtual machine is invasive. It's hard to add a network interface to an
> arbitrary running system in an automated fashion, considering the network
> management tools, firewall rules, IP address usage, etc.
>
> Furthermore, the user may disrupt file sharing by accident when they add
> firewall rules, restart networking, etc because the NFS network interface is
> visible alongside the network interfaces managed by the user.
>
> AF_VSOCK is a zero-configuration network transport that avoids these problems.
> Adding it to a virtual machine is non-invasive. It also avoids accidental
> misconfiguration by the user. This is why "guest agents" and other services in
> various hypervisors (KVM, Xen, VMware, VirtualBox) do not use regular network
> interfaces.
>
> This is why AF_VSOCK is appropriate for providing shared files as a hypervisor
> service.
>
> The approach in this series
> ---------------------------
> AF_VSOCK stream sockets can be used for NFSv4.1 much in the same way as TCP.
> RFC 1831 record fragments divide messages since SOCK_STREAM semantics are
> present. The backchannel shares the connection just like the default TCP
> configuration.

So the NFSv4 backchannel isn't handled for now, I assume. And I guess
NFSv2/v3 is out too thanks to rpcbind? Which maybe is fine.

Do we need an IETF draft or similar to document how NFS should work over
AF_VSOCK?

NFS developers rely heavily on wireshark (and similar tools) for
debugging. Is that still possible over AF_VSOCK?

> Addresses are <Context ID, Port Number> pairs. These patches use "vsock:<cid>"
> string representation to distinguish AF_VSOCK addresses from IPv4 and IPv6
> numeric addresses.
>
> The patches cover the following areas:
>
> Patch 1 - support struct sockaddr_vm in sunrpc addr.h
>
> Patch 2-4 - make sunrpc TCP record fragment parser reusable for any stream
> socket
>
> Patch 5 - add tcp_read_sock()-like interface to AF_VSOCK sockets
>
> Patch 6 - extend sunrpc xprtsock.c for AF_VSOCK RPC clients
>
> Patch 7-9 - AF_VSOCK backchannel support
>
> Patch 10 - add AF_VSOCK support to NFS client
>
> The following example mounts /export from the hypervisor (CID 2) inside the
> virtual machine (CID 3):
>
> # /sbin/mount.nfs 2:/export /mnt -o clientaddr=3,proto=vsock
>
> Status
> ------
> I am looking for feedback on this approach. There are TODOs remaining in the code.
>
> Hopefully the way I add AF_VSOCK support to sunrpc is reasonable and something
> that can be standardized (a netid assigned and the uaddr string format decided).
>
> See below for the nfs-utils patch. It can be made nice once glibc
> getnameinfo()/getaddrinfo() support AF_VSOCK.
>
> The vsock_read_sock() implementation is dumb. Less of a NFS/SUNRPC issue and
> more of a vsock issue, but perhaps virtio_transport.c should use skbs for its
> receive queue instead of a custom packet struct. That would eliminate memory
> allocation and copying in vsock_read_sock().
>
> The next step is tackling NFS server. In the meantime, I have tested the
> patches using the nc-vsock netcat-like utility that is available in my Linux
> kernel repo below.

So by a netcat-like utility, you mean it's proxying between a client and
a server, so that the client thinks the server is communicating over
AF_VSOCK and the server thinks the client is using TCP? (Sorry, I
haven't looked at the code.)

Once we have a server and client, how will you recommend testing them?
(Will the server side need to run on real hardware?)

I guess if it works then the main question is whether it's worth
supporting another transport type in order to get the zero-configuration
host<->guest NFS setup. Or whether there's another way to get the same
gains.

Seems like a useful thing to have.

--b.

>
> Repositories
> ------------
> * Linux kernel: https://github.com/stefanha/linux.git vsock-nfs
> * QEMU virtio-vsock device: https://github.com/stefanha/qemu.git vsock
> * nfs-utils vsock: https://github.com/stefanha/nfs-utils.git vsock
>
> Stefan Hajnoczi (10):
> SUNRPC: add AF_VSOCK support to addr.h
> SUNRPC: rename "TCP" record parser to "stream" parser
> SUNRPC: abstract tcp_read_sock() in record fragment parser
> SUNRPC: extract xs_stream_reset_state()
> VSOCK: add tcp_read_sock()-like vsock_read_sock() function
> SUNRPC: add AF_VSOCK support to xprtsock.c
> SUNRPC: restrict backchannel svc IPPROTO_TCP check to IP
> SUNRPC: add vsock-bc backchannel
> SUNRPC: add AF_VSOCK support to svc_xprt.c
> NFS: add AF_VSOCK support to NFS client
>
> drivers/vhost/vsock.c | 1 +
> fs/nfs/callback.c | 7 +-
> fs/nfs/client.c | 16 +
> fs/nfs/super.c | 10 +
> include/linux/sunrpc/addr.h | 6 +
> include/linux/sunrpc/svc_xprt.h | 12 +
> include/linux/sunrpc/xprt.h | 1 +
> include/linux/sunrpc/xprtsock.h | 37 +-
> include/linux/virtio_vsock.h | 4 +
> include/net/af_vsock.h | 5 +
> include/trace/events/sunrpc.h | 30 +-
> net/sunrpc/addr.c | 57 +++
> net/sunrpc/svc.c | 13 +-
> net/sunrpc/svc_xprt.c | 13 +
> net/sunrpc/svcsock.c | 48 ++-
> net/sunrpc/xprtsock.c | 693 +++++++++++++++++++++++++-------
> net/vmw_vsock/af_vsock.c | 15 +
> net/vmw_vsock/virtio_transport.c | 1 +
> net/vmw_vsock/virtio_transport_common.c | 55 +++
> net/vmw_vsock/vmci_transport.c | 8 +
> 20 files changed, 825 insertions(+), 207 deletions(-)
>
> --
> 2.4.2
>

2015-06-10 16:43:18

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 00/10] NFS: add AF_VSOCK support to NFS client

On Mon, Jun 08, 2015 at 05:02:47PM -0400, J. Bruce Fields wrote:
> On Thu, Jun 04, 2015 at 05:45:43PM +0100, Stefan Hajnoczi wrote:
> > The approach in this series
> > ---------------------------
> > AF_VSOCK stream sockets can be used for NFSv4.1 much in the same way as TCP.
> > RFC 1831 record fragments divide messages since SOCK_STREAM semantics are
> > present. The backchannel shares the connection just like the default TCP
> > configuration.
>
> So the NFSv4 backchannel isn't handled for now, I assume.

Right, I did not touch nfs4_callback_up_net(), only
nfs41_callback_up_net().

If I'm reading the code right, NFSv4 uses a separate listen port for the
backchannel instead of sharing the client's socket?

This is possible to implement with AF_VSOCK but I have only tested
NFSv4.1 so far. Should I go ahead and do this?

> And I guess
> NFSv2/v3 is out too thanks to rpcbind? Which maybe is fine.

Yes, I ignored rpcbind and didn't test NFSv2/v3.

> Do we need an IETF draft or similar to document how NFS should work over
> AF_VSOCK?

I am not familiar with the standards process but I came across a few
places where it makes sense to have a standard:

* SUNRPC netid for AF_VSOCK (currently "tcp", "udp", and others exist)
* The uaddr string format ("vsock:...")
* Use of RFC 1831 record fragments (just like TCP) over AF_VSOCK
SOCK_STREAM sockets

These are all at the SUNRPC level rather than at the NFS protocol level.

Any idea who I need to talk to?

> NFS developers rely heavily on wireshark (and similar tools) for
> debugging. Is that still possible over AF_VSOCK?

No, this will require kernel and libpcap patches. Something like
drivers/net/nlmon.c is needed for AF_VSOCK. Basically a dummy network
interface and code that clones skbs when monitoring is enabled.

It's on the TODO list and will be very useful.
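
The rough shape, for anyone curious, would be something like the sketch
below. It is entirely hypothetical (none of these names exist yet) and
just mirrors what nlmon does for netlink; a transport that does not
queue skbs, like virtio-vsock, would first have to build one:

    /* Hypothetical vsockmon tap modeled on drivers/net/nlmon.c: clone
     * each captured skb and inject it into a dummy monitor netdev so
     * libpcap can see it.  None of these names exist today.
     */
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static void vsock_deliver_tap(struct sk_buff *skb,
                                  struct net_device *mon_dev)
    {
            struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);

            if (!nskb)
                    return;         /* best effort: drop on OOM */

            nskb->dev = mon_dev;
            netif_rx(nskb);         /* hand the clone to the capture device */
    }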

> > The next step is tackling NFS server. In the meantime, I have tested the
> > patches using the nc-vsock netcat-like utility that is available in my Linux
> > kernel repo below.
>
> So by a netcat-like utility, you mean it's proxying between client and a
> server so the client thinks the server is communicating over AF_VSOCK
> and the server thinks the client is using TCP? (Sorry, I haven't looked
> at the code.)

Yes, exactly. It works because the TCP and AF_VSOCK streams are almost
bit-compatible. I think the streams only diverge when network addresses
are transmitted inside the protocol (e.g. SUNRPC netids), but I haven't
encountered that with NFSv4.1 as long as pNFS and other fancy features
are not in use.
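
Concretely, the shared wire format is RFC 1831 record marking: each
fragment is preceded by a 4-byte big-endian marker whose top bit flags
the last fragment and whose low 31 bits give the fragment length. A
minimal sketch (illustrative helper, not code from this series):

    /* RFC 1831 record marker, identical on TCP and AF_VSOCK streams. */
    #include <linux/types.h>
    #include <asm/byteorder.h>

    static __be32 rpc_record_marker(u32 frag_len, bool last)
    {
            /* top bit: last-fragment flag; low 31 bits: length */
            return cpu_to_be32((last ? 0x80000000U : 0) |
                               (frag_len & 0x7fffffffU));
    }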

> Once we have a server and client, how will you recommend testing them?
> (Will the server side need to run on real hardware?)

I have been testing nfsd on the host and nfs client in a virtual
machine. Vice versa should work in the same way.

It's also possible to run nfsd in VM #1 and nfs client in VM #2 and use
the netcat-like utility on the host to forward the traffic. That way
any kernel panic happens in a VM and doesn't bring down the machine.
I'll probably begin using this approach when I start nfsd work.

> I guess if it works then the main question is whether it's worth
> supporting another transport type in order to get the zero-configuration
> host<->guest NFS setup. Or whether there's another way to get the same
> gains.

Thanks! If anyone has suggestions to avoid adding the AF_VSOCK
transport I'd be interested to learn about that.

Stefan



2015-06-10 18:09:28

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC 00/10] NFS: add AF_VSOCK support to NFS client

On Wed, Jun 10, 2015 at 05:43:15PM +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 08, 2015 at 05:02:47PM -0400, J. Bruce Fields wrote:
> > On Thu, Jun 04, 2015 at 05:45:43PM +0100, Stefan Hajnoczi wrote:
> > > The approach in this series
> > > ---------------------------
> > > AF_VSOCK stream sockets can be used for NFSv4.1 much in the same way as TCP.
> > > RFC 1831 record fragments divide messages since SOCK_STREAM semantics are
> > > present. The backchannel shares the connection just like the default TCP
> > > configuration.
> >
> > So the NFSv4 backchannel isn't handled for now, I assume.
>
> Right, I did not touch nfs4_callback_up_net(), only
> nfs41_callback_up_net().
>
> If I'm reading the code right NFSv4 uses a separate listen port for the
> backchannel instead of sharing the client's socket?

Right.

> This is possible to implement with AF_VSOCK but I have only tested
> NFSv4.1 so far. Should I go ahead and do this?

Personally I'd make it a lower priority--I don't see why you can't make
4.1 a requirement for the new transport--but I'd be curious what others
have to say.

> > And I guess
> > NFSv2/v3 is out too thanks to rpcbind? Which maybe is fine.
>
> Yes, I ignored rpcbind and didn't test NFSv2/v3.
>
> > Do we need an IETF draft or similar to document how NFS should work over
> > AF_VSOCK?
>
> I am not familiar with the standards process but I came across a few
> places where it makes sense to have a standard:
>
> * SUNRPC netid for AF_VSOCK (currently "tcp", "udp", and others exist)
> * The uaddr string format ("vsock:...")

Off the top of my head I can't remember where else that's used in the
protocol other than in setting up the 4.0 callback connection (and in
rpcbind).

> * Use of RFC 1831 record fragments (just like TCP) over AF_VSOCK
> SOCK_STREAM sockets

As far as I can tell, 1831 claims to be independent of any transport
protocol details: "The RPC protocol can be implemented on several
different transport protocols. The RPC protocol does not care how a
message is passed from one process to another, but only with
specification and interpretation of messages." And: "When RPC messages
are passed on top of a byte stream transport protocol (like TCP)"....
So perhaps there's nothing more to say here.

> These are all at the SUNRPC level rather than at the NFS protocol level.
>
> Any idea who I need to talk to?

Anyway, if there is anything to be worked out, [email protected] is the
place to go.

--b.

2015-06-11 09:19:14

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [RFC 00/10] NFS: add AF_VSOCK support to NFS client

On Wed, Jun 10, 2015 at 02:09:26PM -0400, J. Bruce Fields wrote:
> On Wed, Jun 10, 2015 at 05:43:15PM +0100, Stefan Hajnoczi wrote:
> > These are all at the SUNRPC level rather than at the NFS protocol level.
> >
> > Any idea who I need to talk to?
>
> Anyay, if there is anything to be worked out, [email protected] is the
> place to go.

Thanks, I can write a summary and send it there.

Stefan

