2008-07-07 20:51:22

by Trond Myklebust

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:

> If you would like connected UDP, I won't object to you implementing
> it. However, I never tested whether a connected UDP socket will give
> the desired semantics without extra code in the UDP transport (for
> example, an ->sk_error callback). I don't think it's worth the hassle
> if we have to add code to UDP that only this tiny use case would need.
>

OK. I'll set these patches aside until I have time to look into adding
connected UDP support.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2008-07-07 21:19:29

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Mon, Jul 07, 2008 at 04:51:17PM -0400, Trond Myklebust wrote:
> On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
>
> > If you would like connected UDP, I won't object to you implementing
> > it. However, I never tested whether a connected UDP socket will give
> > the desired semantics without extra code in the UDP transport (for
> > example, an ->sk_error callback). I don't think it's worth the hassle
> > if we have to add code to UDP that only this tiny use case would need.
> >
>
> OK. I'll set these patches aside until I have time to look into adding
> connected UDP support.

I'm curious--why weren't you convinced by this argument?:

"You're talking about the difference between supporting say 1358
mounts at boot time versus 1357 mounts at boot time.

"In most cases, a client with hundreds of mounts will use up
exactly one extra privileged TCP port to register NLM during the
first lockd_up() call. If these are all NFSv4 mounts, it will
use exactly zero extra ports, since the NFSv4 callback service
is not even registered.

"Considering that _each_ mount operation can take between 2 and
5 privileged ports, while registering NFSD and NLM both would
take exactly two ports at boot time, I think that registration
is wrong place to optimize."

I'll admit to not following this carefully, but that seemed reasonable
to me.

--b.

2008-07-07 22:13:44

by Trond Myklebust

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Mon, 2008-07-07 at 17:19 -0400, J. Bruce Fields wrote:
> On Mon, Jul 07, 2008 at 04:51:17PM -0400, Trond Myklebust wrote:
> > On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
> >
> > > If you would like connected UDP, I won't object to you implementing
> > > it. However, I never tested whether a connected UDP socket will give
> > > the desired semantics without extra code in the UDP transport (for
> > > example, an ->sk_error callback). I don't think it's worth the hassle
> > > if we have to add code to UDP that only this tiny use case would need.
> > >
> >
> > OK. I'll set these patches aside until I have time to look into adding
> > connected UDP support.
>
> I'm curious--why weren't you convinced by this argument?:
>
> "You're talking about the difference between supporting say 1358
> mounts at boot time versus 1357 mounts at boot time.

Where did you get those figures from? Firstly, the total number of
privileged ports is much smaller. Secondly, the number of _free_
privileged ports can vary wildly depending on the user's setup.

> "In most cases, a client with hundreds of mounts will use up
> exactly one extra privileged TCP port to register NLM during the
> first lockd_up() call. If these are all NFSv4 mounts, it will
> use exactly zero extra ports, since the NFSv4 callback service
> is not even registered.
>
> "Considering that _each_ mount operation can take between 2 and
> 5 privileged ports, while registering NFSD and NLM both would
> take exactly two ports at boot time, I think that registration
> is wrong place to optimize."
>
> I'll admit to not following this carefully, but that seemed reasonable
> to me.

Like it or not, this _is_ a user interface change: if someone has set up
their iptables firewall or is using the tcp_wrappers library to limit
access to the portmapper (a common practice), then this change is
forcing them to change that.

It is not as if UDP connections are prohibitively difficult to implement
either. The entire framework is already there for the TCP case, and so
the following patch (as yet untested) should be close...


--------------------------------------------------------------------------
commit 161c60bc13899b0def4251cffa492ae6faa00b93
Author: Trond Myklebust <[email protected]>
Date: Mon Jul 7 17:43:12 2008 -0400

SUNRPC: Add connected sockets for UDP

Signed-off-by: Trond Myklebust <[email protected]>

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4486c59..2e49f5a 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -580,8 +580,8 @@ static int xs_udp_send_request(struct rpc_task *task)
req->rq_svec->iov_len);

status = xs_sendpages(transport->sock,
- xs_addr(xprt),
- xprt->addrlen, xdr,
+ NULL,
+ 0, xdr,
req->rq_bytes_sent);

dprintk("RPC: xs_udp_send_request(%u) = %d\n",
@@ -1445,13 +1445,13 @@ static inline void xs_reclassify_socket6(struct socket *sock)
}
#endif

-static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
+static int xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
{
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ struct sock *sk = sock->sk;
+ int ret;

if (!transport->inet) {
- struct sock *sk = sock->sk;
-
write_lock_bh(&sk->sk_callback_lock);

sk->sk_user_data = xprt;
@@ -1463,8 +1463,6 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
sk->sk_no_check = UDP_CSUM_NORCV;
sk->sk_allocation = GFP_ATOMIC;

- xprt_set_connected(xprt);
-
/* Reset to new socket */
transport->sock = sock;
transport->inet = sk;
@@ -1472,6 +1470,39 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
write_unlock_bh(&sk->sk_callback_lock);
}
xs_udp_do_set_buffer_size(xprt);
+ ret = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, 0);
+
+ if (ret == 0) {
+ spin_lock_bh(&xprt->transport_lock);
+ if (sk->sk_state == TCP_ESTABLISHED)
+ xprt_set_connected(xprt);
+ spin_unlock_bh(&xprt->transport_lock);
+ }
+ return ret;
+}
+
+/*
+ * We need to preserve the port number so the reply cache on the server can
+ * find our cached RPC replies when we get around to reconnecting.
+ */
+static void xs_sock_reuse_connection(struct rpc_xprt *xprt)
+{
+ int result;
+ struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ struct sockaddr any;
+
+ dprintk("RPC: disconnecting xprt %p to reuse port\n", xprt);
+
+ /*
+ * Disconnect the transport socket by doing a connect operation
+ * with AF_UNSPEC. This should return immediately...
+ */
+ memset(&any, 0, sizeof(any));
+ any.sa_family = AF_UNSPEC;
+ result = kernel_connect(transport->sock, &any, sizeof(any), 0);
+ if (result)
+ dprintk("RPC: AF_UNSPEC connect return code %d\n",
+ result);
}

/**
@@ -1491,25 +1522,35 @@ static void xs_udp_connect_worker4(struct work_struct *work)
if (xprt->shutdown || !xprt_bound(xprt))
goto out;

- /* Start by resetting any existing state */
- xs_close(xprt);
-
- if ((err = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
- dprintk("RPC: can't create UDP transport socket (%d).\n", -err);
- goto out;
- }
- xs_reclassify_socket4(sock);
+ if (!sock) {
+ if ((err = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
+ dprintk("RPC: can't create UDP transport socket (%d).\n", -err);
+ goto out;
+ }
+ xs_reclassify_socket4(sock);

- if (xs_bind4(transport, sock)) {
- sock_release(sock);
- goto out;
- }
+ if (xs_bind4(transport, sock)) {
+ sock_release(sock);
+ goto out;
+ }
+ } else
+ xs_sock_reuse_connection(xprt);

dprintk("RPC: worker connecting xprt %p to address: %s\n",
xprt, xprt->address_strings[RPC_DISPLAY_ALL]);

- xs_udp_finish_connecting(xprt, sock);
- status = 0;
+ status = xs_udp_finish_connecting(xprt, sock);
+ if (status < 0) {
+ switch (status) {
+ case -ECONNREFUSED:
+ case -ECONNRESET:
+ /* retry with existing socket, after a delay */
+ break;
+ default:
+ /* get rid of existing socket, and retry */
+ xs_close(xprt);
+ }
+ }
out:
xprt_wake_pending_tasks(xprt, status);
xprt_clear_connecting(xprt);
@@ -1532,54 +1573,40 @@ static void xs_udp_connect_worker6(struct work_struct *work)
if (xprt->shutdown || !xprt_bound(xprt))
goto out;

- /* Start by resetting any existing state */
- xs_close(xprt);
-
- if ((err = sock_create_kern(PF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
- dprintk("RPC: can't create UDP transport socket (%d).\n", -err);
- goto out;
- }
- xs_reclassify_socket6(sock);
+ if (!sock) {
+ if ((err = sock_create_kern(PF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
+ dprintk("RPC: can't create UDP transport socket (%d).\n", -err);
+ goto out;
+ }
+ xs_reclassify_socket6(sock);

- if (xs_bind6(transport, sock) < 0) {
- sock_release(sock);
- goto out;
- }
+ if (xs_bind6(transport, sock) < 0) {
+ sock_release(sock);
+ goto out;
+ }
+ } else
+ xs_sock_reuse_connection(xprt);

dprintk("RPC: worker connecting xprt %p to address: %s\n",
xprt, xprt->address_strings[RPC_DISPLAY_ALL]);

- xs_udp_finish_connecting(xprt, sock);
- status = 0;
+ status = xs_udp_finish_connecting(xprt, sock);
+ if (status < 0) {
+ switch (status) {
+ case -ECONNREFUSED:
+ case -ECONNRESET:
+ /* retry with existing socket, after a delay */
+ break;
+ default:
+ /* get rid of existing socket, and retry */
+ xs_close(xprt);
+ }
+ }
out:
xprt_wake_pending_tasks(xprt, status);
xprt_clear_connecting(xprt);
}

-/*
- * We need to preserve the port number so the reply cache on the server can
- * find our cached RPC replies when we get around to reconnecting.
- */
-static void xs_tcp_reuse_connection(struct rpc_xprt *xprt)
-{
- int result;
- struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
- struct sockaddr any;
-
- dprintk("RPC: disconnecting xprt %p to reuse port\n", xprt);
-
- /*
- * Disconnect the transport socket by doing a connect operation
- * with AF_UNSPEC. This should return immediately...
- */
- memset(&any, 0, sizeof(any));
- any.sa_family = AF_UNSPEC;
- result = kernel_connect(transport->sock, &any, sizeof(any), 0);
- if (result)
- dprintk("RPC: AF_UNSPEC connect return code %d\n",
- result);
-}
-
static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
{
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
@@ -1650,7 +1677,7 @@ static void xs_tcp_connect_worker4(struct work_struct *work)
}
} else
/* "close" the socket, preserving the local port */
- xs_tcp_reuse_connection(xprt);
+ xs_sock_reuse_connection(xprt);

dprintk("RPC: worker connecting xprt %p to address: %s\n",
xprt, xprt->address_strings[RPC_DISPLAY_ALL]);
@@ -1710,7 +1737,7 @@ static void xs_tcp_connect_worker6(struct work_struct *work)
}
} else
/* "close" the socket, preserving the local port */
- xs_tcp_reuse_connection(xprt);
+ xs_sock_reuse_connection(xprt);

dprintk("RPC: worker connecting xprt %p to address: %s\n",
xprt, xprt->address_strings[RPC_DISPLAY_ALL]);


--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2008-07-07 22:56:51

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Mon, Jul 07, 2008 at 06:13:42PM -0400, Trond Myklebust wrote:
> On Mon, 2008-07-07 at 17:19 -0400, J. Bruce Fields wrote:
> > On Mon, Jul 07, 2008 at 04:51:17PM -0400, Trond Myklebust wrote:
> > > On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
> > >
> > > > If you would like connected UDP, I won't object to you implementing
> > > > it. However, I never tested whether a connected UDP socket will give
> > > > the desired semantics without extra code in the UDP transport (for
> > > > example, an ->sk_error callback). I don't think it's worth the hassle
> > > > if we have to add code to UDP that only this tiny use case would need.
> > > >
> > >
> > > OK. I'll set these patches aside until I have time to look into adding
> > > connected UDP support.
> >
> > I'm curious--why weren't you convinced by this argument?:
> >
> > "You're talking about the difference between supporting say 1358
> > mounts at boot time versus 1357 mounts at boot time.
>
> Where did you get those figures from? Firstly, the total number of
> privileged ports is much smaller. Secondly, the number of _free_
> privileged ports can vary wildly depending on the user's setup.

So by default (from min/max_resvport and Chuck's "between 2 and 5"
estimate of privileged ports per mount) you'd get (1024-665)/(2 to 5)
mounts, so between 71 and 179 mounts, not taking into account what else
they might be used for. So that's a bit closer to the point where 1
port plus or minus might make a difference, OK.

> > "In most cases, a client with hundreds of mounts will use up
> > exactly one extra privileged TCP port to register NLM during the
> > first lockd_up() call. If these are all NFSv4 mounts, it will
> > use exactly zero extra ports, since the NFSv4 callback service
> > is not even registered.
> >
> > "Considering that _each_ mount operation can take between 2 and
> > 5 privileged ports, while registering NFSD and NLM both would
> > take exactly two ports at boot time, I think that registration
> > is wrong place to optimize."
> >
> > I'll admit to not following this carefully, but that seemed reasonable
> > to me.
>
> Like it or not, this _is_ a user interface change: if someone has set up
> their iptables firewall or is using the tcp_wrappers library to limit
> access to the portmapper (a common practice), then this change is
> forcing them to change that.

Yeah, I can buy that. OK, thanks for the explanation.

--b.

2008-07-08 01:56:31

by Chuck Lever

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Mon, Jul 7, 2008 at 6:56 PM, J. Bruce Fields <[email protected]> wrote:
> On Mon, Jul 07, 2008 at 06:13:42PM -0400, Trond Myklebust wrote:
>> On Mon, 2008-07-07 at 17:19 -0400, J. Bruce Fields wrote:
>> > On Mon, Jul 07, 2008 at 04:51:17PM -0400, Trond Myklebust wrote:
>> > > On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
>> > >
>> > > > If you would like connected UDP, I won't object to you implementing
>> > > > it. However, I never tested whether a connected UDP socket will give
>> > > > the desired semantics without extra code in the UDP transport (for
>> > > > example, an ->sk_error callback). I don't think it's worth the hassle
>> > > > if we have to add code to UDP that only this tiny use case would need.
>> > > >
>> > >
>> > > OK. I'll set these patches aside until I have time to look into adding
>> > > connected UDP support.
>> >
>> > I'm curious--why weren't you convinced by this argument?:
>> >
>> > "You're talking about the difference between supporting say 1358
>> > mounts at boot time versus 1357 mounts at boot time.
>>
>> Where did you get those figures from?

It's just an example.

>> Firstly, the total number of
>> privileged ports is much smaller. Secondly, the number of _free_
>> privileged ports can vary wildly depending on the user's setup.

That's my point. An admin could _maybe_ get one more NFS mount if she
disables a network service above port 665. Are we going to stop users
from enabling non-NFS network services that might use an extra
privileged port?

The right way to fix this is to fix the way Linux does NFS mounts. It
would be nice if we had a transport capability for performing RPCs
against multiple remote servers from one datagram socket (not
broadcast RPC, but simply sharing one socket amongst multiple datagram
RPC consumers on the client).

This kind of transport would definitely help the mount port limitation
-- the kernel mount client and the kernel rpcbind client could do all
their UDP-based traffic through a single privileged socket. Would it
be hard to share an rpc_xprt between several rpc_clients for a group
of low-volume RPC services?
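
At the socket level, the datagram half of this is already easy: one
unconnected socket can carry traffic to any number of peers, so the
work would mostly be in the RPC layer's reply demultiplexing and in
the port reservation logic. A rough userspace sketch of the socket
usage (untested; the addresses are made up):

/* Untested sketch: one unconnected datagram socket shared by
 * requests to two different servers.  192.0.2.x is the made-up
 * documentation range; 111 is the rpcbind port. */
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static void set_peer(struct sockaddr_in *sin, const char *ip)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_port = htons(111);
	inet_pton(AF_INET, ip, &sin->sin_addr);
}

int main(void)
{
	struct sockaddr_in a, b, from;
	socklen_t fromlen = sizeof(from);
	char reply[512];
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	set_peer(&a, "192.0.2.1");	/* server A */
	set_peer(&b, "192.0.2.2");	/* server B */

	/* both requests leave from the same local port... */
	sendto(fd, "req-a", 5, 0, (struct sockaddr *)&a, sizeof(a));
	sendto(fd, "req-b", 5, 0, (struct sockaddr *)&b, sizeof(b));

	/* ...and replies are told apart by their source address (and,
	 * for real RPC, by XID); that is the demultiplexing the RPC
	 * layer would have to take on */
	recvfrom(fd, reply, sizeof(reply), 0,
		 (struct sockaddr *)&from, &fromlen);
	return 0;
}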

The RPC client could also provide a small pool of stream transport
sockets for the same purpose. They can stay connected to a remote
until they time out, or until another local consumer needs to send to
a unique unconnected remote and there aren't any free sockets in the
pool.

Nfs-utils uses the whole range of privileged ports, taking care to
avoid the well-known ports of long-lived services. The kernel should
be able to do the same for its short-lived sockets, such as the ones
used for MNT or rpcbind requests.
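
The userspace idiom for this is a bindresvport(3)-style loop; roughly
(untested sketch, and the skip list is purely illustrative; the real
nfs-utils logic differs in detail):

/* Untested sketch of a bindresvport(3)-style loop: walk the
 * privileged range from the top, skipping a few ports that
 * long-lived, well-known services are likely to claim. */
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int is_well_known(int port)
{
	/* illustrative skip list only */
	return port == 873 || port == 993 || port == 995;
}

/* bind fd to some free privileged port; returns 0 on success */
int bind_reserved(int fd)
{
	struct sockaddr_in sin;
	int port;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);

	for (port = 1023; port >= 512; port--) {
		if (is_well_known(port))
			continue;
		sin.sin_port = htons(port);
		if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == 0)
			return 0;
		if (errno != EADDRINUSE)
			break;
	}
	return -1;
}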

We could even force the RPC service registration code to use a low
numbered privileged port to avoid the ports normally used by
long-lived RPC sockets.

> So by default (from min/max_resvport and Chuck's "between 2 and 5"
> estimate of privileged ports per mount) you'd get (1024-665)/(2 to 5)
> mounts, so between 71 and 179 mounts, not taking into account what else
> they might be used for. So that's a bit closer to the point where 1
> port plus or minus might make a difference, OK.

Doing a mount takes some time. If each mount takes as long as a
second to complete, for example, that gives enough time for one or
more TCP ports used and abandoned by the first mounts to become
available again for later ones. If some of the ports are UDP and some
TCP, that means even more ports are available, since the port range is
effectively twice as large, and only the TCP ports remain in
TIME_WAIT. A MNT request, for instance, goes over UDP by default.

So it's not as simple as this calculation implies -- it depends on
timing, other enabled non-NFS services, and which transports are used
for NFS. One extra port is noise.

In many cases someone who needs more than a hundred or so mounts can
use unprivileged source ports, as Trond suggested on this list last
week.

>> > "In most cases, a client with hundreds of mounts will use up
>> > exactly one extra privileged TCP port to register NLM during the
>> > first lockd_up() call. If these are all NFSv4 mounts, it will
>> > use exactly zero extra ports, since the NFSv4 callback service
>> > is not even registered.
>> >
>> > "Considering that _each_ mount operation can take between 2 and
>> > 5 privileged ports, while registering NFSD and NLM both would
>> > take exactly two ports at boot time, I think that registration
>> > is wrong place to optimize."
>> >
>> > I'll admit to not following this carefully, but that seemed reasonable
>> > to me.
>>
>> Like it or not, this _is_ a user interface change: if someone has set up
>> their iptables firewall or is using the tcp_wrappers library to limit
>> access to the portmapper (a common practice), then this change is
>> forcing them to change that.

It would have been nice to hear such a specific complaint months ago
when I first suggested doing this on linux-nfs.

>> It is not as if UDP connections are prohibitively difficult to implement
>> either. The entire framework is already there for the TCP case, and so
>> the following patch (as yet untested) should be close...

I don't think CUDP is hard to implement or a bad idea. It just seemed
like a lot to do for a simple change that can be accomplished with an
existing transport implementation.

>> --------------------------------------------------------------------------
>> commit 161c60bc13899b0def4251cffa492ae6faa00b93
>> Author: Trond Myklebust <[email protected]>
>> Date: Mon Jul 7 17:43:12 2008 -0400
>>
>> SUNRPC: Add connected sockets for UDP

I haven't looked closely at this in a while, but datagram socket
disconnection events may be reported to kernel datagram sockets via
->sk_error(), and not via an error code from sendmsg() or connect().
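
At least from userspace, the refusal shows up asynchronously:
connect(2) on a datagram socket succeeds immediately, the first send
usually goes through, and the ICMP port unreachable is reported as
ECONNREFUSED only on a later send or receive. An untested sketch of
the behavior in question:

/* Untested sketch: connected-UDP error reporting as seen from
 * userspace.  Assumes nothing is listening on 127.0.0.1:45678. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in sin;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(45678);		/* assumed unused */
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* succeeds even though no one is listening; it only
	 * records the peer address */
	if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		perror("connect");

	/* the first send typically succeeds; the ICMP error
	 * comes back asynchronously... */
	if (send(fd, "x", 1, 0) < 0)
		perror("send 1");
	sleep(1);

	/* ...and is surfaced on the next call */
	if (send(fd, "x", 1, 0) < 0)
		perror("send 2");	/* expect: Connection refused */

	close(fd);
	return 0;
}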

Also, do we need a set of shorter reconnect timeout values especially
for UDP sockets?

Is there any possibility that UDP-based RPC servers may see a difference?

A supplemental change might be to make the kernel's mount and rpcbind
clients use RPC_TASK_ONESHOT or whatever we call it.

>> Signed-off-by: Trond Myklebust <[email protected]>
>>
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 4486c59..2e49f5a 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -580,8 +580,8 @@ static int xs_udp_send_request(struct rpc_task *task)
>> req->rq_svec->iov_len);
>>
>> status = xs_sendpages(transport->sock,
>> - xs_addr(xprt),
>> - xprt->addrlen, xdr,
>> + NULL,
>> + 0, xdr,
>> req->rq_bytes_sent);

This is the only place an address and length are passed to
xs_sendpages(), so you could eliminate these arguments and simplify
xs_sendpages() as a result.

However, if we want both a connected UDP capability and a shared UDP
socket capability, we may need to keep these arguments, and make the
UDP connection logic added below conditional.

>> dprintk("RPC: xs_udp_send_request(%u) = %d\n",
>> @@ -1445,13 +1445,13 @@ static inline void xs_reclassify_socket6(struct socket *sock)
>> }
>> #endif
>>
>> -static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
>> +static int xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
>> {
>> struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
>> + struct sock *sk = sock->sk;
>> + int ret;
>>
>> if (!transport->inet) {
>> - struct sock *sk = sock->sk;
>> -
>> write_lock_bh(&sk->sk_callback_lock);
>>
>> sk->sk_user_data = xprt;
>> @@ -1463,8 +1463,6 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
>> sk->sk_no_check = UDP_CSUM_NORCV;
>> sk->sk_allocation = GFP_ATOMIC;
>>
>> - xprt_set_connected(xprt);
>> -
>> /* Reset to new socket */
>> transport->sock = sock;
>> transport->inet = sk;
>> @@ -1472,6 +1470,39 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
>> write_unlock_bh(&sk->sk_callback_lock);
>> }
>> xs_udp_do_set_buffer_size(xprt);
>> + ret = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, 0);
>> +
>> + if (ret == 0) {
>> + spin_lock_bh(&xprt->transport_lock);
>> + if (sk->sk_state == TCP_ESTABLISHED)
>> + xprt_set_connected(xprt);
>> + spin_unlock_bh(&xprt->transport_lock);
>> + }
>> + return ret;
>> +}
>> +
>> +/*
>> + * We need to preserve the port number so the reply cache on the server can
>> + * find our cached RPC replies when we get around to reconnecting.
>> + */
>> +static void xs_sock_reuse_connection(struct rpc_xprt *xprt)
>> +{
>> + int result;
>> + struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
>> + struct sockaddr any;
>> +
>> + dprintk("RPC: disconnecting xprt %p to reuse port\n", xprt);
>> +
>> + /*
>> + * Disconnect the transport socket by doing a connect operation
>> + * with AF_UNSPEC. This should return immediately...
>> + */
>> + memset(&any, 0, sizeof(any));
>> + any.sa_family = AF_UNSPEC;
>> + result = kernel_connect(transport->sock, &any, sizeof(any), 0);
>> + if (result)
>> + dprintk("RPC: AF_UNSPEC connect return code %d\n",
>> + result);
>> }
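
This helper should carry over to UDP cleanly, at least as far as the
socket layer goes: connecting a datagram socket with an AF_UNSPEC
address dissolves the association but leaves the local binding, and
therefore the port, alone. A userspace approximation (untested; the
peer address is made up):

/* Untested sketch: dissolve a UDP association with AF_UNSPEC and
 * show that the autobound local port survives. */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static void show_local_port(int fd)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	getsockname(fd, (struct sockaddr *)&sin, &len);
	printf("local port: %d\n", ntohs(sin.sin_port));
}

int main(void)
{
	struct sockaddr_in peer;
	struct sockaddr any;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(111);
	inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);
	connect(fd, (struct sockaddr *)&peer, sizeof(peer));
	show_local_port(fd);	/* port autobound at connect time */

	/* the same trick the patch performs via kernel_connect() */
	memset(&any, 0, sizeof(any));
	any.sa_family = AF_UNSPEC;
	connect(fd, &any, sizeof(any));
	show_local_port(fd);	/* expect: same port as before */
	return 0;
}
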
--
Chuck Lever

2008-07-10 17:29:09

by Chuck Lever

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27


On Jul 7, 2008, at 4:51 PM, Trond Myklebust wrote:

> On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
>
>> If you would like connected UDP, I won't object to you implementing
>> it. However, I never tested whether a connected UDP socket will give
>> the desired semantics without extra code in the UDP transport (for
>> example, an ->sk_error callback). I don't think it's worth the
>> hassle
>> if we have to add code to UDP that only this tiny use case would
>> need.
>>
>
> OK. I'll set these patches aside until I have time to look into adding
> connected UDP support.

That's not completely necessary... the one-shot + TCP changes just
make it nicer when the local rpcbind is not listening. Without these,
the cases where the rpcbind daemon isn't running, or doesn't support
rpcbind v3/v4 and the kernel was built with CONFIG_SUNRPC_REGISTER_V4,
will cause some delays before failing, but otherwise shouldn't be a
problem.

I think you can drop the patch to change rpcb registration to go over
TCP for now unless you already have a CUDP implementation you are
happy with.

I know a lot of folks are waiting for IPv6 support to appear, and I
don't want this detail to hang it up.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-07-11 18:41:08

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Thu, Jul 10, 2008 at 01:27:47PM -0400, Chuck Lever wrote:
>
> On Jul 7, 2008, at 4:51 PM, Trond Myklebust wrote:
>
>> On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
>>
>>> If you would like connected UDP, I won't object to you implementing
>>> it. However, I never tested whether a connected UDP socket will give
>>> the desired semantics without extra code in the UDP transport (for
>>> example, an ->sk_error callback). I don't think it's worth the
>>> hassle
>>> if we have to add code to UDP that only this tiny use case would
>>> need.
>>>
>>
>> OK. I'll set these patches aside until I have time to look into adding
>> connected UDP support.
>
> That's not completely necessary... the one-shot + TCP changes just make
> it nicer when the local rpcbind is not listening. Without these, the
> cases where the rpcbind daemon isn't running, or doesn't support rpcbind
> v3/v4 and the kernel was built with CONFIG_SUNRPC_REGISTER_V4, will cause
> some delays before failing, but otherwise shouldn't be a problem.
>
> I think you can drop the patch to change rpcb registration to go over
> TCP for now unless you already have a CUDP implementation you are happy
> with.

So actually in your original series of 7 I think that'd mean dropping
numbers 5 and 6 and keeping the rest?

I've lost track of the status of the 3 series you submitted on the 30th:

"Remaining rpcbind patches for 2.6.27"
- this one, probably ready after dropping 2 patches
"rpcbind v4 support in net/sunrpc/svc*"
- Do you still want this considered for 2.6.27?
"NLM clean-ups for IPv6 support"
- I think you were saying there's still a bug being
tracked down in this series?

--b.

2008-07-11 19:11:33

by Chuck Lever

Subject: Re: [PATCH 0/7] Remaining rpcbind patches for 2.6.27

On Fri, Jul 11, 2008 at 2:40 PM, J. Bruce Fields <[email protected]> wrote:
> On Thu, Jul 10, 2008 at 01:27:47PM -0400, Chuck Lever wrote:
>>
>> On Jul 7, 2008, at 4:51 PM, Trond Myklebust wrote:
>>
>>> On Mon, 2008-07-07 at 15:44 -0400, Chuck Lever wrote:
>>>
>>>> If you would like connected UDP, I won't object to you implementing
>>>> it. However, I never tested whether a connected UDP socket will give
>>>> the desired semantics without extra code in the UDP transport (for
>>>> example, an ->sk_error callback). I don't think it's worth the
>>>> hassle
>>>> if we have to add code to UDP that only this tiny use case would
>>>> need.
>>>>
>>>
>>> OK. I'll set these patches aside until I have time to look into adding
>>> connected UDP support.
>>
>> That's not completely necessary... the one-shot + TCP changes just make
>> it nicer when the local rpcbind is not listening. Without these, the
>> cases where the rpcbind daemon isn't running, or doesn't support rpcbind
>> v3/v4 and the kernel was built with CONFIG_SUNRPC_REGISTER_V4, will cause
>> some delays before failing, but otherwise shouldn't be a problem.
>>
>> I think you can drop the patch to change rpcb registration to go over
>> TCP for now unless you already have a CUDP implementation you are happy
>> with.
>
> So actually in your original series of 7 I think that'd mean dropping
> numbers 5 and 6 and keeping the rest?

So, 5/7 adds "one shot" support to the RPC client. I think that might
be interesting for other kernel services, like making rpcbind queries
over TCP, or NFSv4 callback. I'd like to advocate for keeping that
one so others can build on it (with whatever name for the create flag
we can agree on), but it's not really necessary for subsequent
patches.
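
To make that concrete, a consumer of 5/7 might look something like
this (hypothetical sketch: RPC_CLNT_CREATE_ONESHOT is a placeholder
for whatever name we settle on, and the rpcbind-specific fields are
written as they might appear inside rpcb_clnt.c):

/* Hypothetical sketch only: a "one shot" client for a single
 * registration transaction.  RPC_CLNT_CREATE_ONESHOT is a made-up
 * placeholder name; rpcb_program is the program definition that
 * already lives in net/sunrpc/rpcb_clnt.c. */
static struct rpc_clnt *rpcb_oneshot_client(struct sockaddr *sap,
					    size_t salen)
{
	struct rpc_create_args args = {
		.protocol	= XPRT_TRANSPORT_TCP,
		.address	= sap,
		.addrsize	= salen,
		.servername	= "localhost",
		.program	= &rpcb_program,
		.version	= 4,	/* rpcbind v4 */
		.authflavor	= RPC_AUTH_UNIX,
		.flags		= RPC_CLNT_CREATE_ONESHOT,
	};

	return rpc_create(&args);
}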

6/7 changes the rpcb_register logic to use "one shot" + TCP -- that's
the one that is controversial and can be dropped.

> I've lost track of the status of the 3 series you submitted on the 30th:
>
> "Remaining rpcbind patches for 2.6.27"
> - this one, probably ready after dropping 2 patches

Yes.

> "rpcbind v4 support in net/sunrpc/svc*"
> - Do you still want this considered for 2.6.27?

Yes, please.

The default CONFIG setting added in this patch set should cause the
kernel to continue using portmap instead of rpcbindv3/v4 for RPC
service registration, so by default there shouldn't be any change in
behavior.

> "NLM clean-ups for IPv6 support"
> - I think you were saying there's still a bug being
> tracked down in this series?

There are probably a few bugs in this series, but I'd still like to
consider it for 2.6.27. I think the architecture that is laid out
here is pretty solid, and we will have time to exercise this and get
it right in linux-next and during .27's -rc period. Even if the whole
series is not included, I think there are good cleanups here that
should be solid enough to include.

The bug I mentioned last night is with lock recovery with NFSv3 over
IPv4, so it's not something that prevents NFSv2/v3 from working in
general. We haven't had a decent lock recovery test until just
recently.

Can we assume this is going in for now, and start the review and
integration process? I've already made a few minor changes in this
series since I posted these, so I'm sure I will have to post at least
one refresh. But it would be useful to review this series carefully
now even if some or all of it is not going into 2.6.27.

Thanks for asking!

--
Chuck Lever