Unregistering an RPC service is not essential - but it is tidy.
So it is unpleasant to wait a long time for it to complete.
As unregistering is always directed to localhost, the most likely
reason for any delay is the portmap (or rpcbind) is not running.
In that case, waiting is totally pointless. In any case, we should
expect a quick response, and zero packet loss.
So reduce the timeouts to a total of half a second.
This means that if we try to stop nfsd while portmap is not running,
it takes 3 seconds instead of 3.5 minutes.
Note that stopping portmap before the last nfsd thread has died is
not simply a configuration error. When we kill nfsd threads they
could take a while to die, and it is possible that portmap could
then be killed before the last nfsd thread has had a chance to finish
up.
[An alternate might be to make the sunrpc code always "connect"
udp sockets so that "port not reachable" errors would get reported
back. This requires a more intrusive change though and might have
other consequences]
Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/rpcb_clnt.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index beee6da..31f7e2e 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -132,7 +132,8 @@ static const struct sockaddr_in rpcb_inaddr_loopback = {
};
static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
- size_t addrlen, u32 version)
+ size_t addrlen, u32 version,
+ int unregister)
{
struct rpc_create_args args = {
.protocol = XPRT_TRANSPORT_UDP,
@@ -144,6 +145,16 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
.authflavor = RPC_AUTH_UNIX,
.flags = RPC_CLNT_CREATE_NOPING,
};
+ const struct rpc_timeout unreg_timeout = {
+ .to_initval = HZ/2,
+ .to_maxval = HZ/2,
+ .to_increment = 0,
+ .to_retries = 0,
+ .to_exponential = 0,
+ };
+
+ if (unregister)
+ args.timeout = &unreg_timeout;
return rpc_create(&args);
}
@@ -183,10 +194,12 @@ static int rpcb_register_call(const u32 version, struct rpc_message *msg)
size_t addrlen = sizeof(rpcb_inaddr_loopback);
struct rpc_clnt *rpcb_clnt;
int result, error = 0;
+ int unregister;
msg->rpc_resp = &result;
- rpcb_clnt = rpcb_create_local(addr, addrlen, version);
+ unregister = (msg->rpc_proc->p_proc == RPCBPROC_UNSET);
+ rpcb_clnt = rpcb_create_local(addr, addrlen, version, unregister);
if (!IS_ERR(rpcb_clnt)) {
error = rpc_call_sync(rpcb_clnt, msg, 0);
rpc_shutdown_client(rpcb_clnt);
At 02:33 AM 5/28/2009, NeilBrown wrote:
>Unregistering an RPC service is not essential - but it is tidy.
>So it is unpleasant to wait a long time for it to complete.
>
>As unregistering is always directed to localhost, the most likely
>reason for any delay is the portmap (or rpcbind) is not running.
>In that case, waiting is totally pointless. In any case, we should
>expect a quick response, and zero packet loss.
>
>So reduce the timeouts to a total of half a second.
Why wait for the response at all in this case? With zero retries
and nothing to do with the result, it seems pointless to even wait
for half a second.
Tom.
On May 28, 2009, at 2:33 AM, NeilBrown wrote:
> Unregistering an RPC service is not essential - but it is tidy.
> So it is unpleasant to wait a long time for it to complete.
>
> As unregistering is always directed to localhost, the most likely
> reason for any delay is the portmap (or rpcbind) is not running.
> In that case, waiting is totally pointless. In any case, we should
> expect a quick response, and zero packet loss.
>
> So reduce the timeouts to a total of half a second.
A much better solution here would be to use a connected UDP socket.
This would make start-up in the absence of rpcbind/portmap fail much
more quickly, let us keep the longer, safer timeout value, and let the
kernel warn about exactly what the problem is. It would also have
similar benefits for the kernel's mount and NSM clients, and when
starting up services. It would additionally make UDP RPC ping
checking fail very quickly.
> This means that if we try to stop nfsd while portmap is not running,
> it takes 3 seconds instead of 3.5 minutes.
>
> Note that stopping portmap before the last nfsd thread has died is
> not simply a configuration error. When we kill nfsd threads they
> could take a while to die, and it is possible that portmap could
> then be killed before the last nfsd thread has had a chance to finish
> up.
>
> [An alternate might be to make the sunrpc code always "connect"
> udp sockets so that "port not reachable" errors would get reported
> back. This requires a more intrusive change though and might have
> other consequences]
We had discussed this about a year ago when I started adding IPv6
support. I had suggested switching the local rpc client to use TCP
instead of UDP to solve exactly this time-out problem during start-
up. There was some resistance to the idea because TCP would leave
privileged ports in TIMEWAIT (at shutdown, this is probably not a
significant concern).
Trond had intended to introduce connected UDP socket support to the
RPC client, although we were also interested in someday having a
single UDP socket for all RPC traffic... the design never moved on
from there.
My feeling at this point is that having a connected UDP socket
transport would be simpler and have broader benefits than waiting for
an eventual design that can accommodate multiple transport instances
sharing a single socket.
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> net/sunrpc/rpcb_clnt.c | 17 +++++++++++++++--
> 1 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
> index beee6da..31f7e2e 100644
> --- a/net/sunrpc/rpcb_clnt.c
> +++ b/net/sunrpc/rpcb_clnt.c
> @@ -132,7 +132,8 @@ static const struct sockaddr_in rpcb_inaddr_loopback = {
> };
>
> static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
> - size_t addrlen, u32 version)
> + size_t addrlen, u32 version,
> + int unregister)
> {
It might be cleaner to define the timeout structures as global
statics, and pass in a pointer to the preferred structure instead of
checking a passed-in integer. One conditional branch instead of two
(see the sketch after the quoted hunks below).
> struct rpc_create_args args = {
> .protocol = XPRT_TRANSPORT_UDP,
> @@ -144,6 +145,16 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
> .authflavor = RPC_AUTH_UNIX,
> .flags = RPC_CLNT_CREATE_NOPING,
> };
> + const struct rpc_timeout unreg_timeout = {
> + .to_initval = HZ/2,
> + .to_maxval = HZ/2,
> + .to_increment = 0,
> + .to_retries = 0,
> + .to_exponential = 0,
> + };
> +
> + if (unregister)
> + args.timeout = &unreg_timeout;
>
> return rpc_create(&args);
> }
> @@ -183,10 +194,12 @@ static int rpcb_register_call(const u32 version, struct rpc_message *msg)
> size_t addrlen = sizeof(rpcb_inaddr_loopback);
> struct rpc_clnt *rpcb_clnt;
> int result, error = 0;
> + int unregister;
>
> msg->rpc_resp = &result;
>
> - rpcb_clnt = rpcb_create_local(addr, addrlen, version);
> + unregister = (msg->rpc_proc->p_proc == RPCBPROC_UNSET);
> + rpcb_clnt = rpcb_create_local(addr, addrlen, version, unregister);
> if (!IS_ERR(rpcb_clnt)) {
> error = rpc_call_sync(rpcb_clnt, msg, 0);
> rpc_shutdown_client(rpcb_clnt);
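For concreteness, here is a minimal sketch of that suggestion. It is
illustrative only and not part of the posted patch: the timeout table
becomes a file-scope static, rpcb_create_local() takes a struct
rpc_timeout pointer, and passing NULL keeps the transport's default.

static const struct rpc_timeout rpcb_unreg_timeout = {
        .to_initval     = HZ/2,
        .to_maxval      = HZ/2,
        .to_retries     = 0,
};

static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
                                          size_t addrlen, u32 version,
                                          const struct rpc_timeout *timeout)
{
        struct rpc_create_args args = {
                .protocol       = XPRT_TRANSPORT_UDP,
                .address        = addr,
                .addrsize       = addrlen,
                .servername     = "localhost",
                .program        = &rpcb_program,
                .version        = version,
                .authflavor     = RPC_AUTH_UNIX,
                .flags          = RPC_CLNT_CREATE_NOPING,
                .timeout        = timeout,      /* NULL => transport default */
        };

        return rpc_create(&args);
}

The caller then selects the table, and the integer flag disappears:

        rpcb_clnt = rpcb_create_local(addr, addrlen, version,
                        msg->rpc_proc->p_proc == RPCBPROC_UNSET ?
                                &rpcb_unreg_timeout : NULL);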
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Thursday May 28, [email protected] wrote:
> On May 28, 2009, at 2:33 AM, NeilBrown wrote:
> >
> > [An alternate might be to make the sunrpc code always "connect"
> > udp sockets so that "port not reachable" errors would get reported
> > back. This requires a more intrusive change though and might have
> > other consequences]
>
> We had discussed this about a year ago when I started adding IPv6
> support. I had suggested switching the local rpc client to use TCP
> instead of UDP to solve exactly this time-out problem during start-
> up. There was some resistance to the idea because TCP would leave
> privileged ports in TIMEWAIT (at shutdown, this is probably not a
> significant concern).
>
> Trond had intended to introduce connected UDP socket support to the
> RPC client, although we were also interested in someday having a
> single UDP socket for all RPC traffic... the design never moved on
> from there.
>
> My feeling at this point is that having a connected UDP socket
> transport would be simpler and have broader benefits than waiting for
> an eventual design that can accommodate multiple transport instances
> sharing a single socket.
The use of connected UDP would have to be limited to known-safe cases
such as contacting the local portmap. I believe there are still NFS
servers out there that - if multihomed - can reply from a different
address to the one the request was sent to.
And we would need to check that rpcbind does the right thing. I
recently discovered that rpcbind is buggy and will sometimes respond
from the wrong interface - I suspect localhost addresses are safe, but
we would need to check, or fix it (I fixed that bug in portmap (glibc
actually) 6 years ago and now it appears again in rpcbind - groan!).
How hard would it be to add (optional) connected UDP support? Would
we just make the code more like the TCP version, or are there any
gotchas that you know of that we would need to be careful of?
Thanks,
NeilBrown
On Thursday May 28, [email protected] wrote:
> At 02:33 AM 5/28/2009, NeilBrown wrote:
> >Unregistering an RPC service is not essential - but it is tidy.
> >So it is unpleasant to wait a long time for it to complete.
> >
> >As unregistering is always directed to localhost, the most likely
> >reason for any delay is the portmap (or rpcbind) is not running.
> >In that case, waiting is totally pointless. In any case, we should
> >expect a quick response, and zero packet loss.
> >
> >So reduce the timeouts to a total of half a second.
>
> Why wait for the response at all in this case? With zero retries
> and nothing to do with the result, it seems pointless to even wait
> for half a second.
That's a fair point, thanks. I'll keep that in mind if I revisit
these patches.
Thanks,
NeilBrown
On Jun 11, 2009, at 12:49 AM, Neil Brown wrote:
> On Thursday May 28, [email protected] wrote:
>> At 02:33 AM 5/28/2009, NeilBrown wrote:
>>> Unregistering an RPC service is not essential - but it is tidy.
>>> So it is unpleasant to wait a long time for it to complete.
>>>
>>> As unregistering is always directed to localhost, the most likely
>>> reason for any delay is the portmap (or rpcbind) is not running.
>>> In that case, waiting is totally pointless. In any case, we should
>>> expect a quick response, and zero packet loss.
>>>
>>> So reduce the timeouts to a total of half a second.
>>
>> Why wait for the response at all in this case? With zero retries
>> and nothing to do with the result, it seems pointless to even wait
>> for half a second.
>
> That's a fair point, thanks. I'll keep that in mind if I revisit
> these patches.
While not waiting is probably OK in the "shutdown" case, it could be a
problem during normal operation. The kernel unregisters RPC services
before it attempts to register them (to clear any old registrations).
Not waiting for the unregister reply might be racy.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
> On Thursday May 28, [email protected] wrote:
>> On May 28, 2009, at 2:33 AM, NeilBrown wrote:
>>>
>>> [An alternate might be to make the sunrpc code always "connect"
>>> udp sockets so that "port not reachable" errors would get reported
>>> back. This requires a more intrusive change though and might have
>>> other consequences]
>>
>> We had discussed this about a year ago when I started adding IPv6
>> support. I had suggested switching the local rpc client to use TCP
>> instead of UDP to solve exactly this time-out problem during start-
>> up. There was some resistance to the idea because TCP would leave
>> privileged ports in TIMEWAIT (at shutdown, this is probably not a
>> significant concern).
>>
>> Trond had intended to introduce connected UDP socket support to the
>> RPC client, although we were also interested in someday having a
>> single UDP socket for all RPC traffic... the design never moved on
>> from there.
>>
>> My feeling at this point is that having a connected UDP socket
>> transport would be simpler and have broader benefits than waiting for
>> an eventual design that can accommodate multiple transport instances
>> sharing a single socket.
>
> The use of connected UDP would have to be limited to known-safe cases
> such as contacting the local portmap. I believe there are still NFS
> servers out there that - if multihomed - can reply from a different
> address to the one the request was sent to.
I think I advocated for adding an entirely new transport capability
called CUDP at the time. But this is definitely something to remember
as we test.
If a new transport capability is added, at this point we would likely
need some additional logic in the NFS mount parsing logic to expose
such a transport to user space. So, leaving that parsing logic alone
should insulate the NFS client from the new transport until we have
more confidence.
> And we would need to check that rpcbind does the right thing. I
> recently discovered that rpcbind is buggy and will sometimes respond
> from the wrong interface - I suspect localhost addresses are safe, but
> we would need to check, or fix it (I fixed that bug in portmap (glibc
> actually) 6 years ago and now it appears again in rpcbind - groan!).
Details welcome. We will probably need to fix libtirpc.
> How hard would it be to add (optional) connected UDP support? Would
> we just make the code more like the TCP version, or are there any
> gotchas that you know of that we would need to be careful of?
The code in net/sunrpc/xprtsock.c is a bunch of transport methods,
many of which are shared between the UDP and TCP transport
capabilities. You could probably do this easily by creating a new
xprt_class structure and a new ops vector, then reuse as many UDP
methods as possible. The TCP connect method could be usable as is,
but it would be simple to copy-n-paste a new one if some variation is
required. Then, define a new XPRT_ value, and use that in
rpcb_create_local().
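A rough skeleton of that shape, with illustrative names only (the ident
value is arbitrary, as long as it is neither 6 nor 17):

#define XPRT_TRANSPORT_CUDP     18      /* illustrative; any unused value */

/* same shape as xs_setup_udp(), except that the connect worker it
 * installs would also kernel_connect() the new socket */
static struct rpc_xprt *xs_setup_cudp(struct xprt_create *args);

static struct xprt_class xs_cudp_transport = {
        .list   = LIST_HEAD_INIT(xs_cudp_transport.list),
        .name   = "cudp",
        .owner  = THIS_MODULE,
        .ident  = XPRT_TRANSPORT_CUDP,
        .setup  = xs_setup_cudp,
};

init_socket_xprt() would then register the class with
xprt_register_transport(&xs_cudp_transport), and rpcb_create_local()
would pass the new value as its .protocol.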
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote:
> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
>> On Thursday May 28, [email protected] wrote:
>>> On May 28, 2009, at 2:33 AM, NeilBrown wrote:
>>>>
>>>> [An alternate might be to make the sunrpc code always "connect"
>>>> udp sockets so that "port not reachable" errors would get reported
>>>> back. This requires a more intrusive change though and might have
>>>> other consequences]
>>>
>>> We had discussed this about a year ago when I started adding IPv6
>>> support. I had suggested switching the local rpc client to use TCP
>>> instead of UDP to solve exactly this time-out problem during start-
>>> up. There was some resistance to the idea because TCP would leave
>>> privileged ports in TIMEWAIT (at shutdown, this is probably not a
>>> significant concern).
>>>
>>> Trond had intended to introduce connected UDP socket support to the
>>> RPC client, although we were also interested in someday having a
>>> single UDP socket for all RPC traffic... the design never moved on
>>> from there.
>>>
>>> My feeling at this point is that having a connected UDP socket
>>> transport would be simpler and have broader benefits than waiting
>>> for
>>> an eventual design that can accommodate multiple transport instances
>>> sharing a single socket.
>>
>> The use of connected UDP would have to be limited to known-safe cases
>> such as contacting the local portmap. I believe there are still NFS
>> servers out there that - if multihomed - can reply from a different
>> address to the one the request was sent to.
>
> I think I advocated for adding an entirely new transport capability
> called CUDP at the time. But this is definitely something to
> remember as we test.
>
> If a new transport capability is added, at this point we would
> likely need some additional logic in the NFS mount parsing logic to
> expose such a transport to user space. So, leaving that parsing
> logic alone should insulate the NFS client from the new transport
> until we have more confidence.
>
>> And we would need to check that rpcbind does the right thing. I
>> recently discovered that rpcbind is buggy and will sometimes respond
>> from the wrong interface - I suspect localhost addresses are safe,
>> but
>> we would need to check, or fix it (I fixed that bug in portmap (glibc
>> actually) 6 years ago and now it appears again in rpcbind - groan!).
>
> Details welcome. We will probably need to fix libtirpc.
>
>> How hard would it be to add (optional) connected UDP support? Would
>> we just make the code more like the TCP version, or are there any
>> gotchas that you know of that we would need to be careful of?
>
> The code in net/sunrpc/xprtsock.c is a bunch of transport methods,
> many of which are shared between the UDP and TCP transport
> capabilities. You could probably do this easily by creating a new
> xprt_class structure and a new ops vector, then reuse as many UDP
> methods as possible. The TCP connect method could be usable as is,
> but it would be simple to copy-n-paste a new one if some variation
> is required. Then, define a new XPRT_ value, and use that in
> rpcb_create_local().
I've thought about this some more...
It seems to me that you might be better off using the existing UDP
transport code, but adding a new RPC_CLNT_CREATE_ flag to enable
connected UDP semantics. The two transports are otherwise exactly the
same.
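To make that concrete: RPC_CLNT_CREATE_CONNECTED below is a
hypothetical flag, not an existing symbol; the sketch only shows how
rpcb_create_local() would request connected semantics while keeping
XPRT_TRANSPORT_UDP.

#define RPC_CLNT_CREATE_CONNECTED       (1UL << 9)      /* hypothetical */

        struct rpc_create_args args = {
                .protocol       = XPRT_TRANSPORT_UDP,
                .address        = addr,
                .addrsize       = addrlen,
                .servername     = "localhost",
                .authflavor     = RPC_AUTH_UNIX,
                .flags          = RPC_CLNT_CREATE_NOPING |
                                  RPC_CLNT_CREATE_CONNECTED,
        };

The open question is how the UDP connect worker, which actually creates
the socket, would get to see that flag.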
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Chuck Lever wrote:
> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote:
>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
>>
>>> How hard would it be to add (optional) connected UDP support? Would
>>> we just make the code more like the TCP version, or are there any
>>> gotchas that you know of that we would need to be careful of?
>>
>> The code in net/sunrpc/xprtsock.c is a bunch of transport methods,
>> many of which are shared between the UDP and TCP transport
>> capabilities. You could probably do this easily by creating a new
>> xprt_class structure and a new ops vector, then reuse as many UDP
>> methods as possible. The TCP connect method could be usable as is,
>> but it would be simple to copy-n-paste a new one if some variation is
>> required. Then, define a new XPRT_ value, and use that in
>> rpcb_create_local().
I attempted a patch based on your suggestions. The socket seems to be
getting the -ECONNREFUSED error, but the error isn't propagating all the
way up (I have yet to debug why).
include/linux/sunrpc/xprtsock.h | 6 ++
net/sunrpc/rpcb_clnt.c | 2 +-
net/sunrpc/xprt.c | 5 +
net/sunrpc/xprtsock.c | 186 +++++++++++++++++++++++++++++++++++++++
4 files changed, 198 insertions(+), 1 deletions(-)
diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index c2a46c4..3f05a45 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -22,6 +22,12 @@ void cleanup_socket_xprt(void);
*/
#define XPRT_TRANSPORT_UDP IPPROTO_UDP
#define XPRT_TRANSPORT_TCP IPPROTO_TCP
+/*
+ * Connected UDP doesn't seem to be a well-defined protocol, picking a number
+ * other than 17 and 6 should be OK.
+ * FIXME: If the above assumption is wrong.
+ */
+#define XPRT_TRANSPORT_CUDP 18
/*
* RPC slot table sizes for UDP, TCP transports
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index beee6da..73f8048 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -135,7 +135,7 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
size_t addrlen, u32 version)
{
struct rpc_create_args args = {
- .protocol = XPRT_TRANSPORT_UDP,
+ .protocol = XPRT_TRANSPORT_CUDP,
.address = addr,
.addrsize = addrlen,
.servername = "localhost",
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index f412a85..7e605f1 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -898,6 +898,11 @@ void xprt_transmit(struct rpc_task *task)
req->rq_connect_cookie = xprt->connect_cookie;
req->rq_xtime = jiffies;
status = xprt->ops->send_request(task);
+ if (status == -ECONNREFUSED) {
+ dprintk("RPC: send_request returned (%d) in xprt_transmit\n",
+ status);
+ rpc_exit(task, status);
+ }
if (status != 0) {
task->tk_status = status;
return;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 83c73c4..6cc24c3 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -193,6 +193,8 @@ static ctl_table sunrpc_table[] = {
*/
#define XS_IDLE_DISC_TO (5U * 60 * HZ)
+#define IPPROTO_CUDP XPRT_TRANSPORT_CUDP
+
#ifdef RPC_DEBUG
# undef RPC_DEBUG_DATA
# define RPCDBG_FACILITY RPCDBG_TRANS
@@ -1832,6 +1834,100 @@ out:
xprt_wake_pending_tasks(xprt, status);
}
+/**
+ * xs_cudp_connect_worker4 - set up a connected UDP socket
+ * @work: RPC transport to connect
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_cudp_connect_worker4(struct work_struct *work)
+{
+ struct sock_xprt *transport =
+ container_of(work, struct sock_xprt, connect_worker.work);
+ struct rpc_xprt *xprt = &transport->xprt;
+ struct socket *sock = transport->sock;
+ int err, status = -EIO;
+
+ if (xprt->shutdown)
+ goto out;
+
+ /* Start by resetting any existing state */
+ xs_reset_transport(transport);
+
+ err = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+ if (err < 0) {
+ dprintk("RPC: can't create UDP transport socket (%d).\n", -err);
+ goto out;
+ }
+ xs_reclassify_socket4(sock);
+
+ if (xs_bind4(transport, sock)) {
+ sock_release(sock);
+ goto out;
+ }
+
+ dprintk("RPC: worker connecting xprt %p to address: %s\n",
+ xprt, xprt->address_strings[RPC_DISPLAY_ALL]);
+
+ xs_udp_finish_connecting(xprt, sock);
+ err = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK);
+ if (err < 0) {
+ dprintk("RPC: connect on UDP socket.. failed (%d).\n", -err);
+ goto out;
+ }
+ status = 0;
+out:
+ xprt_clear_connecting(xprt);
+ xprt_wake_pending_tasks(xprt, status);
+}
+
+/**
+ * xs_cudp_connect_worker6 - set up a Connected UDP socket
+ * @work: RPC transport to connect
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_cudp_connect_worker6(struct work_struct *work)
+{
+ struct sock_xprt *transport =
+ container_of(work, struct sock_xprt, connect_worker.work);
+ struct rpc_xprt *xprt = &transport->xprt;
+ struct socket *sock = transport->sock;
+ int err, status = -EIO;
+
+ if (xprt->shutdown)
+ goto out;
+
+ /* Start by resetting any existing state */
+ xs_reset_transport(transport);
+
+ err = sock_create_kern(PF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock);
+ if (err < 0) {
+ dprintk("RPC: can't create UDP transport socket (%d).\n", -err);
+ goto out;
+ }
+ xs_reclassify_socket6(sock);
+
+ if (xs_bind6(transport, sock) < 0) {
+ sock_release(sock);
+ goto out;
+ }
+
+ dprintk("RPC: worker connecting xprt %p to address: %s\n",
+ xprt, xprt->address_strings[RPC_DISPLAY_ALL]);
+
+ xs_udp_finish_connecting(xprt, sock);
+ err = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK);
+ if (err < 0) {
+ dprintk("RPC: connect on UDP socket... failed (%d).\n", -err);
+ goto out;
+ }
+ status = 0;
+out:
+ xprt_clear_connecting(xprt);
+ xprt_wake_pending_tasks(xprt, status);
+}
+
/*
* We need to preserve the port number so the reply cache on the server can
* find our cached RPC replies when we get around to reconnecting.
@@ -2192,6 +2288,24 @@ static struct rpc_xprt_ops xs_tcp_ops = {
.print_stats = xs_tcp_print_stats,
};
+static struct rpc_xprt_ops xs_cudp_ops = {
+ .set_buffer_size = xs_udp_set_buffer_size,
+ .reserve_xprt = xprt_reserve_xprt_cong,
+ .release_xprt = xprt_release_xprt_cong,
+ .rpcbind = rpcb_getport_async,
+ .set_port = xs_set_port,
+ .connect = xs_tcp_connect,
+ .buf_alloc = rpc_malloc,
+ .buf_free = rpc_free,
+ .send_request = xs_udp_send_request,
+ .set_retrans_timeout = xprt_set_retrans_timeout_rtt,
+ .timer = xs_udp_timer,
+ .release_request = xprt_release_rqst_cong,
+ .close = xs_close,
+ .destroy = xs_destroy,
+ .print_stats = xs_udp_print_stats,
+};
+
static struct rpc_xprt *xs_setup_xprt(struct xprt_create *args,
unsigned int slot_table_size)
{
@@ -2298,6 +2412,68 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
return ERR_PTR(-EINVAL);
}
+/**
+ * xs_setup_cudp - Set up transport to use a connected UDP socket
+ * @args: rpc transport creation arguments
+ *
+ */
+static struct rpc_xprt *xs_setup_cudp(struct xprt_create *args)
+{
+ struct sockaddr *addr = args->dstaddr;
+ struct rpc_xprt *xprt;
+ struct sock_xprt *transport;
+
+ xprt = xs_setup_xprt(args, xprt_udp_slot_table_entries);
+ if (IS_ERR(xprt))
+ return xprt;
+ transport = container_of(xprt, struct sock_xprt, xprt);
+
+ xprt->prot = IPPROTO_CUDP;
+ xprt->tsh_size = 0;
+ /* XXX: header size can vary due to auth type, IPv6, etc. */
+ xprt->max_payload = (1U << 16) - (MAX_HEADER << 3);
+
+ xprt->bind_timeout = XS_BIND_TO;
+ xprt->connect_timeout = XS_UDP_CONN_TO;
+ xprt->reestablish_timeout = XS_UDP_REEST_TO;
+ xprt->idle_timeout = XS_IDLE_DISC_TO;
+
+ xprt->ops = &xs_cudp_ops;
+ xprt->timeout = &xs_udp_default_timeout;
+
+ switch (addr->sa_family) {
+ case AF_INET:
+ if (((struct sockaddr_in *)addr)->sin_port != htons(0))
+ xprt_set_bound(xprt);
+
+ INIT_DELAYED_WORK(&transport->connect_worker,
+ xs_cudp_connect_worker4);
+ xs_format_ipv4_peer_addresses(xprt, "cudp", RPCBIND_NETID_UDP);
+ break;
+ case AF_INET6:
+ if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0))
+ xprt_set_bound(xprt);
+
+ INIT_DELAYED_WORK(&transport->connect_worker,
+ xs_cudp_connect_worker6);
+ xs_format_ipv6_peer_addresses(xprt, "cudp", RPCBIND_NETID_UDP);
+ break;
+ default:
+ kfree(xprt);
+ return ERR_PTR(-EAFNOSUPPORT);
+ }
+
+ dprintk("RPC: set up transport to address %s\n",
+ xprt->address_strings[RPC_DISPLAY_ALL]);
+
+ if (try_module_get(THIS_MODULE))
+ return xprt;
+
+ kfree(xprt->slot);
+ kfree(xprt);
+ return ERR_PTR(-EINVAL);
+}
+
static const struct rpc_timeout xs_tcp_default_timeout = {
.to_initval = 60 * HZ,
.to_maxval = 60 * HZ,
@@ -2379,6 +2555,14 @@ static struct xprt_class xs_tcp_transport = {
.setup = xs_setup_tcp,
};
+static struct xprt_class xs_cudp_transport = {
+ .list = LIST_HEAD_INIT(xs_cudp_transport.list),
+ .name = "cudp",
+ .owner = THIS_MODULE,
+ .ident = IPPROTO_CUDP,
+ .setup = xs_setup_cudp,
+};
+
/**
* init_socket_xprt - set up xprtsock's sysctls, register with RPC client
*
@@ -2392,6 +2576,7 @@ int init_socket_xprt(void)
xprt_register_transport(&xs_udp_transport);
xprt_register_transport(&xs_tcp_transport);
+ xprt_register_transport(&xs_cudp_transport);
return 0;
}
@@ -2411,4 +2596,5 @@ void cleanup_socket_xprt(void)
xprt_unregister_transport(&xs_udp_transport);
xprt_unregister_transport(&xs_tcp_transport);
+ xprt_unregister_transport(&xs_cudp_transport);
}
> I've thought about this some more...
>
> It seems to me that you might be better off using the existing UDP
> transport code, but adding a new RPC_CLNT_CREATE_ flag to enable
> connected UDP semantics. The two transports are otherwise exactly the
> same.
>
It doesn't seem that there is a clean way of doing this. The function
xs_setup_udp() sets up the corresponding connect_worker function which
actually sets up the UDP socket. There doesn't seem to be a way to check
whether this flag (or a new rpc_clnt->cl_ flag) is set or not in either of
the functions.
OTOH, why not use AF_LOCAL sockets since it's for local communication only?
Thanks,
--
Suresh Jayaraman
On Jul 6, 2009, at 8:42 AM, Suresh Jayaraman wrote:
> Chuck Lever wrote:
>> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote:
>>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
>
>>>
>>>> How hard would it be to add (optional) connected UDP support?
>>>> Would
>>>> we just make the code more like the TCP version, or are there any
>>>> gotchas that you know of that we would need to be careful of?
>>>
>>> The code in net/sunrpc/xprtsock.c is a bunch of transport methods,
>>> many of which are shared between the UDP and TCP transport
>>> capabilities. You could probably do this easily by creating a new
>>> xprt_class structure and a new ops vector, then reuse as many UDP
>>> methods as possible. The TCP connect method could be usable as is,
>>> but it would be simple to copy-n-paste a new one if some variation
>>> is
>>> required. Then, define a new XPRT_ value, and use that in
>>> rpcb_create_local().
>
> I attempted a patch based on your suggestions. The socket seems to be
> getting the -ECONNREFUSED error, but the error isn't propagating all the
> way up (I have yet to debug why).
I suspect it's because a while ago Trond changed the connect logic to
retry everything, including ECONNREFUSED.
I've hit this problem recently as well. The kernel's NFS mount client
needs rpcbind to recognize when a port is not active so it can stop
retrying that port and switch to a different transport, just as
mount.nfs does.
We will need to add a mechanism to allow ECONNREFUSED to be propagated
up the stack again. Trond suggested adding a per-RPC flag (on
rpc_call_sync()) to do this. Connection semantics seem to me to be an
attribute of the transport, not of a single RPC, though. And, the
protocol where we really need this is rpcbind, which usually creates a
transport to send just a single RPC.
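As a sketch of the per-RPC variant only (the flag name and value are
hypothetical, and none of this is from a posted patch): the caller opts
in on rpc_call_sync(), and the error path honours the flag instead of
retrying.

#define RPC_TASK_NOCONNRETRY    0x0400  /* hypothetical task flag */

        /* caller side, e.g. in rpcb_register_call(): */
        error = rpc_call_sync(rpcb_clnt, msg, RPC_TASK_NOCONNRETRY);

        /* connect/transmit error handling: */
        if (status == -ECONNREFUSED &&
            (task->tk_flags & RPC_TASK_NOCONNRETRY)) {
                rpc_exit(task, status);         /* report, don't retry */
                return;
        }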
> [patch quoted in full; snipped]
>
>> I've thought about this some more...
>>
>> It seems to me that you might be better off using the existing UDP
>> transport code, but adding a new RPC_CLNT_CREATE_ flag to enable
>> connected UDP semantics. The two transports are otherwise exactly
>> the
>> same.
>>
>
> It doesn't seem that there is a clean way of doing this. The function
> xs_setup_udp() sets up the corresponding connect_worker function which
> actually sets up the UDP socket. There doesn't seem to be a way to
> check
> whether this flag (or a new rpc_clnt->cl_ flag) is set or not in
> either of
> the functions.
>
> OTOH, why not use AF_LOCAL sockets since it's for local
> communication only?
>
>
> Thanks,
>
>
> --
> Suresh Jayaraman
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Chuck Lever wrote:
> On Jul 6, 2009, at 8:42 AM, Suresh Jayaraman wrote:
>> Chuck Lever wrote:
>>> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote:
>>>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
>>
>>>>
>>>>> How hard would it be to add (optional) connected UDP support? Would
>>>>> we just make the code more like the TCP version, or are there any
>>>>> gotchas that you know of that we would need to be careful of?
>>>>
>>>> The code in net/sunrpc/xprtsock.c is a bunch of transport methods,
>>>> many of which are shared between the UDP and TCP transport
>>>> capabilities. You could probably do this easily by creating a new
>>>> xprt_class structure and a new ops vector, then reuse as many UDP
>>>> methods as possible. The TCP connect method could be usable as is,
>>>> but it would be simple to copy-n-paste a new one if some variation is
>>>> required. Then, define a new XPRT_ value, and use that in
>>>> rpcb_create_local().
>>
>> I attempted a patch based on your suggestions. The socket seems to be
>> getting the -ECONNREFUSED error, but the error isn't propagating all the
>> way up (I have yet to debug why).
>
> I suspect it's because a while ago Trond changed the connect logic to
> retry everything, including ECONNREFUSED.
> I've hit this problem recently as well. The kernel's NFS mount client
> needs rpcbind to recognize when a port is not active so it can stop
> retrying that port and switch to a different transport, just as
> mount.nfs does.
>
> We will need to add a mechanism to allow ECONNREFUSED to be propagated
> up the stack again. Trond suggested adding a per-RPC flag (on
> rpc_call_sync()) to do this. Connection semantics seem to me to be an
> attribute of the transport, not of a single RPC, though. And, the
> protocol where we really need this is rpcbind, which usually creates a
> transport to send just a single RPC.
>
Ah ok, good to know this.
BTW, it seems my questions on using an RPC_CLNT_CREATE_ flag and on using
AF_LOCAL sockets got overshadowed (see below) by the patch. Would making
rpcbind use AF_LOCAL sockets be a good idea, or does connected UDP still
seem the better solution?
>>
>>> I've thought about this some more...
>>>
>>> It seems to me that you might be better off using the existing UDP
>>> transport code, but adding a new RPC_CLNT_CREATE_ flag to enable
>>> connected UDP semantics. The two transports are otherwise exactly the
>>> same.
>>>
>>
>> It doesn't seem that there is a clean way of doing this. The function
>> xs_setup_udp() sets up the corresponding connect_worker function which
>> actually sets up the UDP socket. There doesn't seem to be a way to check
>> whether this flag (or a new rpc_clnt->cl_ flag) is set or not in
>> either of
>> the functions.
>>
>> OTOH, why not use AF_LOCAL sockets since it's for local communication
>> only?
>>
Thanks,
--
Suresh Jayaraman
On Mon, 2009-07-06 at 21:38 +0530, Suresh Jayaraman wrote:
> BTW, it seems my questions on using RPC_CLNT_CREATE_ flag and using
> AF_LOCAL sockets got overshadowed (seen below) by the patch. Would
> making rpcbind using AF_LOCAL sockets a good idea or connected UDP still
> seems a better solution?
The kernel RPC client doesn't support AF_LOCAL sockets. Why would we
want to add them when we already have other mechanisms to communicate
with userland that provide the same functionality?
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Jul 6, 2009, at 12:08 PM, Suresh Jayaraman wrote:
> Chuck Lever wrote:
>> On Jul 6, 2009, at 8:42 AM, Suresh Jayaraman wrote:
>>> Chuck Lever wrote:
>>>> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote:
>>>>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
>>>
>>>>>
>>>>>> How hard would it be to add (optional) connected UDP support?
>>>>>> Would
>>>>>> we just make the code more like the TCP version, or are there any
>>>>>> gotchas that you know of that we would need to be careful of?
>>>>>
>>>>> The code in net/sunrpc/xprtsock.c is a bunch of transport methods,
>>>>> many of which are shared between the UDP and TCP transport
>>>>> capabilities. You could probably do this easily by creating a new
>>>>> xprt_class structure and a new ops vector, then reuse as many UDP
>>>>> methods as possible. The TCP connect method could be usable as
>>>>> is,
>>>>> but it would be simple to copy-n-paste a new one if some
>>>>> variation is
>>>>> required. Then, define a new XPRT_ value, and use that in
>>>>> rpcb_create_local().
>>>
>>> I attempted a patch based on your suggestions. The socket seems to be
>>> getting the -ECONNREFUSED error, but the error isn't propagating all
>>> the way up (I have yet to debug why).
>>
>> I suspect it's because a while ago Trond changed the connect logic to
>> retry everything, including ECONNREFUSED.
>
>> I've hit this problem recently as well. The kernel's NFS mount
>> client
>> needs rpcbind to recognize when a port is not active so it can stop
>> retrying that port and switch to a different transport, just as
>> mount.nfs does.
>>
>> We will need to add a mechanism to allow ECONNREFUSED to be
>> propagated
>> up the stack again. Trond suggested adding a per-RPC flag (on
>> rpc_call_sync()) to do this. Connection semantics seem to me to be
>> an
>> attribute of the transport, not of a single RPC, though. And, the
>> protocol where we really need this is rpcbind, which usually
>> creates a
>> transport to send just a single RPC.
>
> Ah ok, good to know this.
>
> BTW, it seems my questions on using an RPC_CLNT_CREATE_ flag and on
> using AF_LOCAL sockets got overshadowed (see below) by the patch. Would
> making rpcbind use AF_LOCAL sockets be a good idea, or does connected
> UDP still seem the better solution?
See below.
>>>> I've thought about this some more...
>>>>
>>>> It seems to me that you might be better off using the existing UDP
>>>> transport code, but adding a new RPC_CLNT_CREATE_ flag to enable
>>>> connected UDP semantics. The two transports are otherwise
>>>> exactly the
>>>> same.
>>>
>>> It doesn't seem that there is a clean way of doing this. The
>>> function
>>> xs_setup_udp() sets up the corresponding connect_worker function
>>> which
>>> actually sets up the UDP socket. There doesn't seem to be a way to
>>> check
>>> whether this flag (or a new rpc_clnt->cl_ flag) is set or not in
>>> either of
>>> the functions.
See xs_tcp_connect(). On connect, you get an rpc_task structure which
contains an rpc_clnt pointer, and can use that to switch connection
semantics.
>>> OTOH, why not use AF_LOCAL sockets since it's for local
>>> communication
>>> only?
I have considered that. AF_LOCAL in fact could replace all of our
upcall mechanisms. However, portmapper, which doesn't support
AF_LOCAL, is still used in some distributions.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
> I have considered that. AF_LOCAL in fact could replace all of our
> upcall mechanisms. However, portmapper, which doesn't support
> AF_LOCAL, is still used in some distributions.
As could AF_NETLINK, fork(), pipes, fifos, etc... Again: why would we
want to saddle ourselves with rpc over AF_LOCAL?
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Jul 6, 2009, at 12:40 PM, Trond Myklebust wrote:
> On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
>> I have considered that. AF_LOCAL in fact could replace all of our
>> upcall mechanisms. However, portmapper, which doesn't support
>> AF_LOCAL, is still used in some distributions.
>
> As could AF_NETLINK, fork(), pipes, fifos, etc... Again: why would we
> want to saddle ourselves with rpc over AF_LOCAL?
TI-RPC supports AF_LOCAL RPC transports.
[cel@matisse notify-one]$ rpcinfo
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100024    1    udp       0.0.0.0.206.127        status     29
    100024    1    tcp       0.0.0.0.166.105        status     29
    100024    1    udp6      ::.141.238             status     29
    100024    1    tcp6      ::.192.160             status     29
[cel@matisse notify-one]$
The listing for '/var/run/rpcbind.sock' is rpcbind's AF_LOCAL
listener. TI-RPC's rpcb_foo() calls use this method of accessing the
rpcbind database rather than going over loopback.
rpcbind scrapes the caller's effective UID off the transport socket
and uses that for authentication. Note the "owner" column... that
comes from the socket's UID, not from the r_owner field. When a
service is registered over the network, the owner column says
"unknown" and basically anyone can unset it.
If the kernel used AF_LOCAL to register its services, it would mean we
would never use a network port for local rpcbind calls between the
kernel and rpcbind, and rpcbind could automatically prevent the
kernel's RPC services from getting unset by malicious users. If
/var/run/rpcbind.sock isn't there, the kernel would know immediately
that rpcbind wasn't running.
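Purely as an illustration (this is not from any posted patch, and
struct sockaddr_un comes from <linux/un.h>): probing for that listener
from the kernel is a single connect on the named socket, and it fails
immediately when rpcbind is not there.

static int rpcb_probe_af_local(void)
{
        struct sockaddr_un sun = {
                .sun_family     = AF_LOCAL,
                .sun_path       = "/var/run/rpcbind.sock",
        };
        struct socket *sock;
        int err;

        err = sock_create_kern(PF_LOCAL, SOCK_STREAM, 0, &sock);
        if (err < 0)
                return err;

        /* fails straight away if nothing is listening on the named socket */
        err = kernel_connect(sock, (struct sockaddr *)&sun, sizeof(sun), 0);
        sock_release(sock);
        return err;
}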
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Mon, 2009-07-06 at 12:57 -0400, Chuck Lever wrote:
> On Jul 6, 2009, at 12:40 PM, Trond Myklebust wrote:
> > On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
> >> I have considered that. AF_LOCAL in fact could replace all of our
> >> upcall mechanisms. However, portmapper, which doesn't support
> >> AF_LOCAL, is still used in some distributions.
> >
> > As could AF_NETLINK, fork(), pipes, fifos, etc... Again: why would we
> > want to saddle ourselves with rpc over AF_LOCAL?
>
> TI-RPC supports AF_LOCAL RPC transports.
>
> [rpcinfo output snipped]
>
> The listing for '/var/run/rpcbind.sock' is rpcbind's AF_LOCAL
> listener. TI-RPC's rpcb_foo() calls use this method of accessing the
> rpcbind database rather than going over loopback.
>
> rpcbind scrapes the caller's effective UID off the transport socket
> and uses that for authentication. Note the "owner" column... that
> comes from the socket's UID, not from the r_owner field. When a
> service is registered over the network, the owner column says
> "unknown" and basically anyone can unset it.
>
> If the kernel used AF_LOCAL to register its services, it would mean we
> would never use a network port for local rpcbind calls between the
> kernel and rpcbind, and rpcbind could automatically prevent the
> kernel's RPC services from getting unset by malicious users. If
> /var/run/rpcbind.sock isn't there, the kernel would know immediately that
> rpcbind wasn't running.
So what? You can achieve the same with any number of communication
channels (including the network). Just add a timeout to the current
'connect()' function, and set it to a low value when doing rpcbind
upcalls.
What's so special about libtirpc or rpcbind that we have to keep
redesigning the kernel to work around their limitations instead of the
other way round?
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Jul 6, 2009, at 1:14 PM, Trond Myklebust wrote:
> On Mon, 2009-07-06 at 12:57 -0400, Chuck Lever wrote:
>> On Jul 6, 2009, at 12:40 PM, Trond Myklebust wrote:
>>> On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
>>>> I have considered that. AF_LOCAL in fact could replace all of our
>>>> upcall mechanisms. However, portmapper, which doesn't support
>>>> AF_LOCAL, is still used in some distributions.
>>>
>>> As could AF_NETLINK, fork(), pipes, fifos, etc... Again: why would
>>> we
>>> want to saddle ourselves with rpc over AF_LOCAL?
>>
>> TI-RPC supports AF_LOCAL RPC transports.
>>
>> [rpcinfo output snipped]
>>
>> The listing for '/var/run/rpcbind.sock' is rpcbind's AF_LOCAL
>> listener. TI-RPC's rpcb_foo() calls use this method of accessing the
>> rpcbind database rather than going over loopback.
>>
>> rpcbind scrapes the caller's effective UID off the transport socket
>> and uses that for authentication. Note the "owner" column... that
>> comes from the socket's UID, not from the r_owner field. When a
>> service is registered over the network, the owner column says
>> "unknown" and basically anyone can unset it.
>>
>> If the kernel used AF_LOCAL to register its services, it would mean
>> we
>> would never use a network port for local rpcbind calls between the
>> kernel and rpcbind, and rpcbind could automatically prevent the
>> kernel's RPC services from getting unset by malicious users. If
>> /var/run/rpcbind.sock isn't there, the kernel would know immediately that
>> rpcbind wasn't running.
>
> So what? You can achieve the same with any number of communication
> channels (including the network). Just add a timeout to the current
> 'connect()' function, and set it to a low value when doing rpcbind
> upcalls.
I suggested such a scheme last year when we first discussed connected
UDP, and it was decided that especially short timeouts for local
rpcbind calls were not appropriate.
In general, however, the network layer does tell us immediately when
the service is not running (ICMP port unreachable or RST). The
kernel's RPC client is basically ignoring that information.
> What's so special about libtirpc or rpcbind that we have to keep
> redesigning the kernel to work around their limitations instead of the
> other way round?
I'm not sure what you're referring to, specifically.
However, since rpcbind is a standard network protocol, the kernel
really does have to talk the protocol correctly if we want to
interoperate with non-Linux implementations. For local-only cases, we
need to ensure that the kernel is backwards compatible with portmapper.
In this case, Suresh and Neil are dealing with a problem that occurs
whether rpcbind or portmapper is in use -- basically during shutdown,
if user space has killed those processes, the kernel waits for a bit
instead of deciding immediately that it should exit. Nothing to do
with TI-RPC, though TI-RPC does offer a potential solution (AF_LOCAL).
In the mount.nfs case, user space uses RST/port unreachable
specifically for determining when the server does not support a
particular transport (see nfs_probe_port). That code is actually
baked into the mount command; it's not part of the library. If we
want to see version/transport negotiation in the kernel, then the
kernel rpcbind client has to have the ability to detect quickly when
the remote does not support the requested transport. Again, nothing
to do with TI-RPC.
In both cases, it turns out that the library implementations in user
space already fail quickly. RPC_CANTRECV is returned if an attempt is
made to send an rpcbind query to an inactive UDP port.
RPC_SYSTEMERROR/ECONNREFUSED is returned if an attempt is made to send
an rpcbind query to an inactive TCP port. In my view, the kernel is
lacking here, and should be made to emulate user space more closely.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Mon, 2009-07-06 at 13:51 -0400, Chuck Lever wrote:
> In both cases, it turns out that the library implementations in user
> space already fail quickly. RPC_CANTRECV is returned if an attempt is
> made to send an rpcbind query to an inactive UDP port.
> RPC_SYSTEMERROR/ECONNREFUSED is returned if an attempt is made to send
> an rpcbind query to an inactive TCP port. In my view, the kernel is
> lacking here, and should be made to emulate user space more closely.
I fully agree that we can fix the above by more judicious interpretation
of the returned networking errors. What I don't see eye to eye with is
the assertion that has been floating about that the current behaviour
should be a good enough reason to add kernel AF_LOCAL support.
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Jul 6, 2009, at 1:58 PM, Trond Myklebust wrote:
> On Mon, 2009-07-06 at 13:51 -0400, Chuck Lever wrote:
>> In both cases, it turns out that the library implementations in user
>> space already fail quickly. RPC_CANTRECV is returned if an attempt
>> is
>> made to send an rpcbind query to an inactive UDP port.
>> RPC_SYSTEMERROR/ECONNREFUSED is returned if an attempt is made to
>> send
>> an rpcbind query to an inactive TCP port. In my view, the kernel is
>> lacking here, and should be made to emulate user space more closely.
>
> I fully agree that we can fix the above by more judicious
> interpretation
> of the returned networking errors. What I don't see eye to eye with is
> the assertion that has been floating about that the current behaviour
> should be a good enough reason to add kernel AF_LOCAL support.
Since portmapper doesn't support it, and the kernel (for now) needs
backwards compatibility with portmapper, AF_LOCAL for rpcbind is
rather a non-starter.
In my opinion there are other reasons to consider kernel-level
AF_LOCAL support at some later point.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com