2010-09-15 13:04:49

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

Hello everyone!

I would like to prepare the sunrpc layer for working in containerized
environments. The ultimate goal is to make both the nfs client and server
work in containers. Hopefully you won't object :)

So as not to look like an idle talker, I've prepared this set, which makes
/proc/net/rpc appear in net namespaces and makes the ip_map_cache per-net.

I do not have any plans for when these patches will appear in Linus' tree and
thus do not know which git tree to hack on. That said, I prepared the patches
against the net-next tree (I have some custom netns debugging code in it
and don't want to port it around in vain).

Looking forward to your feedback.

Thanks,
Pavel


2010-09-20 16:34:15

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

>> Looking forward to your feedback.
>
> What are you thinking of as a use-case for this?

To make it possible to run both the NFS server and client in containers.
I know that the NFS client is already a filesystem, but things such
as its internal server and client abstractions require isolation
from each other in container terms.

> I think it would be useful to be able to run what appear to be multiple NFS
> servers on a single host;

Yup, this is one of the goals.

> and for that, we would want to vary more than
> just the ip_map_cache. The export-related caches (nfsd.fh and
> nfsd.export), at least.

Sure! The thing is that the full containerization of that stuff means
too many patches, and I'm not sure that you and the other maintainers wish
to review a 100-patch set in one go ;)

I want to find out which git tree to hack on and prepare small patch
sets making things work step by step. This one is just the first in a series.

> --b.
>

Thanks,
Pavel

2010-09-21 12:31:53

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

>> Now, how do I plan to solve the rpc_get_mount problem? Some time ago
>> there was a similar problem with the devpts filesystem - people making
>> ptys work per-container tried to solve the same problem and ended up
>> (with Al's help) with yet another devpts mount option which explicitly
>> stated that a new instance should be created. What do you think about
>> doing the same for rpc_pipefs (a newinstance mount option) and adding
>> yet another mount option for its only client (nfs) telling it where to
>> look for the rpc mount (e.g. rpcmount=/var/...)?
>
> As long as we have some mechanism to ensure that rpc.gssd from one net
> namespace doesn't try to establish a kerberos security context on behalf
> of an NFS mount that resides in a different net namespace.
> My point is if the rpc.gssd resides in a different net namespace, then
> we have no guarantee that the IP address we pass in the upcall even
> points to the same server, so we must ensure that the namespaces match.

Sure, but for that there's no need for strict aliasing of rpc_pipefs
superblocks with struct net. By strict I mean that not only does the
superblock know which struct net it works with, but a struct net also
references the rpc_pipefs supers.
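
A minimal sketch of the one-way tagging described above, in the spirit of the
2.6.3x VFS API. All names here (rpc_pipefs_sb_info and the two helpers) are
illustrative assumptions, not code from this series: the superblock remembers
the struct net it was mounted in via s_fs_info, while struct net keeps no
back-reference to rpc_pipefs superblocks.

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <net/net_namespace.h>

/* Hypothetical per-superblock info: the sb knows its net, and nothing
 * in struct net points back at the sb. */
struct rpc_pipefs_sb_info {
	struct net *net;
};

static int rpc_pipefs_tag_sb(struct super_block *sb)
{
	struct rpc_pipefs_sb_info *info;

	info = kzalloc(sizeof(*info), GFP_KERNEL);
	if (!info)
		return -ENOMEM;
	/* take a reference on the mounting task's net namespace */
	info->net = get_net(current->nsproxy->net_ns);
	sb->s_fs_info = info;
	return 0;
}

static void rpc_pipefs_untag_sb(struct super_block *sb)
{
	struct rpc_pipefs_sb_info *info = sb->s_fs_info;

	put_net(info->net);
	kfree(info);
	sb->s_fs_info = NULL;
}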

>>> 3) Convert the nfs_client and superblock to be per-net namespace
>>
>> Ack about the nfs_client, but as far as the superblock is
>> concerned - I think we should tag only the nfs_server with
>> net for the same reasons as in the item 2) above.
>
> You should tag the nfs_server, and then make sure that nfs_compare_super
> does not match something that is tagged with a different net namespace
> than the current one. Otherwise, you can end up mounting the wrong NFS
> server (for the same reason as above).

Yes, of course.

>>> 4) Convert lockd's struct host to be per-net namespace
>
> Cheers
> Trond
>
>


2010-09-15 13:04:56

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 7/9] sunrpc: The per-net skeleton

Register empty per-net operations for the sunrpc layer.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
net/sunrpc/netns.h | 12 ++++++++++++
net/sunrpc/sunrpc_syms.c | 31 ++++++++++++++++++++++++++++++-
2 files changed, 42 insertions(+), 1 deletions(-)
create mode 100644 net/sunrpc/netns.h

diff --git a/net/sunrpc/netns.h b/net/sunrpc/netns.h
new file mode 100644
index 0000000..b2d18af
--- /dev/null
+++ b/net/sunrpc/netns.h
@@ -0,0 +1,12 @@
+#ifndef __SUNRPC_NETNS_H__
+#define __SUNRPC_NETNS_H__
+
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+struct sunrpc_net {
+};
+
+extern int sunrpc_net_id;
+
+#endif
diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index c0d0850..d552a6a 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -22,6 +22,26 @@
#include <linux/sunrpc/rpc_pipe_fs.h>
#include <linux/sunrpc/xprtsock.h>

+#include "netns.h"
+
+int sunrpc_net_id;
+
+static __net_init int sunrpc_init_net(struct net *net)
+{
+ return 0;
+}
+
+static __net_exit void sunrpc_exit_net(struct net *net)
+{
+}
+
+static struct pernet_operations sunrpc_net_ops = {
+ .init = sunrpc_init_net,
+ .exit = sunrpc_exit_net,
+ .id = &sunrpc_net_id,
+ .size = sizeof(struct sunrpc_net),
+};
+
extern struct cache_detail ip_map_cache, unix_gid_cache;

extern void cleanup_rpcb_clnt(void);
@@ -38,18 +58,26 @@ init_sunrpc(void)
err = rpcauth_init_module();
if (err)
goto out3;
+
+ cache_initialize();
+
+ err = register_pernet_subsys(&sunrpc_net_ops);
+ if (err)
+ goto out4;
#ifdef RPC_DEBUG
rpc_register_sysctl();
#endif
#ifdef CONFIG_PROC_FS
rpc_proc_init();
#endif
- cache_initialize();
cache_register(&ip_map_cache);
cache_register(&unix_gid_cache);
svc_init_xprt_sock(); /* svc sock transport */
init_socket_xprt(); /* clnt sock transport */
return 0;
+
+out4:
+ unregister_pernet_subsys(&sunrpc_net_ops);
out3:
rpc_destroy_mempool();
out2:
@@ -69,6 +97,7 @@ cleanup_sunrpc(void)
rpc_destroy_mempool();
cache_unregister(&ip_map_cache);
cache_unregister(&unix_gid_cache);
+ unregister_pernet_subsys(&sunrpc_net_ops);
#ifdef RPC_DEBUG
rpc_unregister_sysctl();
#endif
--
1.5.5.6


2010-09-20 18:05:44

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
> >> Looking forward to your feedback.
> >
> > What are you thinking of as a use-case for this?
>
> To make it possible to run both NFS server and client in containers.

Could you describe that in user-visible terms? (Currently if I create a
new network namespace, what happens, and what will happen differently
afterwards?)

> I know that the NFS client is already a filesystem, but such
> things as its internal servers and clients abstraction require
> isolation from each other in container terms.
>
> > I think it would be useful to be able to run what appear to be multiple NFS
> > servers on a single host;
>
> Yup, this is one of the goals.

OK, good.

> > and for that, we would want to vary more than
> > just the ip_map_cache. The export-related caches (nfsd.fh and
> > nfsd.export), at least.
>
> Sure! The thing is that the full containerization of that stuff is
> too many patches and I'm not sure that you and other maintainers wish
> to review the 100-patch set in one go ;)

Well, if it's really all ready....

Better, though, would be an outline of the work to be done and what you
expect to be working at the end.

> I want to find out what git tree to hack on and prepare small patch
> sets making things step-by-step. This one is just the first in a row.

For the server side you can use

git://linux-nfs.org/~bfields/linux.git nfsd-next

though generally the latest upstream will likely work as well.

On a quick skim, those patches look fine (and broken up nicely for
review, thanks). My main concern is just being sure I understand where
this all ends up.

--b.

2010-09-15 13:04:58

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 2/9] sunrpc: Make xprt auth cache release work with the xprt

This is done in order to facilitate getting the ip_map_cache from
which to put the ip_map.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
include/linux/sunrpc/svcauth.h | 3 ++-
net/sunrpc/svc_xprt.c | 5 ++---
net/sunrpc/svcauth_unix.c | 9 ++++++---
3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/sunrpc/svcauth.h b/include/linux/sunrpc/svcauth.h
index d39dbdc..f656667 100644
--- a/include/linux/sunrpc/svcauth.h
+++ b/include/linux/sunrpc/svcauth.h
@@ -112,6 +112,7 @@ struct auth_ops {
#define SVC_PENDING 8
#define SVC_COMPLETE 9

+struct svc_xprt;

extern int svc_authenticate(struct svc_rqst *rqstp, __be32 *authp);
extern int svc_authorise(struct svc_rqst *rqstp);
@@ -127,7 +128,7 @@ extern struct auth_domain *auth_domain_find(char *name);
extern struct auth_domain *auth_unix_lookup(struct in6_addr *addr);
extern int auth_unix_forget_old(struct auth_domain *dom);
extern void svcauth_unix_purge(void);
-extern void svcauth_unix_info_release(void *);
+extern void svcauth_unix_info_release(struct svc_xprt *xpt);
extern int svcauth_unix_set_client(struct svc_rqst *rqstp);

static inline unsigned long hash_str(char *name, int bits)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index cbc0849..57703ac 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -128,9 +128,8 @@ static void svc_xprt_free(struct kref *kref)
struct svc_xprt *xprt =
container_of(kref, struct svc_xprt, xpt_ref);
struct module *owner = xprt->xpt_class->xcl_owner;
- if (test_bit(XPT_CACHE_AUTH, &xprt->xpt_flags) &&
- xprt->xpt_auth_cache != NULL)
- svcauth_unix_info_release(xprt->xpt_auth_cache);
+ if (test_bit(XPT_CACHE_AUTH, &xprt->xpt_flags))
+ svcauth_unix_info_release(xprt);
xprt->xpt_ops->xpo_free(xprt);
module_put(owner);
}
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 1fe37be..aef0feb 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -472,10 +472,13 @@ ip_map_cached_put(struct svc_rqst *rqstp, struct ip_map *ipm)
}

void
-svcauth_unix_info_release(void *info)
+svcauth_unix_info_release(struct svc_xprt *xpt)
{
- struct ip_map *ipm = info;
- cache_put(&ipm->h, &ip_map_cache);
+ struct ip_map *ipm;
+
+ ipm = xpt->xpt_auth_cache;
+ if (ipm != NULL)
+ cache_put(&ipm->h, &ip_map_cache);
}

/****************************************************************************
--
1.5.5.6


2010-09-20 16:14:53

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Wed, Sep 15, 2010 at 04:23:55PM +0400, Pavel Emelyanov wrote:
> Hello everyone!
>
> I would like to prepare the sunrpc layer for working in containerized
> environments. The ultimate goal is to make both nfs client and server
> work in containers. Hopefully you won't object :)

Neat, thanks.

> Not to look like an idle talker I've prepared this set which makes the
> /proc/net/rpc appear in net namespaces and made the ip_map_cache be per-net.
>
> I do not have any plans for when these patches will appear in Linus' tree and
> thus do not know which git tree to hack on. That said I prepared the patches
> against the net-next tree (I have some custom netns debugging code in it
> and don't want to port it around in vain).
>
> Looking forward to your feedback.

What are you thinking of as a use-case for this?

I think it would be useful to be able to run what appear to be multiple NFS
servers on a single host; and for that, we would want to vary more than
just the ip_map_cache. The export-related caches (nfsd.fh and
nfsd.export), at least.

--b.

2010-09-20 20:37:46

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers


On Sep 20, 2010, at 3:56 PM, J. Bruce Fields wrote:

> On Mon, Sep 20, 2010 at 03:28:00PM -0400, Chuck Lever wrote:
>>
>> On Sep 20, 2010, at 3:13 PM, Pavel Emelyanov wrote:
>>> The nearest plan is
>>>
>>> 1. Prepare the sunrpc layer to work in net namespaces
>>> 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
>>> 3. Make support for multiple instances of the nfsd caches
>>> 4. Make support for multiple instances of the nfsd_serv
>>>
>>> After this several NFSd-s can be used in containers (hopefully I
>>> didn't miss anything).
>>
>> Are you assuming NFSv4 only? Something needs to be done about NLM and
>> NSM to make this work right.
>>
>> Is there an issue for idmapper and svcgssd? Probably not, but worth
>> exploring.
>>
>> And, how about AUTH_SYS certs? These contain the host's name in them,
>> and that depends on the net namespace. NLM uses AUTH_SYS, and I
>> believe the NFS server can make NLM calls to the client.
>
> The client probably can't use the auth_sys cred on nlm callbacks in any
> sensible way, so this may not be a big deal.

I doubt anything looks at that hostname, really. My worry is that it could leak information (like the wrong hostname) onto the network.

--
chuck[dot]lever[at]oracle[dot]com


2010-09-20 20:05:21

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, 2010-09-20 at 23:13 +0400, Pavel Emelyanov wrote:
> On 09/20/2010 10:04 PM, J. Bruce Fields wrote:
> > On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
> >>>> Looking forward to your feedback.
> >>>
> >>> What are you thinking of as a use-case for this?
> >>
> >> To make it possible to run both NFS server and client in containers.
> >
> > Could you describe that in user-visible terms? (Currently if I create a
> > new network namespace, what happens, and what will happen differently
> > afterwards?)
>
> This is not about the network namespace only I believe. E.g. the
> nfsd filesystem is a filesystem already and shouldn't be tied to
> any task-driven context.
>
> E.g. as far as the net namespace part is concerned. First of all
> the TCP/UDP socket used by transport will be per-namespace. User
> will "feel" this for example by different routing and netfilter
> rules applied to connections. Besides the rpc service sockets will
> be per namespace as well.
>
> >> Sure! The thing is that the full containerization of that stuff is
> >> too many patches and I'm not sure that you and other maintainers wish
> >> to review the 100-patch set in one go ;)
> >
> > Well, if it's really all ready....
> >
> > Better, though, would be an outline of the work to be done and what you
> > expect to be working at the end.
>
> The nearest plan is
>
> 1. Prepare the sunrpc layer to work in net namespaces
> 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> 3. Make support for multiple instances of the nfsd caches
> 4. Make support for multiple instances of the nfsd_serv
>
> After this several NFSd-s can be used in containers (hopefully I
> didn't miss anything).
>
> Plans about the nfs client are much more obscure for now.

The client should be something like the following:

1) Ensure sunrpc sockets are created using the correct net namespace
2) Convert rpc_pipefs to be per-net namespace.
3) Convert the nfs_client and superblock to be per-net namespace
4) Convert lockd's struct host to be per-net namespace

Cheers
Trond

2010-09-15 13:05:03

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 1/9] sunrpc: Pass the ip_map_parse's cd to lower calls

The goal is to have multiple ip_map_cache instances in the system. This
particular patch handles its usage by the ip_map_parse callback.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
net/sunrpc/svcauth_unix.c | 31 +++++++++++++++++++++----------
1 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 2073116..1fe37be 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -178,8 +178,8 @@ static int ip_map_upcall(struct cache_detail *cd, struct cache_head *h)
return sunrpc_cache_pipe_upcall(cd, h, ip_map_request);
}

-static struct ip_map *ip_map_lookup(char *class, struct in6_addr *addr);
-static int ip_map_update(struct ip_map *ipm, struct unix_domain *udom, time_t expiry);
+static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class, struct in6_addr *addr);
+static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm, struct unix_domain *udom, time_t expiry);

static int ip_map_parse(struct cache_detail *cd,
char *mesg, int mlen)
@@ -249,9 +249,9 @@ static int ip_map_parse(struct cache_detail *cd,
dom = NULL;

/* IPv6 scope IDs are ignored for now */
- ipmp = ip_map_lookup(class, &sin6.sin6_addr);
+ ipmp = __ip_map_lookup(cd, class, &sin6.sin6_addr);
if (ipmp) {
- err = ip_map_update(ipmp,
+ err = __ip_map_update(cd, ipmp,
container_of(dom, struct unix_domain, h),
expiry);
} else
@@ -309,14 +309,15 @@ struct cache_detail ip_map_cache = {
.alloc = ip_map_alloc,
};

-static struct ip_map *ip_map_lookup(char *class, struct in6_addr *addr)
+static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
+ struct in6_addr *addr)
{
struct ip_map ip;
struct cache_head *ch;

strcpy(ip.m_class, class);
ipv6_addr_copy(&ip.m_addr, addr);
- ch = sunrpc_cache_lookup(&ip_map_cache, &ip.h,
+ ch = sunrpc_cache_lookup(cd, &ip.h,
hash_str(class, IP_HASHBITS) ^
hash_ip6(*addr));

@@ -326,7 +327,13 @@ static struct ip_map *ip_map_lookup(char *class, struct in6_addr *addr)
return NULL;
}

-static int ip_map_update(struct ip_map *ipm, struct unix_domain *udom, time_t expiry)
+static inline struct ip_map *ip_map_lookup(char *class, struct in6_addr *addr)
+{
+ return __ip_map_lookup(&ip_map_cache, class, addr);
+}
+
+static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
+ struct unix_domain *udom, time_t expiry)
{
struct ip_map ip;
struct cache_head *ch;
@@ -344,16 +351,20 @@ static int ip_map_update(struct ip_map *ipm, struct unix_domain *udom, time_t ex
ip.m_add_change++;
}
ip.h.expiry_time = expiry;
- ch = sunrpc_cache_update(&ip_map_cache,
- &ip.h, &ipm->h,
+ ch = sunrpc_cache_update(cd, &ip.h, &ipm->h,
hash_str(ipm->m_class, IP_HASHBITS) ^
hash_ip6(ipm->m_addr));
if (!ch)
return -ENOMEM;
- cache_put(ch, &ip_map_cache);
+ cache_put(ch, cd);
return 0;
}

+static inline int ip_map_update(struct ip_map *ipm, struct unix_domain *udom, time_t expiry)
+{
+ return __ip_map_update(&ip_map_cache, ipm, udom, expiry);
+}
+
int auth_unix_add_addr(struct in6_addr *addr, struct auth_domain *dom)
{
struct unix_domain *udom;
--
1.5.5.6


2010-09-20 20:13:59

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, 2010-09-20 at 15:56 -0400, J. Bruce Fields wrote:
> On Mon, Sep 20, 2010 at 03:28:00PM -0400, Chuck Lever wrote:
> >
> > On Sep 20, 2010, at 3:13 PM, Pavel Emelyanov wrote:
> > > The nearest plan is
> > >
> > > 1. Prepare the sunrpc layer to work in net namespaces
> > > 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> > > 3. Make support for multiple instances of the nfsd caches
> > > 4. Make support for multiple instances of the nfsd_serv
> > >
> > > After this several NFSd-s can be used in containers (hopefully I
> > > didn't miss anything).
> >
> > Are you assuming NFSv4 only? Something needs to be done about NLM and
> > NSM to make this work right.
> >
> > Is there an issue for idmapper and svcgssd? Probably not, but worth
> > exploring.
> >
> > And, how about AUTH_SYS certs? These contain the host's name in them,
> > and that depends on the net namespace. NLM uses AUTH_SYS, and I
> > believe the NFS server can make NLM calls to the client.
>
> The client probably can't use the auth_sys cred on nlm callbacks in any
> sensible way, so this may not be a big deal.

If clients are per-net namespace, then the cl_nodename can and should be
converted to reflect the utsname()->nodename. We currently force it to
be the init_utsname()->nodename, precisely because we don't support
namespaces.
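
To illustrate (the helper name and buffer handling below are assumptions, not
code from this thread), the change amounts to taking the name from the
caller's uts namespace instead of the initial one:

#include <linux/utsname.h>
#include <linux/string.h>

/* Sketch only: fill in an RPC client's node name. */
static void rpc_clnt_pick_nodename(char *dst, size_t len)
{
	/* current behaviour: strlcpy(dst, init_utsname()->nodename, len); */
	strlcpy(dst, utsname()->nodename, len);	/* per-namespace nodename */
}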

Cheers
Trond

2010-09-15 13:05:00

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 5/9] sunrpc: Add routines that allow registering per-net caches

Existing calls do the same, but for the init_net.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
include/linux/sunrpc/cache.h | 2 ++
net/sunrpc/cache.c | 26 ++++++++++++++++++--------
2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 7bf3e84..c486487 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -194,7 +194,9 @@ extern void cache_purge(struct cache_detail *detail);
#define NEVER (0x7FFFFFFF)
extern void __init cache_initialize(void);
extern int cache_register(struct cache_detail *cd);
+extern int cache_register_net(struct cache_detail *cd, struct net *net);
extern void cache_unregister(struct cache_detail *cd);
+extern void cache_unregister_net(struct cache_detail *cd, struct net *net);

extern int sunrpc_cache_register_pipefs(struct dentry *parent, const char *,
mode_t, struct cache_detail *);
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 2b06410..27e12ae 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1443,7 +1443,7 @@ static const struct file_operations cache_flush_operations_procfs = {
.release = release_flush_procfs,
};

-static void remove_cache_proc_entries(struct cache_detail *cd)
+static void remove_cache_proc_entries(struct cache_detail *cd, struct net *net)
{
if (cd->u.procfs.proc_ent == NULL)
return;
@@ -1458,7 +1458,7 @@ static void remove_cache_proc_entries(struct cache_detail *cd)
}

#ifdef CONFIG_PROC_FS
-static int create_cache_proc_entries(struct cache_detail *cd)
+static int create_cache_proc_entries(struct cache_detail *cd, struct net *net)
{
struct proc_dir_entry *p;

@@ -1493,11 +1493,11 @@ static int create_cache_proc_entries(struct cache_detail *cd)
}
return 0;
out_nomem:
- remove_cache_proc_entries(cd);
+ remove_cache_proc_entries(cd, net);
return -ENOMEM;
}
#else /* CONFIG_PROC_FS */
-static int create_cache_proc_entries(struct cache_detail *cd)
+static int create_cache_proc_entries(struct cache_detail *cd, struct net *net)
{
return 0;
}
@@ -1508,23 +1508,33 @@ void __init cache_initialize(void)
INIT_DELAYED_WORK_DEFERRABLE(&cache_cleaner, do_cache_clean);
}

-int cache_register(struct cache_detail *cd)
+int cache_register_net(struct cache_detail *cd, struct net *net)
{
int ret;

sunrpc_init_cache_detail(cd);
- ret = create_cache_proc_entries(cd);
+ ret = create_cache_proc_entries(cd, net);
if (ret)
sunrpc_destroy_cache_detail(cd);
return ret;
}
+
+int cache_register(struct cache_detail *cd)
+{
+ return cache_register_net(cd, &init_net);
+}
EXPORT_SYMBOL_GPL(cache_register);

-void cache_unregister(struct cache_detail *cd)
+void cache_unregister_net(struct cache_detail *cd, struct net *net)
{
- remove_cache_proc_entries(cd);
+ remove_cache_proc_entries(cd, net);
sunrpc_destroy_cache_detail(cd);
}
+
+void cache_unregister(struct cache_detail *cd)
+{
+ cache_unregister_net(cd, &init_net);
+}
EXPORT_SYMBOL_GPL(cache_unregister);

static ssize_t cache_read_pipefs(struct file *filp, char __user *buf,
--
1.5.5.6


2010-09-20 17:20:52

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 7/9] sunrpc: The per-net skeleton

On Wed, Sep 15, 2010 at 04:28:25PM +0400, Pavel Emelyanov wrote:
> @@ -38,18 +58,26 @@ init_sunrpc(void)
> err = rpcauth_init_module();
> if (err)
> goto out3;
> +
> + cache_initialize();
> +
> + err = register_pernet_subsys(&sunrpc_net_ops);
> + if (err)
> + goto out4;
> #ifdef RPC_DEBUG
> rpc_register_sysctl();
> #endif
> #ifdef CONFIG_PROC_FS
> rpc_proc_init();
> #endif
> - cache_initialize();
> cache_register(&ip_map_cache);
> cache_register(&unix_gid_cache);
> svc_init_xprt_sock(); /* svc sock transport */
> init_socket_xprt(); /* clnt sock transport */
> return 0;
> +
> +out4:
> + unregister_pernet_subsys(&sunrpc_net_ops);

If register_pernet_subsys() failed, then shouldn't this be unnecessary?
Maybe this should be rpcauth_remove_module()?

--b.

> out3:
> rpc_destroy_mempool();
> out2:
> @@ -69,6 +97,7 @@ cleanup_sunrpc(void)
> rpc_destroy_mempool();
> cache_unregister(&ip_map_cache);
> cache_unregister(&unix_gid_cache);
> + unregister_pernet_subsys(&sunrpc_net_ops);
> #ifdef RPC_DEBUG
> rpc_unregister_sysctl();
> #endif
> --
> 1.5.5.6
>
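
A sketch of the unwind order being suggested here, assuming the existing
rpcauth_remove_module() helper (the counterpart of rpcauth_init_module()):
when register_pernet_subsys() itself fails there is nothing per-net to undo,
so out4 should roll back the last step that actually succeeded.

	err = rpcauth_init_module();
	if (err)
		goto out3;

	cache_initialize();

	err = register_pernet_subsys(&sunrpc_net_ops);
	if (err)
		goto out4;
	/* ... remaining init ... */
	return 0;

out4:
	rpcauth_remove_module();	/* undo rpcauth_init_module(), not the
					 * registration that just failed */
out3:
	rpc_destroy_mempool();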

2010-09-15 13:05:07

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 8/9] sunrpc: Make the /proc/net/rpc appear in net namespaces

Signed-off-by: Pavel Emelyanov <[email protected]>

---
include/linux/sunrpc/stats.h | 23 ++++++++++++++-------
net/sunrpc/cache.c | 10 +++++++-
net/sunrpc/netns.h | 1 +
net/sunrpc/stats.c | 43 ++++++++++++++++++++++++-----------------
net/sunrpc/sunrpc_syms.c | 16 +++++++++-----
5 files changed, 59 insertions(+), 34 deletions(-)

diff --git a/include/linux/sunrpc/stats.h b/include/linux/sunrpc/stats.h
index 5fa0f20..680471d 100644
--- a/include/linux/sunrpc/stats.h
+++ b/include/linux/sunrpc/stats.h
@@ -38,8 +38,21 @@ struct svc_stat {
rpcbadclnt;
};

-void rpc_proc_init(void);
-void rpc_proc_exit(void);
+struct net;
+#ifdef CONFIG_PROC_FS
+int rpc_proc_init(struct net *);
+void rpc_proc_exit(struct net *);
+#else
+static inline int rpc_proc_init(struct net *net)
+{
+ return 0;
+}
+
+static inline void rpc_proc_exit(struct net *net)
+{
+}
+#endif
+
#ifdef MODULE
void rpc_modcount(struct inode *, int);
#endif
@@ -54,9 +67,6 @@ void svc_proc_unregister(const char *);

void svc_seq_show(struct seq_file *,
const struct svc_stat *);
-
-extern struct proc_dir_entry *proc_net_rpc;
-
#else

static inline struct proc_dir_entry *rpc_proc_register(struct rpc_stat *s) { return NULL; }
@@ -69,9 +79,6 @@ static inline void svc_proc_unregister(const char *p) {}

static inline void svc_seq_show(struct seq_file *seq,
const struct svc_stat *st) {}
-
-#define proc_net_rpc NULL
-
#endif

#endif /* _LINUX_SUNRPC_STATS_H */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 27e12ae..8e647a5 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -34,6 +34,7 @@
#include <linux/sunrpc/cache.h>
#include <linux/sunrpc/stats.h>
#include <linux/sunrpc/rpc_pipe_fs.h>
+#include "netns.h"

#define RPCDBG_FACILITY RPCDBG_CACHE

@@ -1445,6 +1446,8 @@ static const struct file_operations cache_flush_operations_procfs = {

static void remove_cache_proc_entries(struct cache_detail *cd, struct net *net)
{
+ struct sunrpc_net *sn;
+
if (cd->u.procfs.proc_ent == NULL)
return;
if (cd->u.procfs.flush_ent)
@@ -1454,15 +1457,18 @@ static void remove_cache_proc_entries(struct cache_detail *cd, struct net *net)
if (cd->u.procfs.content_ent)
remove_proc_entry("content", cd->u.procfs.proc_ent);
cd->u.procfs.proc_ent = NULL;
- remove_proc_entry(cd->name, proc_net_rpc);
+ sn = net_generic(net, sunrpc_net_id);
+ remove_proc_entry(cd->name, sn->proc_net_rpc);
}

#ifdef CONFIG_PROC_FS
static int create_cache_proc_entries(struct cache_detail *cd, struct net *net)
{
struct proc_dir_entry *p;
+ struct sunrpc_net *sn;

- cd->u.procfs.proc_ent = proc_mkdir(cd->name, proc_net_rpc);
+ sn = net_generic(net, sunrpc_net_id);
+ cd->u.procfs.proc_ent = proc_mkdir(cd->name, sn->proc_net_rpc);
if (cd->u.procfs.proc_ent == NULL)
goto out_nomem;
cd->u.procfs.channel_ent = NULL;
diff --git a/net/sunrpc/netns.h b/net/sunrpc/netns.h
index b2d18af..e52ce89 100644
--- a/net/sunrpc/netns.h
+++ b/net/sunrpc/netns.h
@@ -5,6 +5,7 @@
#include <net/netns/generic.h>

struct sunrpc_net {
+ struct proc_dir_entry *proc_net_rpc;
};

extern int sunrpc_net_id;
diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index ea1046f..f71a731 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -22,11 +22,10 @@
#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/svcsock.h>
#include <linux/sunrpc/metrics.h>
-#include <net/net_namespace.h>

-#define RPCDBG_FACILITY RPCDBG_MISC
+#include "netns.h"

-struct proc_dir_entry *proc_net_rpc = NULL;
+#define RPCDBG_FACILITY RPCDBG_MISC

/*
* Get RPC client stats
@@ -218,10 +217,11 @@ EXPORT_SYMBOL_GPL(rpc_print_iostats);
static inline struct proc_dir_entry *
do_register(const char *name, void *data, const struct file_operations *fops)
{
- rpc_proc_init();
- dprintk("RPC: registering /proc/net/rpc/%s\n", name);
+ struct sunrpc_net *sn;

- return proc_create_data(name, 0, proc_net_rpc, fops, data);
+ dprintk("RPC: registering /proc/net/rpc/%s\n", name);
+ sn = net_generic(&init_net, sunrpc_net_id);
+ return proc_create_data(name, 0, sn->proc_net_rpc, fops, data);
}

struct proc_dir_entry *
@@ -234,7 +234,10 @@ EXPORT_SYMBOL_GPL(rpc_proc_register);
void
rpc_proc_unregister(const char *name)
{
- remove_proc_entry(name, proc_net_rpc);
+ struct sunrpc_net *sn;
+
+ sn = net_generic(&init_net, sunrpc_net_id);
+ remove_proc_entry(name, sn->proc_net_rpc);
}
EXPORT_SYMBOL_GPL(rpc_proc_unregister);

@@ -248,25 +251,29 @@ EXPORT_SYMBOL_GPL(svc_proc_register);
void
svc_proc_unregister(const char *name)
{
- remove_proc_entry(name, proc_net_rpc);
+ struct sunrpc_net *sn;
+
+ sn = net_generic(&init_net, sunrpc_net_id);
+ remove_proc_entry(name, sn->proc_net_rpc);
}
EXPORT_SYMBOL_GPL(svc_proc_unregister);

-void
-rpc_proc_init(void)
+int rpc_proc_init(struct net *net)
{
+ struct sunrpc_net *sn;
+
dprintk("RPC: registering /proc/net/rpc\n");
- if (!proc_net_rpc)
- proc_net_rpc = proc_mkdir("rpc", init_net.proc_net);
+ sn = net_generic(net, sunrpc_net_id);
+ sn->proc_net_rpc = proc_mkdir("rpc", net->proc_net);
+ if (sn->proc_net_rpc == NULL)
+ return -ENOMEM;
+
+ return 0;
}

-void
-rpc_proc_exit(void)
+void rpc_proc_exit(struct net *net)
{
dprintk("RPC: unregistering /proc/net/rpc\n");
- if (proc_net_rpc) {
- proc_net_rpc = NULL;
- remove_proc_entry("rpc", init_net.proc_net);
- }
+ remove_proc_entry("rpc", net->proc_net);
}

diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index d552a6a..7d894e5 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -28,11 +28,21 @@ int sunrpc_net_id;

static __net_init int sunrpc_init_net(struct net *net)
{
+ int err;
+
+ err = rpc_proc_init(net);
+ if (err)
+ goto err_proc;
+
return 0;
+
+err_proc:
+ return err;
}

static __net_exit void sunrpc_exit_net(struct net *net)
{
+ rpc_proc_exit(net);
}

static struct pernet_operations sunrpc_net_ops = {
@@ -67,9 +77,6 @@ init_sunrpc(void)
#ifdef RPC_DEBUG
rpc_register_sysctl();
#endif
-#ifdef CONFIG_PROC_FS
- rpc_proc_init();
-#endif
cache_register(&ip_map_cache);
cache_register(&unix_gid_cache);
svc_init_xprt_sock(); /* svc sock transport */
@@ -101,9 +108,6 @@ cleanup_sunrpc(void)
#ifdef RPC_DEBUG
rpc_unregister_sysctl();
#endif
-#ifdef CONFIG_PROC_FS
- rpc_proc_exit();
-#endif
rcu_barrier(); /* Wait for completion of call_rcu()'s */
}
MODULE_LICENSE("GPL");
--
1.5.5.6


2010-09-20 20:11:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, Sep 20, 2010 at 04:05:16PM -0400, Trond Myklebust wrote:
> On Mon, 2010-09-20 at 23:13 +0400, Pavel Emelyanov wrote:
> > On 09/20/2010 10:04 PM, J. Bruce Fields wrote:
> > > On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
> > >>>> Looking forward to your feedback.
> > >>>
> > >>> What are you thinking of as a use-case for this?
> > >>
> > >> To make it possible to run both NFS server and client in containers.
> > >
> > > Could you describe that in user-visible terms? (Currently if I create a
> > > new network namespace, what happens, and what will happen differently
> > > afterwards?)
> >
> > This is not about the network namespace only I believe. E.g. the
> > nfsd filesystem is a filesystem already and shouldn't be tied to
> > any task-driven context.
> >
> > E.g. as far as the net namespace part is concerned. First of all
> > the TCP/UDP socket used by transport will be per-namespace. User
> > will "feel" this for example by different routing and netfilter
> > rules applied to connections. Besides the rpc service sockets will
> > be per namespace as well.
> >
> > >> Sure! The thing is that the full containerization of that stuff is
> > >> too many patches and I'm not sure that you and other maintainers wish
> > >> to review the 100-patch set in one go ;)
> > >
> > > Well, if it's really all ready....
> > >
> > > Better, though, would be an outline of the work to be done and what you
> > > expect to be working at the end.
> >
> > The nearest plan is
> >
> > 1. Prepare the sunrpc layer to work in net namespaces
> > 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> > 3. Make support for multiple instances of the nfsd caches
> > 4. Make support for multiple instances of the nfsd_serv
> >
> > After this several NFSd-s can be used in containers (hopefully I
> > didn't miss anything).
> >
> > Plans about the nfs client are much more obscure for now.
>
> The client should be something like the following:
>
> 1) Ensure sunrpc sockets are created using the correct net namespace

For the client, that's initially the net namespace of the mount? (What
about submounts?)

> 2) Convert rpc_pipefs to be per-net namespace.
> 3) Convert the nfs_client and superblock to be per-net namespace
> 4) Convert lockd's struct host to be per-net namespace

What do we expect behavior to actually look like from the point of view
of somebody on the client?

I'd like to see someone write some kind of spec for how this should all
work. That worries me a lot more than the code.....

--b.

2010-09-15 16:05:46

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On 09/15/2010 07:31 PM, Boaz Harrosh wrote:
> On 09/15/2010 02:23 PM, Pavel Emelyanov wrote:
>> Hello everyone!
>>
>> I would like to prepare the sunrpc layer for working in containerized
>> environments. The ultimate goal is to make both nfs client and server
>> work in containers. Hopefully you won't object :)
>>
>> Not to look like an idle talker I've prepared this set which makes the
>> /proc/net/rpc appear in net namespaces and made the ip_map_cache be per-net.
>>
>> I do not have any plans for when these patches will appear in Linus' tree and
>> thus do not know which git tree to hack on. That said I prepared the patches
>> against the net-next tree (I have some custom netns debugging code in it
>> and don't want to port it around in vain).
>>
>> Looking forward to your feedback.
>>
>
> We are very curious people and would like to know more.

Sure.

> What does it mean to work in containers? Why is it better for
> the client why is it better for the server? who and how to enjoy
> this benefits. Does it have any effects on the operation of nfs/nfsd?
> What problem does it solve?

That's an old question, actually. What it means, why it is better, who gets
to enjoy it and how, etc. were all answered years ago. Do you want me to
google those articles for you?

If I misunderstood your question, please rephrase.

Thanks
Pavel

> Thanks
> Boaz
>
>> Thanks,
>> Pavel
>


2010-09-15 12:29:41

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 9/9] sunrpc: Make the ip_map_cache be per-net

Everything that is required for that already exists:
* the per-net cache registration with respective proc entries
* the context (struct net) is available to all the users

Signed-off-by: Pavel Emelyanov <[email protected]>

---
net/sunrpc/netns.h | 6 ++
net/sunrpc/sunrpc_syms.c | 11 +++-
net/sunrpc/svcauth_unix.c | 122 ++++++++++++++++++++++++++++++++++----------
3 files changed, 108 insertions(+), 31 deletions(-)

diff --git a/net/sunrpc/netns.h b/net/sunrpc/netns.h
index e52ce89..d013bf2 100644
--- a/net/sunrpc/netns.h
+++ b/net/sunrpc/netns.h
@@ -4,10 +4,16 @@
#include <net/net_namespace.h>
#include <net/netns/generic.h>

+struct cache_detail;
+
struct sunrpc_net {
struct proc_dir_entry *proc_net_rpc;
+ struct cache_detail *ip_map_cache;
};

extern int sunrpc_net_id;

+int ip_map_cache_create(struct net *);
+void ip_map_cache_destroy(struct net *);
+
#endif
diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index 7d894e5..ca3cd9c 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -34,14 +34,21 @@ static __net_init int sunrpc_init_net(struct net *net)
if (err)
goto err_proc;

+ err = ip_map_cache_create(net);
+ if (err)
+ goto err_ipmap;
+
return 0;

+err_ipmap:
+ rpc_proc_exit(net);
err_proc:
return err;
}

static __net_exit void sunrpc_exit_net(struct net *net)
{
+ ip_map_cache_destroy(net);
rpc_proc_exit(net);
}

@@ -52,7 +59,7 @@ static struct pernet_operations sunrpc_net_ops = {
.size = sizeof(struct sunrpc_net),
};

-extern struct cache_detail ip_map_cache, unix_gid_cache;
+extern struct cache_detail unix_gid_cache;

extern void cleanup_rpcb_clnt(void);

@@ -77,7 +84,6 @@ init_sunrpc(void)
#ifdef RPC_DEBUG
rpc_register_sysctl();
#endif
- cache_register(&ip_map_cache);
cache_register(&unix_gid_cache);
svc_init_xprt_sock(); /* svc sock transport */
init_socket_xprt(); /* clnt sock transport */
@@ -102,7 +108,6 @@ cleanup_sunrpc(void)
svc_cleanup_xprt_sock();
unregister_rpc_pipefs();
rpc_destroy_mempool();
- cache_unregister(&ip_map_cache);
cache_unregister(&unix_gid_cache);
unregister_pernet_subsys(&sunrpc_net_ops);
#ifdef RPC_DEBUG
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 4500fd8..b24cdcb 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -18,6 +18,8 @@

#include <linux/sunrpc/clnt.h>

+#include "netns.h"
+
/*
* AUTHUNIX and AUTHNULL credentials are both handled here.
* AUTHNULL is treated just like AUTHUNIX except that the uid/gid
@@ -92,7 +94,6 @@ struct ip_map {
struct unix_domain *m_client;
int m_add_change;
};
-static struct cache_head *ip_table[IP_HASHMAX];

static void ip_map_put(struct kref *kref)
{
@@ -294,21 +295,6 @@ static int ip_map_show(struct seq_file *m,
}


-struct cache_detail ip_map_cache = {
- .owner = THIS_MODULE,
- .hash_size = IP_HASHMAX,
- .hash_table = ip_table,
- .name = "auth.unix.ip",
- .cache_put = ip_map_put,
- .cache_upcall = ip_map_upcall,
- .cache_parse = ip_map_parse,
- .cache_show = ip_map_show,
- .match = ip_map_match,
- .init = ip_map_init,
- .update = update,
- .alloc = ip_map_alloc,
-};
-
static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
struct in6_addr *addr)
{
@@ -330,7 +316,10 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
static inline struct ip_map *ip_map_lookup(struct net *net, char *class,
struct in6_addr *addr)
{
- return __ip_map_lookup(&ip_map_cache, class, addr);
+ struct sunrpc_net *sn;
+
+ sn = net_generic(net, sunrpc_net_id);
+ return __ip_map_lookup(sn->ip_map_cache, class, addr);
}

static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
@@ -364,7 +353,10 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
static inline int ip_map_update(struct net *net, struct ip_map *ipm,
struct unix_domain *udom, time_t expiry)
{
- return __ip_map_update(&ip_map_cache, ipm, udom, expiry);
+ struct sunrpc_net *sn;
+
+ sn = net_generic(net, sunrpc_net_id);
+ return __ip_map_update(sn->ip_map_cache, ipm, udom, expiry);
}

int auth_unix_add_addr(struct net *net, struct in6_addr *addr, struct auth_domain *dom)
@@ -400,12 +392,14 @@ struct auth_domain *auth_unix_lookup(struct net *net, struct in6_addr *addr)
{
struct ip_map *ipm;
struct auth_domain *rv;
+ struct sunrpc_net *sn;

+ sn = net_generic(net, sunrpc_net_id);
ipm = ip_map_lookup(net, "nfsd", addr);

if (!ipm)
return NULL;
- if (cache_check(&ip_map_cache, &ipm->h, NULL))
+ if (cache_check(sn->ip_map_cache, &ipm->h, NULL))
return NULL;

if ((ipm->m_client->addr_changes - ipm->m_add_change) >0) {
@@ -416,14 +410,21 @@ struct auth_domain *auth_unix_lookup(struct net *net, struct in6_addr *addr)
rv = &ipm->m_client->h;
kref_get(&rv->ref);
}
- cache_put(&ipm->h, &ip_map_cache);
+ cache_put(&ipm->h, sn->ip_map_cache);
return rv;
}
EXPORT_SYMBOL_GPL(auth_unix_lookup);

void svcauth_unix_purge(void)
{
- cache_purge(&ip_map_cache);
+ struct net *net;
+
+ for_each_net(net) {
+ struct sunrpc_net *sn;
+
+ sn = net_generic(net, sunrpc_net_id);
+ cache_purge(sn->ip_map_cache);
+ }
}
EXPORT_SYMBOL_GPL(svcauth_unix_purge);

@@ -431,6 +432,7 @@ static inline struct ip_map *
ip_map_cached_get(struct svc_xprt *xprt)
{
struct ip_map *ipm = NULL;
+ struct sunrpc_net *sn;

if (test_bit(XPT_CACHE_AUTH, &xprt->xpt_flags)) {
spin_lock(&xprt->xpt_lock);
@@ -442,9 +444,10 @@ ip_map_cached_get(struct svc_xprt *xprt)
* remembered, e.g. by a second mount from the
* same IP address.
*/
+ sn = net_generic(xprt->xpt_net, sunrpc_net_id);
xprt->xpt_auth_cache = NULL;
spin_unlock(&xprt->xpt_lock);
- cache_put(&ipm->h, &ip_map_cache);
+ cache_put(&ipm->h, sn->ip_map_cache);
return NULL;
}
cache_get(&ipm->h);
@@ -466,8 +469,12 @@ ip_map_cached_put(struct svc_xprt *xprt, struct ip_map *ipm)
}
spin_unlock(&xprt->xpt_lock);
}
- if (ipm)
- cache_put(&ipm->h, &ip_map_cache);
+ if (ipm) {
+ struct sunrpc_net *sn;
+
+ sn = net_generic(xprt->xpt_net, sunrpc_net_id);
+ cache_put(&ipm->h, sn->ip_map_cache);
+ }
}

void
@@ -476,8 +483,12 @@ svcauth_unix_info_release(struct svc_xprt *xpt)
struct ip_map *ipm;

ipm = xpt->xpt_auth_cache;
- if (ipm != NULL)
- cache_put(&ipm->h, &ip_map_cache);
+ if (ipm != NULL) {
+ struct sunrpc_net *sn;
+
+ sn = net_generic(xpt->xpt_net, sunrpc_net_id);
+ cache_put(&ipm->h, sn->ip_map_cache);
+ }
}

/****************************************************************************
@@ -705,6 +716,8 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)
struct group_info *gi;
struct svc_cred *cred = &rqstp->rq_cred;
struct svc_xprt *xprt = rqstp->rq_xprt;
+ struct net *net = xprt->xpt_net;
+ struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);

switch (rqstp->rq_addr.ss_family) {
case AF_INET:
@@ -725,13 +738,13 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)

ipm = ip_map_cached_get(xprt);
if (ipm == NULL)
- ipm = ip_map_lookup(&init_net, rqstp->rq_server->sv_program->pg_class,
+ ipm = __ip_map_lookup(sn->ip_map_cache, rqstp->rq_server->sv_program->pg_class,
&sin6->sin6_addr);

if (ipm == NULL)
return SVC_DENIED;

- switch (cache_check(&ip_map_cache, &ipm->h, &rqstp->rq_chandle)) {
+ switch (cache_check(sn->ip_map_cache, &ipm->h, &rqstp->rq_chandle)) {
default:
BUG();
case -EAGAIN:
@@ -900,3 +913,56 @@ struct auth_ops svcauth_unix = {
.set_client = svcauth_unix_set_client,
};

+int ip_map_cache_create(struct net *net)
+{
+ int err = -ENOMEM;
+ struct cache_detail *cd;
+ struct cache_head **tbl;
+ struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
+
+ cd = kzalloc(sizeof(struct cache_detail), GFP_KERNEL);
+ if (cd == NULL)
+ goto err_cd;
+
+ tbl = kzalloc(IP_HASHMAX * sizeof(struct cache_head *), GFP_KERNEL);
+ if (tbl == NULL)
+ goto err_tbl;
+
+ cd->owner = THIS_MODULE;
+ cd->hash_size = IP_HASHMAX;
+ cd->hash_table = tbl;
+ cd->name = "auth.unix.ip";
+ cd->cache_put = ip_map_put;
+ cd->cache_upcall = ip_map_upcall;
+ cd->cache_parse = ip_map_parse;
+ cd->cache_show = ip_map_show;
+ cd->match = ip_map_match;
+ cd->init = ip_map_init;
+ cd->update = update;
+ cd->alloc = ip_map_alloc;
+
+ err = cache_register_net(cd, net);
+ if (err)
+ goto err_reg;
+
+ sn->ip_map_cache = cd;
+ return 0;
+
+err_reg:
+ kfree(tbl);
+err_tbl:
+ kfree(cd);
+err_cd:
+ return err;
+}
+
+void ip_map_cache_destroy(struct net *net)
+{
+ struct sunrpc_net *sn;
+
+ sn = net_generic(net, sunrpc_net_id);
+ cache_purge(sn->ip_map_cache);
+ cache_unregister_net(sn->ip_map_cache, net);
+ kfree(sn->ip_map_cache->hash_table);
+ kfree(sn->ip_map_cache);
+}
--
1.5.5.6


2010-09-15 13:05:05

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 4/9] sunrpc: Add net to pure API calls

There are two calls that operate on ip_map_cache and are
directly called from the nfsd code. Other places will be
handled in a different way.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
fs/nfsd/export.c | 2 +-
fs/nfsd/nfsctl.c | 4 ++--
include/linux/sunrpc/svcauth.h | 4 ++--
net/sunrpc/svcauth_unix.c | 18 ++++++++++--------
4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index c2a4f71..3195e8b 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -1563,7 +1563,7 @@ exp_addclient(struct nfsctl_client *ncp)
/* Insert client into hashtable. */
for (i = 0; i < ncp->cl_naddr; i++) {
ipv6_addr_set_v4mapped(ncp->cl_addrlist[i].s_addr, &addr6);
- auth_unix_add_addr(&addr6, dom);
+ auth_unix_add_addr(&init_net, &addr6, dom);
}
auth_unix_forget_old(dom);
auth_domain_put(dom);
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index b53b1d0..b31fe5a 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -401,7 +401,7 @@ static ssize_t write_getfs(struct file *file, char *buf, size_t size)

ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &in6);

- clp = auth_unix_lookup(&in6);
+ clp = auth_unix_lookup(&init_net, &in6);
if (!clp)
err = -EPERM;
else {
@@ -464,7 +464,7 @@ static ssize_t write_getfd(struct file *file, char *buf, size_t size)

ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &in6);

- clp = auth_unix_lookup(&in6);
+ clp = auth_unix_lookup(&init_net, &in6);
if (!clp)
err = -EPERM;
else {
diff --git a/include/linux/sunrpc/svcauth.h b/include/linux/sunrpc/svcauth.h
index f656667..fb1bb2c 100644
--- a/include/linux/sunrpc/svcauth.h
+++ b/include/linux/sunrpc/svcauth.h
@@ -122,10 +122,10 @@ extern void svc_auth_unregister(rpc_authflavor_t flavor);

extern struct auth_domain *unix_domain_find(char *name);
extern void auth_domain_put(struct auth_domain *item);
-extern int auth_unix_add_addr(struct in6_addr *addr, struct auth_domain *dom);
+extern int auth_unix_add_addr(struct net *net, struct in6_addr *addr, struct auth_domain *dom);
extern struct auth_domain *auth_domain_lookup(char *name, struct auth_domain *new);
extern struct auth_domain *auth_domain_find(char *name);
-extern struct auth_domain *auth_unix_lookup(struct in6_addr *addr);
+extern struct auth_domain *auth_unix_lookup(struct net *net, struct in6_addr *addr);
extern int auth_unix_forget_old(struct auth_domain *dom);
extern void svcauth_unix_purge(void);
extern void svcauth_unix_info_release(struct svc_xprt *xpt);
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index f0017ca..4500fd8 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -327,7 +327,8 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
return NULL;
}

-static inline struct ip_map *ip_map_lookup(char *class, struct in6_addr *addr)
+static inline struct ip_map *ip_map_lookup(struct net *net, char *class,
+ struct in6_addr *addr)
{
return __ip_map_lookup(&ip_map_cache, class, addr);
}
@@ -360,12 +361,13 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
return 0;
}

-static inline int ip_map_update(struct ip_map *ipm, struct unix_domain *udom, time_t expiry)
+static inline int ip_map_update(struct net *net, struct ip_map *ipm,
+ struct unix_domain *udom, time_t expiry)
{
return __ip_map_update(&ip_map_cache, ipm, udom, expiry);
}

-int auth_unix_add_addr(struct in6_addr *addr, struct auth_domain *dom)
+int auth_unix_add_addr(struct net *net, struct in6_addr *addr, struct auth_domain *dom)
{
struct unix_domain *udom;
struct ip_map *ipmp;
@@ -373,10 +375,10 @@ int auth_unix_add_addr(struct in6_addr *addr, struct auth_domain *dom)
if (dom->flavour != &svcauth_unix)
return -EINVAL;
udom = container_of(dom, struct unix_domain, h);
- ipmp = ip_map_lookup("nfsd", addr);
+ ipmp = ip_map_lookup(net, "nfsd", addr);

if (ipmp)
- return ip_map_update(ipmp, udom, NEVER);
+ return ip_map_update(net, ipmp, udom, NEVER);
else
return -ENOMEM;
}
@@ -394,12 +396,12 @@ int auth_unix_forget_old(struct auth_domain *dom)
}
EXPORT_SYMBOL_GPL(auth_unix_forget_old);

-struct auth_domain *auth_unix_lookup(struct in6_addr *addr)
+struct auth_domain *auth_unix_lookup(struct net *net, struct in6_addr *addr)
{
struct ip_map *ipm;
struct auth_domain *rv;

- ipm = ip_map_lookup("nfsd", addr);
+ ipm = ip_map_lookup(net, "nfsd", addr);

if (!ipm)
return NULL;
@@ -723,7 +725,7 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)

ipm = ip_map_cached_get(xprt);
if (ipm == NULL)
- ipm = ip_map_lookup(rqstp->rq_server->sv_program->pg_class,
+ ipm = ip_map_lookup(&init_net, rqstp->rq_server->sv_program->pg_class,
&sin6->sin6_addr);

if (ipm == NULL)
--
1.5.5.6


2010-09-15 15:31:33

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On 09/15/2010 02:23 PM, Pavel Emelyanov wrote:
> Hello everyone!
>
> I would like to prepare the sunrpc layer for working in containerized
> environments. The ultimate goal is to make both nfs client and server
> work in containers. Hopefully you won't object :)
>
> Not to look like an idle talker I've prepared this set which makes the
> /proc/net/rpc appear in net namespaces and made the ip_map_cache be per-net.
>
> I do not have any plans for when these patches will appear in Linus' tree and
> thus do not know which git tree to hack on. That said I prepared the patches
> against the net-next tree (I have some custom netns debugging code in it
> and don't want to port it around in vain).
>
> Looking forward to your feedback.
>

We are very curious people and would like to know more.

What does it mean to work in containers? Why is it better for
the client, and why is it better for the server? Who gets to enjoy
these benefits, and how? Does it have any effect on the operation of nfs/nfsd?
What problem does it solve?

Thanks
Boaz

> Thanks,
> Pavel

2010-09-20 21:36:09

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, 2010-09-20 at 16:09 -0400, J. Bruce Fields wrote:
> On Mon, Sep 20, 2010 at 04:05:16PM -0400, Trond Myklebust wrote:
> > On Mon, 2010-09-20 at 23:13 +0400, Pavel Emelyanov wrote:
> > > On 09/20/2010 10:04 PM, J. Bruce Fields wrote:
> > > > On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
> > > >>>> Looking forward to your feedback.
> > > >>>
> > > >>> What are you thinking of as a use-case for this?
> > > >>
> > > >> To make it possible to run both NFS server and client in containers.
> > > >
> > > > Could you describe that in user-visible terms? (Currently if I create a
> > > > new network namespace, what happens, and what will happen differently
> > > > afterwards?)
> > >
> > > This is not about the network namespace only I believe. E.g. the
> > > nfsd filesystem is a filesystem already and shouldn't be tied to
> > > any task-driven context.
> > >
> > > E.g. as far as the net namespace part is concerned. First of all
> > > the TCP/UDP socket used by transport will be per-namespace. User
> > > will "feel" this for example by different routing and netfilter
> > > rules applied to connections. Besides the rpc service sockets will
> > > be per namespace as well.
> > >
> > > >> Sure! The thing is that the full containerization of that stuff is
> > > >> too many patches and I'm not sure that you and other maintainers wish
> > > >> to review the 100-patch set in one go ;)
> > > >
> > > > Well, if it's really all ready....
> > > >
> > > > Better, though, would be an outline of the work to be done and what you
> > > > expect to be working at the end.
> > >
> > > The nearest plan is
> > >
> > > 1. Prepare the sunrpc layer to work in net namespaces
> > > 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> > > 3. Make support for multiple instances of the nfsd caches
> > > 4. Make support for multiple instances of the nfsd_serv
> > >
> > > After this several NFSd-s can be used in containers (hopefully I
> > > didn't miss anything).
> > >
> > > Plans about the nfs client are much more obscure for now.
> >
> > The client should be something like the following:
> >
> > 1) Ensure sunrpc sockets are created using the correct net namespace
>
> For the client, that's initially the net namespace of the mount? (What
> about submounts?)

It is the net namespace of the process that does the mount, yes.

> > 2) Convert rpc_pipefs to be per-net namespace.
> > 3) Convert the nfs_client and superblock to be per-net namespace
> > 4) Convert lockd's struct host to be per-net namespace
>
> What do we expect behavior to actually look like from the point of view
> of somebody on the client?
>
> I'd like to see someone write some kind of spec for how this should all
> work. That worries me a lot more than the code.....

I think it is fairly obvious what should happen once you are in a net
namespace jail: you want all future NFS mounts to confine themselves to
that private net namespace, i.e. they must talk to the portmapper,
rpc.statd, and rpc.gssd that are defined on that net namespace, and they
must confine themselves to that net namespace when talking to servers.

The problem is dealing with clone() and unshare() (i.e. the process of
changing net namespaces).
If the resulting container inherits an NFS mountpoint from its parent
process, then I cannot see how we could sanely migrate that to a new net
namespace, since the super block etc remains shared between the two
containers as part of the mount namespaces. To avoid confusion, I
believe we need to ensure that under-the-cover mounts etc inherit the
same net namespace as the original mount, and they should talk to the
portmapper, rpc.statd and rpc.gssd that the original mount uses. If
those die, then too bad - that's operator error.

Cheers
Trond

2010-09-15 13:04:51

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 6/9] sunrpc: Tag svc_xprt with net

The transport representation should be per-net of course.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
include/linux/sunrpc/svc_xprt.h | 2 ++
net/sunrpc/svc_xprt.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 5f4e18b..e50e3ec 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -66,6 +66,8 @@ struct svc_xprt {
struct sockaddr_storage xpt_remote; /* remote peer's address */
size_t xpt_remotelen; /* length of address */
struct rpc_wait_queue xpt_bc_pending; /* backchannel wait queue */
+
+ struct net *xpt_net;
};

int svc_reg_xprt_class(struct svc_xprt_class *);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 57703ac..c59722d 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -130,6 +130,7 @@ static void svc_xprt_free(struct kref *kref)
struct module *owner = xprt->xpt_class->xcl_owner;
if (test_bit(XPT_CACHE_AUTH, &xprt->xpt_flags))
svcauth_unix_info_release(xprt);
+ put_net(xprt->xpt_net);
xprt->xpt_ops->xpo_free(xprt);
module_put(owner);
}
@@ -159,6 +160,7 @@ void svc_xprt_init(struct svc_xprt_class *xcl, struct svc_xprt *xprt,
spin_lock_init(&xprt->xpt_lock);
set_bit(XPT_BUSY, &xprt->xpt_flags);
rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
+ xprt->xpt_net = get_net(&init_net);
}
EXPORT_SYMBOL_GPL(svc_xprt_init);

--
1.5.5.6


2010-09-20 19:13:23

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On 09/20/2010 10:04 PM, J. Bruce Fields wrote:
> On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
>>>> Looking forward to your feedback.
>>>
>>> What are you thinking of as a use-case for this?
>>
>> To make it possible to run both NFS server and client in containers.
>
> Could you describe that in user-visible terms? (Currently if I create a
> new network namespace, what happens, and what will happen differently
> afterwards?)

This is not only about the network namespace, I believe. E.g. the
nfsd filesystem is already a filesystem and shouldn't be tied to
any task-driven context.

As far as the net namespace part is concerned: first of all, the
TCP/UDP sockets used by the transports will be per-namespace. Users
will "feel" this, for example, through the different routing and
netfilter rules applied to connections. Besides that, the rpc service
sockets will be per-namespace as well.

>> Sure! The thing is that the full containerization of that stuff is
>> too many patches and I'm not sure that you and other maintainers wish
>> to review the 100-patch set in one go ;)
>
> Well, if it's really all ready....
>
> Better, though, would be an outline of the work to be done and what you
> expect to be working at the end.

The nearest plan is

1. Prepare the sunrpc layer to work in net namespaces
2. Make rpcpipefs and nfsd filesystems be mountable multiple times
3. Make support for multiple instances of the nfsd caches
4. Make support for multiple instances of the nfsd_serv

After this several NFSd-s can be used in containers (hopefully I
didn't miss anything).

Plans about the nfs client are much more obscure for now.

>> I want to find out what git tree to hack on and prepare small patch
>> sets making things step-by-step. This one is just the first in a row.
>
> For the server side you can use
>
> git://linux-nfs.org/~bfields/linux.git nfsd-next
>
> though generally the latest upstream will likely work as well.

OK, I will re-base the set onto the nfsd-next then.

> On a quick skim, those patches look fine (and broke up nicely for
> review, thanks). My main concern is just being sure I understand where
> this all ends up.

Well, as far as I know the nfsd and sunrpc code, the plan described
above should cover most of the work needed to containerize nfsd.

> --b.

2010-09-21 07:11:37

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

> The client should be something like the following:
>
> 1) Ensure sunrpc sockets are created using the correct net namespace

Ack

> 2) Convert rpc_pipefs to be per-net namespace.

Trond, I think this part should be done the other way.

You see, rpc_pipefs is already a filesystem and we shouldn't
bind it to any task-driven context. What I was thinking about
in that direction is to make it mountable multiple times.

The central issue here is how we tell rpc_get_mount() which
vfsmount we need. Userspace will just use the per-container (i.e.
per-chroot) instance of it, and the kernel users will work with
the vfsmount obtained by the rpc_get_mount() call.

Now, how do I plan to solve the rpc_get_mount problem. Some time ago
there was similar problem with the devpts filesystem - people making
ptys work per-container tried to solve the same problem and they
ended up (with Al's help) with a yet another devpts mount option which
explicitly stated that a new instance should be created. How do you
think if we do the same for rpc_pipefs (a newinstance mount option) and
add yet another mount option for its only client (nfs) telling it where
to look for the rpc mount for (e.g. rpcmount=/var/...) ?
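
To illustrate what the rpc_pipefs side of that might look like, here is
a devpts-style sketch of the option parsing. Everything in it (the
option name, the token table, the helper) is hypothetical and only
shows the shape of the proposal:

#include <linux/errno.h>
#include <linux/parser.h>
#include <linux/string.h>
#include <linux/types.h>

/* Sketch only: a "newinstance" mount option for rpc_pipefs, modelled
 * on devpts.  None of this exists yet. */
enum { Opt_newinstance, Opt_err };

static const match_table_t rpc_pipefs_tokens = {
	{ Opt_newinstance, "newinstance" },
	{ Opt_err, NULL }
};

static int rpc_pipefs_parse_options(char *data, bool *newinstance)
{
	substring_t args[MAX_OPT_ARGS];
	char *p;

	*newinstance = false;
	while ((p = strsep(&data, ",")) != NULL) {
		if (!*p)
			continue;
		switch (match_token(p, rpc_pipefs_tokens, args)) {
		case Opt_newinstance:
			*newinstance = true;	/* mount a fresh instance */
			break;
		default:
			return -EINVAL;
		}
	}
	return 0;
}

The nfs side would then just carry the proposed rpcmount= path as an
ordinary string option and resolve it at mount time.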

> 3) Convert the nfs_client and superblock to be per-net namespace

Ack about the nfs_client, but as far as the superblock is
concerned - I think we should tag only the nfs_server with
net for the same reasons as in the item 2) above.

> 4) Convert lockd's struct host to be per-net namespace

Ack

> Cheers
> Trond
>
>


2010-09-15 13:04:54

by Pavel Emelyanov

[permalink] [raw]
Subject: [PATCH 3/9] sunrpc: Pass xprt to cached get/put routines

They do not actually require the rqst, and having the xprt simplifies
further patching.

Signed-off-by: Pavel Emelyanov <[email protected]>

---
net/sunrpc/svcauth_unix.c | 12 +++++-------
1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index aef0feb..f0017ca 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -426,10 +426,9 @@ void svcauth_unix_purge(void)
EXPORT_SYMBOL_GPL(svcauth_unix_purge);

static inline struct ip_map *
-ip_map_cached_get(struct svc_rqst *rqstp)
+ip_map_cached_get(struct svc_xprt *xprt)
{
struct ip_map *ipm = NULL;
- struct svc_xprt *xprt = rqstp->rq_xprt;

if (test_bit(XPT_CACHE_AUTH, &xprt->xpt_flags)) {
spin_lock(&xprt->xpt_lock);
@@ -454,10 +453,8 @@ ip_map_cached_get(struct svc_rqst *rqstp)
}

static inline void
-ip_map_cached_put(struct svc_rqst *rqstp, struct ip_map *ipm)
+ip_map_cached_put(struct svc_xprt *xprt, struct ip_map *ipm)
{
- struct svc_xprt *xprt = rqstp->rq_xprt;
-
if (test_bit(XPT_CACHE_AUTH, &xprt->xpt_flags)) {
spin_lock(&xprt->xpt_lock);
if (xprt->xpt_auth_cache == NULL) {
@@ -705,6 +702,7 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)
struct ip_map *ipm;
struct group_info *gi;
struct svc_cred *cred = &rqstp->rq_cred;
+ struct svc_xprt *xprt = rqstp->rq_xprt;

switch (rqstp->rq_addr.ss_family) {
case AF_INET:
@@ -723,7 +721,7 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)
if (rqstp->rq_proc == 0)
return SVC_OK;

- ipm = ip_map_cached_get(rqstp);
+ ipm = ip_map_cached_get(xprt);
if (ipm == NULL)
ipm = ip_map_lookup(rqstp->rq_server->sv_program->pg_class,
&sin6->sin6_addr);
@@ -742,7 +740,7 @@ svcauth_unix_set_client(struct svc_rqst *rqstp)
case 0:
rqstp->rq_client = &ipm->m_client->h;
kref_get(&rqstp->rq_client->ref);
- ip_map_cached_put(rqstp, ipm);
+ ip_map_cached_put(xprt, ipm);
break;
}

--
1.5.5.6


2010-09-20 21:37:43

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, 2010-09-20 at 16:35 -0400, Chuck Lever wrote:
> On Sep 20, 2010, at 3:56 PM, J. Bruce Fields wrote:
>
> > On Mon, Sep 20, 2010 at 03:28:00PM -0400, Chuck Lever wrote:
> >>
> >> On Sep 20, 2010, at 3:13 PM, Pavel Emelyanov wrote:
> >>> The nearest plan is
> >>>
> >>> 1. Prepare the sunrpc layer to work in net namespaces
> >>> 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> >>> 3. Make support for multiple instances of the nfsd caches
> >>> 4. Make support for multiple instances of the nfsd_serv
> >>>
> >>> After this several NFSd-s can be used in containers (hopefully I
> >>> didn't miss anything).
> >>
> >> Are you assuming NFSv4 only? Something needs to be done about NLM and
> >> NSM to make this work right.
> >>
> >> Is there an issue for idmapper and svcgssd? Probably not, but worth
> >> exploring.
> >>
> >> And, how about AUTH_SYS certs? These contain the host's name in them,
> >> and that depends on the net namespace. NLM uses AUTH_SYS, and I
> >> believe the NFS server can make NLM calls to the client.
> >
> > The client probably can't use the auth_sys cred on nlm callbacks in any
> > sensible way, so this may not be a big deal.
>
> I doubt anything looks at that hostname, really. My worry is that it could leak information (like the wrong hostname) onto the network.
>

Which is one reason why using the utsname()->nodename at the time of
mount is the correct thing to do.
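
A minimal sketch of what "at the time of mount" means in code, assuming
the client keeps a hostname buffer of its own (the helper and its
destination buffer are made up for illustration):

#include <linux/string.h>
#include <linux/utsname.h>

/* Sketch only: snapshot the mounting process's nodename once, so later
 * AUTH_SYS/NLM traffic carries the container's hostname rather than
 * whatever utsname() happens to resolve to later. */
static void example_capture_nodename(char *buf, size_t len)
{
	strlcpy(buf, utsname()->nodename, len);
}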

Trond

2010-09-20 19:58:08

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, Sep 20, 2010 at 03:28:00PM -0400, Chuck Lever wrote:
>
> On Sep 20, 2010, at 3:13 PM, Pavel Emelyanov wrote:
> > The nearest plan is
> >
> > 1. Prepare the sunrpc layer to work in net namespaces
> > 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> > 3. Make support for multiple instances of the nfsd caches
> > 4. Make support for multiple instances of the nfsd_serv
> >
> > After this several NFSd-s can be used in containers (hopefully I
> > didn't miss anything).
>
> Are you assuming NFSv4 only? Something needs to be done about NLM and
> NSM to make this work right.
>
> Is there an issue for idmapper and svcgssd? Probably not, but worth
> exploring.
>
> And, how about AUTH_SYS certs? These contain the host's name in them,
> and that depends on the net namespace. NLM uses AUTH_SYS, and I
> believe the NFS server can make NLM calls to the client.

The client probably can't use the auth_sys cred on nlm callbacks in any
sensible way, so this may not be a big deal.

But, yes, there are probably a lot more details like this; we'll need a
list.

--b.

2010-09-21 12:18:17

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Tue, 2010-09-21 at 11:11 +0400, Pavel Emelyanov wrote:
> > The client should be something like the following:
> >
> > 1) Ensure sunrpc sockets are created using the correct net namespace
>
> Ack
>
> > 2) Convert rpc_pipefs to be per-net namespace.
>
> Trond, I think this part should be done the other way.
>
> You see, rpc_pipefs is already a filesystem and we shouldn't
> bind it to any task-driven context. What I was thinking about
> in that direction is to make it mountable multiple times.
>
> The central issue here is how we tell rpc_get_mount() which
> vfsmount we need. Userspace will just use the per-container (i.e.
> per-chroot) instance of it, and the kernel users will work with
> the vfsmount obtained by the rpc_get_mount() call.
>
> Now, how do I plan to solve the rpc_get_mount problem. Some time ago
> there was similar problem with the devpts filesystem - people making
> ptys work per-container tried to solve the same problem and they
> ended up (with Al's help) with a yet another devpts mount option which
> explicitly stated that a new instance should be created. How do you
> think if we do the same for rpc_pipefs (a newinstance mount option) and
> add yet another mount option for its only client (nfs) telling it where
> to look for the rpc mount for (e.g. rpcmount=/var/...) ?

As long as we have some mechanism to ensure that rpc.gssd from one net
namespace doesn't try to establish a kerberos security context on behalf
of a NFS mount that resides in a different net namespace.
My point is if the rpc.gssd resides in a different net namespace, then
we have no guarantee that the IP address we pass in the upcall even
points to the same server, so we must ensure that the namespaces match.

> > 3) Convert the nfs_client and superblock to be per-net namespace
>
> Ack about the nfs_client, but as far as the superblock is
> concerned - I think we should tag only the nfs_server with
> net for the same reasons as in the item 2) above.

You should tag the nfs_server, and then make sure that nfs_compare_super
does not match something that is tagged with a different net namespace
than the current one. Otherwise, you can end up mounting the wrong NFS
server (for the same reason as above).
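
A minimal sketch of the extra test nfs_compare_super would need, with
the plumbing of the two struct net pointers (e.g. a field on nfs_server
and the mounting task's namespace) left as hypothetical:

#include <linux/types.h>
#include <net/net_namespace.h>

/* Sketch only: superblocks tagged with a different net namespace must
 * never be reused for a new mount. */
static bool nfs_sb_net_matches(const struct net *sb_net,
			       const struct net *mounter_net)
{
	return sb_net == mounter_net;
}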

> > 4) Convert lockd's struct host to be per-net namespace

Cheers
Trond

2010-09-20 19:29:00

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers


On Sep 20, 2010, at 3:13 PM, Pavel Emelyanov wrote:

> On 09/20/2010 10:04 PM, J. Bruce Fields wrote:
>> On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
>>>>> Looking forward to your feedback.
>>>>
>>>> What are you thinking of as a use-case for this?
>>>
>>> To make it possible run both NFS server and client in containers.
>>
>> Could you describe that in user-visible terms? (Currently if I create a
>> new network namespace, what happens, and what will happen differently
>> afterwards?)
>
> I believe this is not only about the network namespace. E.g. the
> nfsd filesystem is already a filesystem and shouldn't be tied to
> any task-driven context.
>
> As far as the net namespace part is concerned: first of all, the
> TCP/UDP sockets used by the transports will be per-namespace. The
> user will "feel" this, for example, through different routing and
> netfilter rules applied to connections. Besides that, the rpc
> service sockets will be per-namespace as well.
>
>>> Sure! The thing is that the full containerization of that stuff is
>>> too many patches and I'm not sure that you and other maintainers wish
>>> to review the 100-patch set in one go ;)
>>
>> Well, if it's really all ready....
>>
>> Better, though, would be an outline of the work to be done and what you
>> expect to be working at the end.
>
> The nearest plan is
>
> 1. Prepare the sunrpc layer to work in net namespaces
> 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> 3. Make support for multiple instances of the nfsd caches
> 4. Make support for multiple instances of the nfsd_serv
>
> After this several NFSd-s can be used in containers (hopefully I
> didn't miss anything).

Are you assuming NFSv4 only? Something needs to be done about NLM and NSM to make this work right.

Is there an issue for idmapper and svcgssd? Probably not, but worth exploring.

And, how about AUTH_SYS certs? These contain the host's name in them, and that depends on the net namespace. NLM uses AUTH_SYS, and I believe the NFS server can make NLM calls to the client.

> Plans about the nfs client are much more obscure for now.
>
>>> I want to find out what git tree to hack on and prepare small patch
>>> sets making things step-by-step. This one is just the first in a row.
>>
>> For the server side you can use
>>
>> git://linux-nfs.org/~bfields/linux.git nfsd-next
>>
>> though generally the latest upstream will likely work as well.
>
> OK, I will re-base the set onto the nfsd-next then.
>
> >> On a quick skim, those patches look fine (and broke up nicely for
>> review, thanks). My main concern is just being sure I understand where
>> this all ends up.
>
> Well, as far as I know the nfsd and sunrpc code, the plan described
> above should cover most of the work needed to containerize nfsd.
>
>> --b.

--
chuck[dot]lever[at]oracle[dot]com





2010-09-20 18:54:54

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 7/9] sunrpc: The per-net skeleton

On 09/20/2010 09:19 PM, J. Bruce Fields wrote:
> On Wed, Sep 15, 2010 at 04:28:25PM +0400, Pavel Emelyanov wrote:
>> @@ -38,18 +58,26 @@ init_sunrpc(void)
>> err = rpcauth_init_module();
>> if (err)
>> goto out3;
>> +
>> + cache_initialize();
>> +
>> + err = register_pernet_subsys(&sunrpc_net_ops);
>> + if (err)
>> + goto out4;
>> #ifdef RPC_DEBUG
>> rpc_register_sysctl();
>> #endif
>> #ifdef CONFIG_PROC_FS
>> rpc_proc_init();
>> #endif
>> - cache_initialize();
>> cache_register(&ip_map_cache);
>> cache_register(&unix_gid_cache);
>> svc_init_xprt_sock(); /* svc sock transport */
>> init_socket_xprt(); /* clnt sock transport */
>> return 0;
>> +
>> +out4:
>> + unregister_pernet_subsys(&sunrpc_net_ops);
>
> If register_pernet_subsys() failed, then shouldn't this be unnecessary?
> Maybe this should be rpcauth_remove_module()?

Ouch! Of course you're right here... Will fix.
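
Something like this, i.e. undoing the step that actually succeeded
(sketch only; the surrounding code stays as in the hunk being
discussed):

	err = register_pernet_subsys(&sunrpc_net_ops);
	if (err)
		goto out4;
	/* ... rest of init_sunrpc() as before ... */
	return 0;
out4:
	rpcauth_remove_module();	/* undo rpcauth_init_module() */
out3:
	/* earlier cleanup, unchanged */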

> --b.

2010-09-20 21:45:05

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, Sep 20, 2010 at 05:36:06PM -0400, Trond Myklebust wrote:
> On Mon, 2010-09-20 at 16:09 -0400, J. Bruce Fields wrote:
> > For the client, that's initially the net namespace of the mount? (What
> > about submounts?)
>
> It is the net namespace of the process that does the mount, yes.
>
> > > 2) Convert rpc_pipefs to be per-net namespace.
> > > 3) Convert the nfs_client and superblock to be per-net namespace
> > > 4) Convert lockd's struct host to be per-net namespace
> >
> > What do we expect behavior to actually look like from the point of view
> > of somebody on the client?
> >
> > I'd like to see someone write some kind of spec for how this should all
> > work. That worries me a lot more than the code.....
>
> I think it is fairly obvious what should happen once you are in a net
> namespace jail: you want all future NFS mounts to confine themselves to
> that private net namespace. i.e. they must talk to the portmapper,
> rpc.statd, and rpc.gssd that are defined on that net namespace, and they
> must confine themselves to that net namespace when talking to servers.
>
> The problem is dealing with clone() and unshare() (i.e. the process of
> changing net namespaces).
> If the resulting container inherits an NFS mountpoint from its parent
> process, then I cannot see how we could sanely migrate that to a new net
> namespace, since the super block etc remains shared between the two
> containers as part of the mount namespaces. To avoid confusion, I
> believe we need to ensure that under-the-cover mounts etc inherit the
> same net namespace as the original mount, and they should talk to the
> portmapper, rpc.statd and rpc.gssd that the original mount uses.

OK, that sounds right to me.

--b.

> If
> those die, then too bad - that's operator error.

2010-10-08 17:06:23

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/9] sunrpc: Start making sunrpc work in containers

On Mon, 2010-09-20 at 16:05 -0400, Trond Myklebust wrote:
> On Mon, 2010-09-20 at 23:13 +0400, Pavel Emelyanov wrote:
> > On 09/20/2010 10:04 PM, J. Bruce Fields wrote:
> > > On Mon, Sep 20, 2010 at 08:33:42PM +0400, Pavel Emelyanov wrote:
> > >>>> Looking forward to your feedback.
> > >>>
> > >>> What are you thinking of as a use-case for this?
> > >>
> > >> To make it possible run both NFS server and client in containers.
> > >
> > > Could you describe that in user-visible terms? (Currently if I create a
> > > new network namespace, what happens, and what will happen differently
> > > afterwards?)
> >
> > I believe this is not only about the network namespace. E.g. the
> > nfsd filesystem is already a filesystem and shouldn't be tied to
> > any task-driven context.
> >
> > As far as the net namespace part is concerned: first of all, the
> > TCP/UDP sockets used by the transports will be per-namespace. The
> > user will "feel" this, for example, through different routing and
> > netfilter rules applied to connections. Besides that, the rpc
> > service sockets will be per-namespace as well.
> >
> > >> Sure! The thing is that the full containerization of that stuff is
> > >> too many patches and I'm not sure that you and other maintainers wish
> > >> to review the 100-patch set in one go ;)
> > >
> > > Well, if it's really all ready....
> > >
> > > Better, though, would be an outline of the work to be done and what you
> > > expect to be working at the end.
> >
> > The nearest plan is
> >
> > 1. Prepare the sunrpc layer to work in net namespaces
> > 2. Make rpcpipefs and nfsd filesystems be mountable multiple times
> > 3. Make support for multiple instances of the nfsd caches
> > 4. Make support for multiple instances of the nfsd_serv
> >
> > After this several NFSd-s can be used in containers (hopefully I
> > didn't miss anything).
> >
> > Plans about the nfs client are much more obscure for now.
>
> The client should be something like the following:
>
> 1) Ensure sunrpc sockets are created using the correct net namespace
> 2) Convert rpc_pipefs to be per-net namespace.
> 3) Convert the nfs_client and superblock to be per-net namespace
> 4) Convert lockd's struct host to be per-net namespace

Actually, there is one more task that needs to be added to the above
list. We need to figure out what to do with the keyring upcalls.

The keyring upcalls are currently initiated through the same mechanism
as module_request and therefore get started with the init_nsproxy
namespace. We'd really like them to run inside the same container as the
process.
As part of the same problem, there is the issue of what to do with the
dns resolver and Bryan's new keyring based idmapper code.

Cheers
Trond