2011-10-05 15:02:38

by J. Bruce Fields

Subject: network-namespace-aware nfsd

This is a draft outline of what we'd need to support containerized nfs
service; please tell me what I've got wrong.

The goal is to give the impression of running multiple virtual nfs
services, each with its own ip address or addresses.

A new nfs service will be started by forking off a new network
namespace, setting up interfaces there, and then starting nfs service
normally (including starting all the appropriate userland daemons, such
as rpc.mountd).

This requires no changes to existing userland code. Instead, the kernel
side of each userland interface needs to be made aware of the network
namespace of the userland process it is talking to.

The kernel handles requests using a pool of threads, with the number of
threads controlled by writing to the "threads" file in the "nfsd"
filesystem. The files are also used to start the server (and to stop
it, by writing zero for the number of threads).

To conserve memory, I would prefer to have all of the virtual servers
share the same threads, rather than dedicating a separate set of threads
to each network namespace. So:

Minimum functionality
---------------------

To get something minimal working, we need the rpc work that's in
progress.

In addition, we need the nfsd/threads interface to remember the value
set for each network namespace. Writing to it will adjust the number of
threads, probably to the maximum value across all namespaces.

In addition, when the per-namespace value changes from zero to nonzero
or vice-versa, we need to trigger, respectively, starting or stopping
the per-namespace virtual server. That means setting up or shutting
down sockets, and initializing or destroying any per-namespace state (as
required depending on NFS version, see below).

Also, nfsd/pool_threads probably needs similar treatment.
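
To make that concrete, here is a rough sketch of the bookkeeping I have
in mind; struct nfsd_net and the nfsd_* helpers are hypothetical, while
net_generic() and svc_set_num_threads() are the existing primitives:

	struct nfsd_net {
		int nr_threads;	/* last value written to nfsd/threads here */
	};

	static int nfsd_net_id;	/* slot from register_pernet_subsys() */

	/* Called on a write to nfsd/threads from namespace @net. */
	static int nfsd_set_threads(struct net *net, int nrservs)
	{
		struct nfsd_net *nn = net_generic(net, nfsd_net_id);
		int old = nn->nr_threads;

		nn->nr_threads = nrservs;

		if (old == 0 && nrservs > 0)
			nfsd_startup_net(net);	/* sockets, per-netns state */
		else if (old > 0 && nrservs == 0)
			nfsd_shutdown_net(net);

		/* size the shared pool to the maximum across namespaces */
		return svc_set_num_threads(nfsd_serv, NULL,
					   nfsd_max_threads_across_ns());
	}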

The nfsd/ports interface allows setting up listening sockets by hand. I
suspect it needs at most trivial changes.

NFSv4
-----

To make NFSv4 work, we need per-network-namespace state that is
initialized and destroyed on startup and shutdown of a virtual nfs
server. Each client therefore needs to be associated with a network
namespace, so it can be shut down at the right time, and so that we
consistently handle, for example, a broken NFSv4.0 client that sends the
same long-form identifier to servers with different IP addresses.
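
As a sketch of that association: assuming each nfs4_client grows a "net"
field (hypothetical for now), shutting down one virtual server could walk
the client list and expire only that namespace's clients, something like:

	static void nfs4_state_shutdown_net(struct net *net)
	{
		struct nfs4_client *clp, *next;

		list_for_each_entry_safe(clp, next, &client_lru, cl_lru)
			if (clp->net == net)
				expire_client(clp);
	}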

For 4.1 we have the option of sharing state between servers if we'd
like. Initially simplest is to advertise the servers as entirely
distinct, without the ability to share any state.

The directory used for recovery data needs to be per-network-namespace.
If we replace it by something else, we'll need to make sure it's
namespace-aware.

NFSv2/v3
--------

For v2/v3 locking to work we also need per-network-namespace lockd and
statd state.

Note that there is a separate loopback interface per network namespace,
so the kernel can communicate separately with the statd instances in different
namespaces. (statd communicates with the kernel over the loopback
interface).

krb5
----

Different servers likely want different kerberos identities. To make
this work we need separate auth.rpcsec.context and auth.rpcsec.init
caches for each network namespace.

Independent export trees
------------------------

If we want to allow, for example, different filesystems to be exported
from different virtual servers, then we need per-namespace nfsd.export,
expkey, and auth.unix.ip caches.

Caches in general
-----------------

To containerize the /proc/net/rpc/* interfaces (as needed for the krb5 and
independent-export-trees cases above), we need the content, channel, and flush files
to all be network-namespace-aware, so we want entirely separate caches
for each namespace.

I'm not sure whether that's best done by having lookups done in each
namespace get entirely different inodes, or whether the underlying
inodes should be shared and net/sunrpc/cache.c:cache_open() should
switch caches based on the network namespace of the opener.
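
Either way, the per-namespace cache instances themselves would probably
hang off the sunrpc per-net state, roughly as below; the ip_map_cache
slot in struct sunrpc_net, the cache template, and the
cache_create_net()/cache_destroy_net() helpers are all assumptions, not
existing API:

	static __net_init int rpc_cache_init_net(struct net *net)
	{
		struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);

		sn->ip_map_cache = cache_create_net(&ip_map_cache_template, net);
		if (IS_ERR(sn->ip_map_cache))
			return PTR_ERR(sn->ip_map_cache);
		return 0;
	}

	static __net_exit void rpc_cache_exit_net(struct net *net)
	{
		struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);

		cache_destroy_net(sn->ip_map_cache, net);
	}

	static struct pernet_operations rpc_cache_net_ops = {
		.init = rpc_cache_init_net,
		.exit = rpc_cache_exit_net,
	};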

Maybe some day
--------------

Not urgent, but possibly should be made namespace-aware some day:

- leasetime, gracetime: per-netns ideal but not
  required? Probably more useful for gracetime.

- unlock_ip: should be per-netns, maybe, low priority.

- unlock_fs: should be per-fsns, maybe, ignore for now.

- nfs4.idtoname, nfs4.nametoid: could be per-netns, or would
  they need to be per-uidns?

- we could allow turning on nfs versions per-netns, but for now
  that seems unnecessary.

- maxblksize: ditto. Keep it global, or take the maximum across
  values given in each netns.

Should be non-issues:

- export_features, supported_enctypes: global, nothing
  to do.

- filehandle: path->filehandle mapping should already be
  per-fs, hopefully no changes required.

- auth.unix.gid: keep global for now.


2011-10-05 17:27:12

by Stanislav Kinsbursky

Subject: Re: network-namespace-aware nfsd

05.10.2011 19:02, J. Bruce Fields wrote:
> This is a draft outline of what we'd need to support containerized nfs
> service; please tell me what I've got wrong.
>
> The goal is to give the impression of running multiple virtual nfs
> services, each with its own ip address or addresses.
>
> A new nfs service will be started by forking off a new network
> namespace, setting up interfaces there, and then starting nfs service
> normally (including starting all the appropriate userland daemons, such
> as rpc.mountd).
>

Hello, Bruce.
What do you mean by "nfs service will be started by forking off a new network
namespace"?
Does it mean that each nfs service start will create a new network namespace?
If so, what happens if some process in a freshly created namespace starts the nfs
service?

> This requires no changes to existing userland code. Instead, the kernel
> side of each userland interface needs to be made aware of the network
> namespace of the userland process it is talking to.
>
> The kernel handles requests using a pool of threads, with the number of
> threads controlled by writing to the "threads" file in the "nfsd"
> filesystem. The files are also used to start the server (and to stop
> it, by writing zero for the number of threads).
>
> To conserve memory, I would prefer to have all of the virtual servers
> share the same threads, rather than dedicating a separate set of threads
> to each network namespace. So:
>
> Minimum functionality
> ---------------------
>
> To get something minimal working, we need the rpc work that's in
> progress.
>
> In addition, we need the nfsd/threads interface to remember the value
> set for each network namespace. Writing to it will adjust the number of
> threads, probably to the maximum value across all namespaces.
>
> In addition, when the per-namespace value changes from zero to nonzero
> or vice-versa, we need to trigger, respectively, starting or stopping
> the per-namespace virtual server. That means setting up or shutting
> down sockets, and initializing or destroying any per-namespace state (as
> required depending on NFS version, see below).
>
> Also, nfsd/pool_threads probably needs similar treatment.
>
> The nfsd/ports interface allows setting up listening sockets by hand. I
> suspect it needs at most trivial changes.
>

If I understood you right, you want to share nfsd threads between environments,
and any of these threads can handle requests for different environments.
Am I right?
If not, then what is the difference from separate nfs servers?
If yes, then won't we get problems handling requests to container files
with a changed root?

And what about the versions file? If we share all the kernel threads, doesn't that
mean we can't tune the supported versions per network namespace?

> NFSv4
> -----
>
> To make NFSv4 work, we need per-network-namespace state that is
> initialized and destroyed on startup and shutdown of a virtual nfs
> server. Each client therefore needs to be associated with a network
> namespace, so it can be shut down at the right time, and so that we
> consistently handle, for example, a broken NFSv4.0 client that sends the
> same long-form identifier to servers with different IP addresses.
>
> For 4.1 we have the option of sharing state between servers if we'd
> like. Initially simplest is to advertise the servers as entirely
> distinct, without the ability to share any state.
>
> The directory used for recovery data needs to be per-network-namespace.
> If we replace it by something else, we'll need to make sure it's
> namespace-aware.
>
> NFSv2/v3
> --------
>
> For v2/v3 locking to work we also need per-network-namespace lockd and
> statd state.
>

What do you think about the lockd kernel thread?
I mean, do you want to share one thread for all network namespaces or create one
thread per network namespace?

> Note that there is a separate loopback interface per network namespace,
> so the kernel can communicate separately with the statd instances in different
> namespaces. (statd communicates with the kernel over the loopback
> interface).
>
> krb5
> ----
>
> Different servers likely want different kerberos identities. To make
> this work we need separate auth.rpcsec.context and auth.rpcsec.init
> caches for each network namespace.
>
> Independent export trees
> ------------------------
>
> If we want to allow, for example, different filesystems to be exported
> from different virtual servers, then we need per-namespace nfsd.export,
> expkey, and auth.unix.ip caches.
>
> Caches in general
> -----------------
>
> To containerize the /proc/net/rpc/* interfaces (as needed for the krb5 and
> independent-export-trees cases above), we need the content, channel, and flush files
> to all be network-namespace-aware, so we want entirely separate caches
> for each namespace.
>
> I'm not sure whether that's best done by having lookups done in each
> namespace get entirely different inodes, or whether the underlying
> inodes should be shared and net/sunrpc/cache.c:cache_open() should
> switch caches based on the network namespace of the opener.
>
> Maybe some day
> --------------
>
> Not urgent, but possibly should be made namespace-aware some day:
>
> - leasetime, gracetime: per-netns ideal but not
>   required? Probably more useful for gracetime.
>
> - unlock_ip: should be per-netns, maybe, low priority.
>
> - unlock_fs: should be per-fsns, maybe, ignore for now.
>
> - nfs4.idtoname, nfs4.nametoid: could be per-netns, or would
>   they need to be per-uidns?
>
> - we could allow turning on nfs versions per-netns, but for now
>   that seems unnecessary.
>
> - maxblksize: ditto. Keep it global, or take the maximum across
>   values given in each netns.
>
> Should be non-issues:
>
> - export_features, supported_enctypes: global, nothing
>   to do.
>
> - filehandle: path->filehandle mapping should already be
>   per-fs, hopefully no changes required.
>
> - auth.unix.gid: keep global for now.


--
Best regards,
Stanislav Kinsbursky

2011-10-06 12:30:07

by Pavel Emelyanov

Subject: Re: network-namespace-aware nfsd

>> Also, do you think per-namespace version support is important?
>>
>
> Actually, yes, I do.
> As I see it, the nfsd filesystem has to be virtualized to provide flexible control
> over server features. If so, then we need to virtualize the program as well.

ACK - per-namespace version control is required as well.

AFAIK it's performed via sysctls, and that part (the sysctl engine, I mean) is already
namespace-aware, so it will not be the hard part of the implementation :)

2011-10-06 16:46:33

by J. Bruce Fields

Subject: Re: network-namespace-aware nfsd

On Thu, Oct 06, 2011 at 01:59:09PM +0400, Stanislav Kinsbursky wrote:
> 05.10.2011 22:19, J. Bruce Fields wrote:
> >To start with I suspect it would be OK to share the one lockd thread.
> >
>
> Yep, I think so too. It will just be harder to implement.

Why do you think it will be harder to implement?

There may be something about how tasks and namespaces interact that I'm
missing here....

To me it seems like either way we're going to have to add the network
namespace as an argument to any data structure lookups that we're doing,
and it doesn't really matter whether we get the namespace out of the
svc_rqst or someplace else.

--b.

2011-10-07 10:19:09

by Stanislav Kinsbursky

Subject: Re: network-namespace-aware nfsd

06.10.2011 20:46, J. Bruce Fields wrote:
> On Thu, Oct 06, 2011 at 01:59:09PM +0400, Stanislav Kinsbursky wrote:
>> 05.10.2011 22:19, J. Bruce Fields wrote:
>>> To start with I suspect it would be OK to share the one lockd thread.
>>>
>>
>> Yep, I think so too. It will just be harder to implement.
>
> Why do you think it will be harder to implement?
>

Because making the lockd kthread per net ns is very easy. :)

> There may be something about how tasks and namespaces interact that I'm
> missing here....
>

The main problem, as I see it now, is creating and especially destroying the lockd
rpcbind clients (and per-ns data) on CT stop.
Right now they are destroyed on lockd kthread exit. And we can't do this
destruction in per-net operations, since those clients hold the net ns.
Thus, the nlmclnt_init()/nlmclnt_done() logic has to be significantly reworked.
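
A pernet skeleton along these lines is the easy part (struct and field
names here are made up); the catch is that .exit below only runs once
the net ns refcount drops to zero, and the rpcbind client itself is
what keeps it nonzero:

	struct lockd_net {
		unsigned int nlmsvc_users;	/* lockd_up() callers in this ns */
	};

	static int lockd_net_id;

	static __net_exit void lockd_exit_net(struct net *net)
	{
		struct lockd_net *ln = net_generic(net, lockd_net_id);

		/* never reached while an rpcbind client pins the ns */
		WARN_ON(ln->nlmsvc_users);
	}

	static struct pernet_operations lockd_net_ops = {
		.exit = lockd_exit_net,
		.id   = &lockd_net_id,
		.size = sizeof(struct lockd_net),
	};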

> To me it seems like either way we're going to have to add the network
> namespace as an argument to any data structure lookups that we're doing,
> and it doesn't really matter whether we get the namespace out of the
> svc_rqst or someplace else.
>

Yep, seems the same to me.

> --b.


--
Best regards,
Stanislav Kinsbursky

2011-10-06 13:18:11

by Stanislav Kinsbursky

Subject: Re: network-namespace-aware nfsd

06.10.2011 17:14, Pavel Emelyanov wrote:
> On 10/06/2011 05:11 PM, J. Bruce Fields wrote:
>> On Thu, Oct 06, 2011 at 04:29:51PM +0400, Pavel Emelyanov wrote:
>>>>> Also, do you think per-namespace version support is important?
>>>>>
>>>>
>>>> Actually, yes, I do.
>>>> As I see it, the nfsd filesystem has to be virtualized to provide flexible control
>>>> over server features. If so, then we need to virtualize the program as well.
>>>
>>> ACK - per-namespace version control is required as well.
>>>
>>> AFAIK it's performed via sysctls, and that part (the sysctl engine, I mean) is already
>>> namespace-aware, so it will not be the hard part of the implementation :)
>>
>> It's a special file in the nfsd filesystem. But I assume that won't be
>> a big deal either.
>
> Well, yes, you're right :)
>
>> By the way, I'm curious: as we do this virtualization step-by-step, is
>> there any way for userspace to tell how far we've gotten?
>>
>> So for example if you have a system that's configured to use some new
>> namespace-based feature, and you boot it to an old kernel, is there some
>> way for it to check at the start and say "sorry, this isn't going to
>> work"?
>
> M-m... I'd say there's no automatic way of doing this. What we can (and probably
> should) do is audit the nfs/nfsd subsystems and mark places with
>
>	if (net != &init_net)
>		return -EOPNOTSUPP;
>
> and remove these parts eventually.
>

Or just use init_net instead of the current network namespace (where possible, of
course). At least that's what I'm trying to do.

>> --b.
>>
>


--
Best regards,
Stanislav Kinsbursky

2011-10-06 13:14:34

by Pavel Emelyanov

Subject: Re: network-namespace-aware nfsd

On 10/06/2011 05:11 PM, J. Bruce Fields wrote:
> On Thu, Oct 06, 2011 at 04:29:51PM +0400, Pavel Emelyanov wrote:
>>>> Also, do you think per-namespace version support is important?
>>>>
>>>
>>> Actually, yes, I do.
>>> As I see it, the nfsd filesystem has to be virtualized to provide flexible control
>>> over server features. If so, then we need to virtualize the program as well.
>>
>> ACK - per-namespace version control is required as well.
>>
>> AFAIK it's performed via sysctls, and that part (the sysctl engine, I mean) is already
>> namespace-aware, so it will not be the hard part of the implementation :)
>
> It's a special file in the nfsd filesystem. But I assume that won't be
> a big deal either.

Well, yes, you're right :)

> By the way, I'm curious: as we do this virtualization step-by-step, is
> there any way for userspace to tell how far we've gotten?
>
> So for example if you have a system that's configured to use some new
> namespace-based feature, and you boot it to an old kernel, is there some
> way for it to check at the start and say "sorry, this isn't going to
> work"?

M-m... I'd say there's no automatic way of doing this. What we can (and probably
should) do is audit the nfs/nfsd subsystems and mark places with

	if (net != &init_net)
		return -EOPNOTSUPP;

and remove these parts eventually.

> --b.
>


2011-10-06 13:11:10

by J. Bruce Fields

Subject: Re: network-namespace-aware nfsd

On Thu, Oct 06, 2011 at 04:29:51PM +0400, Pavel Emelyanov wrote:
> >> Also, do you think per-namespace version support is important?
> >>
> >
> > Actually, yes, I do.
> > As I see it, the nfsd filesystem has to be virtualized to provide flexible control
> > over server features. If so, then we need to virtualize the program as well.
>
> ACK - per-namespace version control is required as well.
>
> AFAIK it's performed via sysctls, and that part (the sysctl engine, I mean) is already
> namespace-aware, so it will not be the hard part of the implementation :)

It's a special file in the nfsd filesystem. But I assume that won't be
a big deal either.

By the way, I'm curious: as we do this virtualization step-by-step, is
there any way for userspace to tell how far we've gotten?

So for example if you have a system that's configured to use some new
namespace-based feature, and you boot it to an old kernel, is there some
way for it to check at the start and say "sorry, this isn't going to
work"?

--b.

2011-10-06 09:59:18

by Stanislav Kinsbursky

Subject: Re: network-namespace-aware nfsd

05.10.2011 22:19, J. Bruce Fields wrote:
>
> I don't think so. Here's roughly how nfsd looks up an inode given a
> filehandle:
>
> - look up the ip address in the auth.unix.ip cache (filled by
>   rpc.mountd) and get a "struct auth_domain", which represents
>   some set of clients. (E.g., "*.example.com").
> - extract the part of the filehandle that represents the export
>   and look that up in the nfsd.fh cache (also filled by
>   rpc.mountd); result is a path, resolved to a (vfsmount,
>   dentry) in the context of rpc.mountd.
> - look up the (auth_domain, path) pair in the nfsd.export cache
>   (again filled by rpc.mountd) to get export options (ro vs rw,
>   security requirements, etc.).
>
> As long as we create per-network-namespace auth.unix.ip, nfsd.fh, and
> nfsd.export caches, and as long as nfsd does those lookups in the right
> cache (which should be easy, as it can always reach the namespace from
> rqstp->rq_xprt->xpt_net).... I think it all works. Do you see any
> problem?
>

I'm not so familiar with the NFS server code. So probably you are right and there
are no problems here at all.

>
> Similarly net/sunrpc/svc.c:svc_process_common(), where the version check
> is normally done, knows what namespace the request is associated with
> (again by looking at xpt_net), and could look up the supported versions
> per-namespace.
>
> As long as everything on the server side is passed a struct svc_rqst, I
> don't think having distinct thread pools would simplify anything.
>
> Do you think I'm missing anything?
>

I realized that probably not. At least I can't find any issues for now.

> Also, do you think per-namespace version support is important?
>

Actually, yes, I do.
As I see it, the nfsd filesystem has to be virtualized to provide flexible control
over server features. If so, then we need to virtualize the program as well.

>
> To start with I suspect it would be OK to share the one lockd thread.
>

Yep, I think so too. It will just be harder to implement.
Anyway, thanks for the comment.

> Some day I would very much like to allow lockd to be multithreaded. But
> I don't know that we'd want separate threads per namespace.
>
> --b.


--
Best regards,
Stanislav Kinsbursky

2011-10-05 18:20:01

by J. Bruce Fields

Subject: Re: network-namespace-aware nfsd

On Wed, Oct 05, 2011 at 09:26:59PM +0400, Stanislav Kinsbursky wrote:
> 05.10.2011 19:02, J. Bruce Fields wrote:
> >This is a draft outline of what we'd need to support containerized nfs
> >service; please tell me what I've got wrong.
> >
> >The goal is to give the impression of running multiple virtual nfs
> >services, each with its own ip address or addresses.
> >
> >A new nfs service will be started by forking off a new network
> >namespace, setting up interfaces there, and then starting nfs service
> >normally (including starting all the appropriate userland daemons, such
> >as rpc.mountd).
> >
>
> Hello, Bruce.
> What do you mean by "nfs service will be started by forking off a
> new network namespace"?
> Does it mean that each nfs service start will create a new network namespace?
> If so, what happens if some process in a freshly created namespace
> starts the nfs service?

Sorry, what I meant to say was: "first userspace creates a new
namespace, then a process in that new namespace uses the ordinary
interfaces to start nfsd."
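
In userspace terms, a minimal sketch of that sequence (interface setup,
starting rpc.mountd, and most error handling elided; this assumes the
nfsd filesystem is mounted at /proc/fs/nfsd):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		/* step 1: create the new network namespace */
		if (unshare(CLONE_NEWNET) < 0) {
			perror("unshare");
			return 1;
		}

		/* ... bring up interfaces, start rpc.mountd, etc. ... */

		/* step 2: start nfsd through the ordinary interface */
		FILE *f = fopen("/proc/fs/nfsd/threads", "w");
		if (!f) {
			perror("fopen");
			return 1;
		}
		fprintf(f, "8\n");	/* eight server threads */
		fclose(f);
		return 0;
	}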

> If I understood you right, you want to share nfsd threads between
> environments, and any of these threads can handle requests for
> different environments.
> Am I right?

Yes.

> If not, then what is the difference from separate nfs servers?
> If yes, then won't we get problems handling requests to
> container files with a changed root?

I don't think so. Here's roughly how nfsd looks up an inode given a
filehandle:

- look up the ip address in the auth.unix.ip cache (filled by
  rpc.mountd) and get a "struct auth_domain", which represents
  some set of clients. (E.g., "*.example.com").
- extract the part of the filehandle that represents the export
  and look that up in the nfsd.fh cache (also filled by
  rpc.mountd); result is a path, resolved to a (vfsmount,
  dentry) in the context of rpc.mountd.
- look up the (auth_domain, path) pair in the nfsd.export cache
  (again filled by rpc.mountd) to get export options (ro vs rw,
  security requirements, etc.).

As long as we create per-network-namespace auth.unix.ip, nfsd.fh, and
nfsd.export caches, and as long as nfsd does those lookups in the right
cache (which should be easy, as it can always reach the namespace from
rqstp->rq_xprt->xpt_net).... I think it all works. Do you see any
problem?
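
For example, the first of those steps might end up looking something
like this sketch (the per-net ip_map cache slot and the lookup variant
are made up, not existing code):

	static struct auth_domain *client_domain(struct svc_rqst *rqstp,
						 struct in6_addr *addr)
	{
		struct net *net = rqstp->rq_xprt->xpt_net;
		struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);

		/* the auth.unix.ip lookup, in this namespace's cache */
		return ip_map_lookup_net(sn->ip_map_cache, "nfsd", addr);
	}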

> And what about the versions file? If we share all the kernel threads,
> doesn't that mean we can't tune the supported versions per network
> namespace?

Similarly net/sunrpc/svc.c:svc_process_common(), where the version check
is normally done, knows what namespace the request is associated with
(again by looking at xpt_net), and could look up the supported versions
per-namespace.
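
I.e., something along these lines, where the per-net version bitmap is
hypothetical:

	static bool nfsd_vers_enabled(struct svc_rqst *rqstp, u32 vers)
	{
		struct net *net = rqstp->rq_xprt->xpt_net;
		struct nfsd_net *nn = net_generic(net, nfsd_net_id);

		return vers < NFSD_NRVERS && test_bit(vers, nn->version_bitmap);
	}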

As long as everything on the server side is passed a struct svc_rqst, I
don't think having distinct thread pools would simplify anything.

Do you think I'm missing anything?

Also, do you think per-namespace version support is important?

> >NFSv4
> >-----
> >
> >To make NFSv4 work, we need per-network-namespace state that is
> >initialized and destroyed on startup and shutdown of a virtual nfs
> >server. Each client therefore needs to be associated with a network
> >namespace, so it can be shut down at the right time, and so that we
> >consistently handle, for example, a broken NFSv4.0 client that sends the
> >same long-form identifier to servers with different IP addresses.
> >
> >For 4.1 we have the option of sharing state between servers if we'd
> >like. Initially simplest is to advertise the servers as entirely
> >distinct, without the ability to share any state.
> >
> >The directory used for recovery data needs to be per-network-namespace.
> >If we replace it by something else, we'll need to make sure it's
> >namespace-aware.
> >
> >NFSv2/v3
> >--------
> >
> >For v2/v3 locking to work we also need per-network-namespace lockd and
> >statd state.
> >
>
> What do you think about the lockd kernel thread?
> I mean, do you want to share one thread for all network namespaces
> or create one thread per network namespace?

To start with I suspect it would be OK to share the one lockd thread.

Some day I would very much like to allow lockd to be multithreaded. But
I don't know that we'd want separate threads per namespace.

--b.