Date: Tue, 12 May 2009 18:05:45 -0700
From: Matt Helsley <matthltc@us.ibm.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Matt Helsley <matthltc@us.ibm.com>, Containers <containers@lists.osdl.org>,
        linux-nfs@vger.kernel.org
Subject: Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
Message-ID: <20090513010545.GG3912@us.ibm.com>
References: <20090512215138.GD3912@us.ibm.com> <m1fxf97tvt.fsf@fess.ebiederm.org>
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <m1fxf97tvt.fsf@fess.ebiederm.org>
Sender: linux-nfs-owner@vger.kernel.org
MIME-Version: 1.0

On Tue, May 12, 2009 at 05:01:58PM -0700, Eric W. Biederman wrote:
> Matt Helsley <matthltc@us.ibm.com> writes:
> 
> > Sun RPC currently opens sockets from the initial network namespace making it
> > impossible to restrict which NFS servers a container may interact with.
> >
> > For example, the NFS server at 10.0.0.3 reachable from the initial namespace
> > will always be used even if an entirely different server with the address
> > 10.0.0.3 is reachable from a container's network namespace. Hence network
> > namespaces cannot be used to restrict the network access of a container as long
> > as the RPC code opens sockets using the initial network namespace. This is
> > in stark contrast to other protocols like HTTP where the sockets are created in
> > their proper namespaces because kernel threads are not used to open sockets for
> > client network IO.
> >
> > We may plausibly end up with namespaces created by:
> > I) The administrator may mount 10.0.0.3:/export_foo from init's
> > container, clone the mount namespace, and unmount from the original
> > mount namespace.
> >
> > II) The administrator may start a task which clones the mount namespace
> > before mounting 10.0.0.3:/export_foo.
> >
> > Proposed Solution:
> >
> > The network namespace of the task that did the mount best defines which server
> > the "administrator", whether in a container or not, expects to work with.
> > When the mount is done inside a container then that is the network namespace 
> > to use. When the mount is done prior to creating the container then that's the 
> > namespace that should be used.
> >
> > This allows system administrators to isolate network traffic generated by NFS
> > clients by mounting after creating a container. If partial isolation is desired
> > then the administrator may mount before creating a container with a new network
> > namespace. In each case the RPC packets would originate from a consistent
> > namespace.
> >
> > One way to ensure consistent namespace usage would be to hold a reference to
> > the original network namespace as long as the mount exists. This naturally 
> > suggests storing the network namespace reference in the NFS superblock. 
> > However, it may be better to store it with the RPC transport itself since
> > it is directly responsible for (re)opening the sockets.
> >
> > This patch adds a reference to the network namespace to the RPC
> > transport. When the NFS export is mounted the network namespace of
> > the current task establishes which namespace to reference. That
> > reference is stored in the RPC transport and used to open sockets
> > whenever a new socket is required.
> 
> Matt.  This may be the basis of something and the problem is real.
> However it is clear you have missed a lot of details.

Well crap. I did not ignore all the RPC services I noticed while
reading the NFS/RPC code. But based on the responses from Chuck, you,
and Trond, I clearly fucked up: I thought I properly understood how the
RPC code works with the services that support NFS, and I did not.

I figured that since RPC was the core of these services it would be a
good place to start trying to address the problem. It looked like the
RPC transport was a good place to deal with all of these services since
it's responsible for (re)opening the sockets needed to perform RPC IO.
But apparently the transport is not shared the way I thought it was :/

> So could you first address this problem in nfs_get_sb by 
> denying the mount if we are not in the initial network namespace.
> 
> I.e.
> 
> if (current->nsproxy->net_ns != &init_net)
> 	return -EINVAL;
> 
> That should be a lot simpler to get right and at least give reliable
> and predictable semantics.

Yes, that seems like a reasonable preventive measure for now.
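
To check that the semantics are understood correctly, here is a rough
userspace model of that check. It is only a sketch: struct task and
nfs_check_mount_netns() are made-up names for illustration, and struct net,
struct nsproxy and init_net are simplified stand-ins for the kernel's, not
the real definitions. The real test would sit in nfs_get_sb and use
current->nsproxy->net_ns as in the snippet quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (illustration only). */
struct net { int id; };
struct nsproxy { struct net *net_ns; };
struct task { struct nsproxy *nsproxy; };   /* hypothetical model of "current" */

/* Models the initial network namespace. */
static struct net init_net = { 0 };

/*
 * Hypothetical helper: returns 0 if the mounting task lives in the
 * initial network namespace, -EINVAL otherwise. Namespaces are compared
 * by pointer identity, as in the quoted snippet.
 */
static int nfs_check_mount_netns(const struct task *t)
{
	assert(t != NULL && t->nsproxy != NULL);

	if (t->nsproxy->net_ns != &init_net)
		return -EINVAL;
	return 0;
}
```

So a task whose nsproxy points at init_net would get 0 and the mount would
proceed, while a task in any other network namespace would get -EINVAL
before any RPC socket is opened.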

	-Matt