Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace
From: Ian Kent
To: Benjamin Coddington
Cc: "J. Bruce Fields", "Eric W. Biederman", David Howells,
    Kernel Mailing List, Oleg Nesterov, Trond Myklebust, Al Viro,
    Jeff Layton
Date: Tue, 24 Feb 2015 16:01:48 +0800

On Mon, 2015-02-23 at 17:22 -0800, Benjamin Coddington wrote:
> On Tue, 24 Feb 2015, Ian Kent wrote:
>
> > On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> > > On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > > > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > > > "J. Bruce Fields" writes:
> > > > > >
> > > > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > > > > >
> > > > > > >> The case of nfsd state-recovery might be similar but you'll
> > > > > > >> need to help me out a bit with that too.
> > > > > > >
> > > > > > > Each network namespace can have its own virtual nfs server.
> > > > > > > Servers can be started and stopped independently per network
> > > > > > > namespace. We decide which server should handle an incoming
> > > > > > > rpc by looking at the network namespace associated with the
> > > > > > > socket that it arrived over.
> > > > > > >
> > > > > > > A server is started by the rpc.nfsd command writing a value
> > > > > > > into a magic file somewhere.
> > > > > >
> > > > > > nit. Unless I am completely turned around that file is on the
> > > > > > nfsd filesystem, which lives in fs/nfsd/nfsctl.c.
> > > > > >
> > > > > > So I believe this really is a case of figuring out what we want
> > > > > > the semantics to be for mount and propagating the information
> > > > > > down from mount to where we call the user mode helpers.
> > > > >
> > > > > Oops, I agree. So when I said:
> > > > >
> > > > >     The upcalls need to happen consistently in one context for a
> > > > >     given virtual nfs server, and that context should probably be
> > > > >     derived from rpc.nfsd's somehow.
> > > > >
> > > > > Instead of "rpc.nfsd's", I think I should have said "the mounter
> > > > > of the nfsd filesystem".
> > > > >
> > > > > Which is already how we choose a net namespace: nfsd_mount and
> > > > > nfsd_fill_super store the current net namespace in s_fs_info.
> > > > > (And then grep for "netns" to see the places where that's used.)
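A condensed sketch of the pattern Bruce refers to, loosely based on what
fs/nfsd/nfsctl.c did around this time. It is a simplification, not the
verbatim kernel source; the nfsd_files table is reduced to a stub here:

```c
#include <linux/fs.h>
#include <linux/nsproxy.h>
#include <linux/sched.h>
#include <net/net_namespace.h>

/* Stub standing in for the real table of nfsd control files. */
static struct tree_descr nfsd_files[] = { {""} };

/* The mounting task's net namespace arrives as 'data' from mount_ns()
 * and is pinned on the superblock for everything that comes later. */
static int nfsd_fill_super(struct super_block *sb, void *data, int silent)
{
	struct net *net = data;
	int ret;

	ret = simple_fill_super(sb, 0x6e667364 /* "nfsd" */, nfsd_files);
	if (ret)
		return ret;
	sb->s_fs_info = get_net(net);
	return 0;
}

static struct dentry *nfsd_mount(struct file_system_type *fs_type,
				 int flags, const char *dev_name, void *data)
{
	/* mount_ns() keys superblocks by namespace, so each network
	 * namespace gets its own nfsd superblock. */
	return mount_ns(fs_type, flags, current->nsproxy->net_ns,
			nfsd_fill_super);
}
```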
> > > >
> > > > This is going to be mostly a restatement of what's already been
> > > > said, partly for me to refer back to later and partly to clarify
> > > > and confirm what I need to do, so prepare to be bored.
> > > >
> > > > As a result of Oleg's recommendations and comments, the next
> > > > version of the series will take a reference to an nsproxy and a
> > > > user namespace (from the init process of the calling task, while
> > > > it's still a child of that task); it won't carry around task
> > > > structs. There are still a couple of questions with this so it's
> > > > not quite there yet.
> > > >
> > > > We'll have to wait and see if what I've done is enough to remedy
> > > > Oleg's concerns too. LOL, and then there's the question of how
> > > > much I'll need to do to get it to actually work.
> > > >
> > > > The other difference is that obtaining the context (now an
> > > > nsproxy and a user namespace) is done entirely within the
> > > > usermode helper. I think that's a good thing for the calling
> > > > process isolation requirement. That may need to change again
> > > > based on the discussion here.
> > > >
> > > > Now that we're starting to look at actual usage, it's worth
> > > > keeping in mind that how to execute within the required
> > > > namespaces has to be sound before we tackle use cases that have
> > > > requirements over this fundamental functionality.
> > > >
> > > > There are a couple of things to think about.
> > > >
> > > > One thing that's needed is how to work out if the UMH_USE_NS
> > > > flag is needed, and another is how to provide persistent usage
> > > > of particular namespaces across containers. The latter will
> > > > probably relate to the origin of the file system (which looks
> > > > like it will be identified at mount time).
> > > >
> > > > The first case is when the mount originates in the root init
> > > > namespace; most of the time (if not all the time) UMH_USE_NS
> > > > doesn't need to be set and the helper should run in the root
> > > > init namespace.
> > >
> > > The helper always runs in the original mount's container.
> > > Sometimes that container is the init container, yes, but I don't
> > > see what value there is in setting a flag in that one case.
> >
> > Yep, that's pretty much what I meant.
> >
> > > > That should work for mount propagation as well with mounts
> > > > bound into a container.
> > > >
> > > > Is this also true for automounted mounts at mount point
> > > > crossing? Or perhaps I should ask, should automounted NFS
> > > > mounts inherit the property from their parent mount?
> > >
> > > Yes. If we run separate helpers in each container, then the
> > > superblocks should also be separate (so that one container can't
> > > poison cached values used by another). So the containers would
> > > all end up with entirely separate superblocks for the submounts.
> >
> > That's almost what I was thinking.
> >
> > The question relates to a mount for which the namespace proxy would
> > have been set at mount time in a container and then bound into
> > another container (in Docker by using "--volumes-from <container>").
> > I believe the namespace information from the original mount should
> > always be used when calling a usermode helper. This might not be a
> > sensible question now but I think it needs to be considered.
> >
> > >
> > > That seems inefficient at least, and I don't think it's what an
> > > admin would expect as the default behavior.
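The revised series Ian mentions wasn't posted in this thread, so the
following is only a rough sketch of the capture step he describes:
take references on the nsproxy and the user namespace of the caller's
init process rather than holding task_structs. The names umh_ns_info
and umh_capture_ns() are invented for illustration:

```c
#include <linux/cred.h>
#include <linux/nsproxy.h>
#include <linux/pid_namespace.h>
#include <linux/sched.h>
#include <linux/user_namespace.h>

/* Invented for illustration: the namespace context a helper would
 * later be spawned in, pinned by reference counts instead of by
 * holding a task_struct. */
struct umh_ns_info {
	struct nsproxy *nsproxy;
	struct user_namespace *user_ns;
};

static int umh_capture_ns(struct umh_ns_info *info)
{
	struct task_struct *init_tsk;

	/* child_reaper is the init process of the caller's pid
	 * namespace - the "requesting task's init" of the subject. */
	rcu_read_lock();
	init_tsk = task_active_pid_ns(current)->child_reaper;
	get_task_struct(init_tsk);
	rcu_read_unlock();

	task_lock(init_tsk);
	info->nsproxy = init_tsk->nsproxy;
	if (info->nsproxy)
		get_nsproxy(info->nsproxy);
	task_unlock(init_tsk);

	if (!info->nsproxy) {
		/* init is already exiting */
		put_task_struct(init_tsk);
		return -ESRCH;
	}

	rcu_read_lock();
	info->user_ns = get_user_ns(__task_cred(init_tsk)->user_ns);
	rcu_read_unlock();

	put_task_struct(init_tsk);
	return 0;
}
```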
> >
> > LOL, but the best way to manage this is to set the namespace
> > information at mount time (as Eric mentioned long ago) and use that
> > everywhere. It's consistent and it provides a way for a process
> > with appropriate privilege to specify the namespace information.
> >
> > > > The second case is when the mount originates in another
> > > > namespace, possibly a container. TBH I haven't thought too much
> > > > about mounts that originate from namespaces created by
> > > > unshare(1) or some other source yet. I'm hoping that will just
> > > > work once this is done, ;)
> > >
> > > So, one container mounts and spawns a "subcontainer" which
> > > continues to use that filesystem? Yes, I think helpers should
> > > continue to run in the container of the original mount, I don't
> > > see any tricky exception here.
> >
> > That's what I think should happen too.
> >
> > > > The last time I tried binding NFS mounts from one container
> > > > into another it didn't work,
> > >
> > > I'm not sure what you mean by "binding NFS mounts from one
> > > container into another". What exactly didn't work?
> >
> > It's the volumes-from Docker option I'm thinking of.
> > I'm not sure now if my statement is accurate. I'll need to test it
> > again. I thought I had, but what didn't work with volumes-from
> > might have been autofs, not NFS mounts.
> >
> > Anyway, I'm going to need to provide a way for clients to say
> > "calculate the namespace information and give me an identifier so
> > it can be used everywhere for this mount", which amounts to
> > maintaining a list of the namespace objects.
>
> That sounds a lot closer to some of the work I've been doing to see
> if I can come up with a way to solve the "where's the namespace I
> need?" problem.
>
> I agree with Greg's very early comments that the easiest way to
> determine which namespace context a process should use is to keep a
> copy of it on the task -- and the place that copy should be made is
> fork(). The problem was where to keep that information and how to
> make it reusable.
>
> I've been hacking out a keyrings-based "key-agent" service that is
> basically a special type of key (like a keyring). A key_agent type
> roughly corresponds to a particular type of upcall user, such as the
> idmapper. A key_agent_type is registered, and that registration ties
> a particular key_type to that key_agent. When a process calls
> request_key() for that key_type, instead of using the helper to
> execute /sbin/request-key, the process' keyrings are searched for a
> key_agent. If a key_agent isn't found, the key_agent provider is then
> asked to provide an existing one based on some rules (is there an
> existing key_agent running in a different namespace that we might
> want to use for this purpose -- for example, is there one already
> running in the namespace where the mount occurred). If so, it is
> linked to the calling process' keyrings and then used for the upcall.
> If not, then the calling process itself is forked/execve-ed into a
> new persistent key_agent that is installed on the calling process'
> keyrings just like a key, and with the same lifetime and GC
> expectations as a key.
>
> A key_agent is a user-space process waiting for a realtime signal to
> process a particular key and provide the requested key information,
> which can be installed back onto the calling process' keyrings.
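No key_agent code has been posted, so the following is a purely
hypothetical shape for the registration Ben describes; no such API
exists upstream, and every name in it (key_agent_type,
register_key_agent_type(), and so on) is invented for illustration:

```c
#include <linux/key.h>
#include <linux/key-type.h>
#include <linux/list.h>

/* Hypothetical: one registration per key_type that wants agent-based
 * upcalls instead of the /sbin/request-key usermode helper. */
struct key_agent_type {
	struct list_head link;
	/* the key_type whose upcalls this agent class services,
	 * e.g. "id_resolver" for the NFS idmapper */
	const char *key_type_name;
	/* find (or create) an agent suitable for the caller; invoked
	 * when no key_agent is already linked on the caller's keyrings */
	struct key *(*provide)(struct key_type *type,
			       const char *description);
};

/* request_key() for a matching key_type would then:
 *   1. search the caller's keyrings for a linked key_agent;
 *   2. if none, call ->provide() to locate or spawn one (possibly in
 *      the namespace where the mount occurred);
 *   3. link the agent to the caller's keyrings and signal the agent
 *      process (a realtime signal, in Ben's prototype) to construct
 *      the key. */
int register_key_agent_type(struct key_agent_type *agent);
void unregister_key_agent_type(struct key_agent_type *agent);
```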
>
> Basically, this approach allows a particular user of a keyrings-based
> upcall to specify their own rules about how to provide a namespace
> context for a calling process. It does, however, require extra work
> to create a specific key_agent_type for each individual key_type that
> might want to use this mechanism.
>
> I've been waiting to have a bit more of a proof-of-concept before
> bringing this approach into the discussion. However, it looks like it
> may be important to allow particular users of the upcall their own
> rules about which namespace contexts they might want to use. This
> approach could provide that flexibility.

I was wondering if what you've been doing would help. This does sound
interesting, perhaps I should wait a little before doing much more in
case it can be generalized a little and used here too.

It's likely the current limited implementation I have will also be
useful for upcalls that need a straight "just execute me in the caller
namespace", so it's probably worth continuing it for that case.

> Ben
>
> >
> > I'm not sure yet if I should undo some of what I've done recently
> > or leave it for users who need a straight "execute me now within
> > the current namespace".
> >
> > >
> > > --b.
> > >
> > > > but if we assume that will work at some point then, as Bruce
> > > > points out, we need to provide the ability to record the
> > > > namespaces to be used for subsequent "in namespace" execution
> > > > while maintaining caller isolation (i.e. derived from the
> > > > caller's init process).
> > > >
> > > > I've been aware of the need for persistence for a while now and
> > > > I've been thinking about how to do it, but I don't have a clear
> > > > plan quite yet. Bruce, having noticed this, has described
> > > > details about the environment I have to work with, so that's a
> > > > start. I need the thoughts of others on this too.
> > > >
> > > > As a result I'm not sure yet if this persistence can be
> > > > integrated into the current implementation or if additional
> > > > calls will be needed to set and clear the namespace information
> > > > while maintaining the needed isolation.
> > > >
> > > > As Bruce says, perhaps the namespace information should be
> > > > saved as properties of a mount, or perhaps it should be a list
> > > > keyed by some handle, the handle being the saved property. I'm
> > > > not sure yet, but the latter might be unnecessary complication
> > > > and overhead. The cleanup of the namespace information upon
> > > > summary termination of processes could be a bit difficult, but
> > > > perhaps it will be as simple as making it a function of freeing
> > > > the object it's stored in (in the cases we have so far that
> > > > would be the mount).
> > > >
> > > > So, yes, I've still got a fair way to go yet, ;)
> > > >
> > > > Ian
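On the persistence question Ian raises above (namespace information
stored as a property of the mount, cleaned up when the owning object is
freed), a minimal refcounted container might look like the sketch
below. It assumes nothing from the actual patches: umh_ns_context and
its helpers are invented names, building on the umh_ns_info capture
sketched earlier:

```c
#include <linux/kref.h>
#include <linux/nsproxy.h>
#include <linux/slab.h>
#include <linux/user_namespace.h>

/* Illustrative only: namespace context pinned for the lifetime of
 * the mount that owns it. */
struct umh_ns_context {
	struct kref ref;
	struct nsproxy *nsproxy;
	struct user_namespace *user_ns;
};

static void umh_ns_context_release(struct kref *ref)
{
	struct umh_ns_context *ctx =
		container_of(ref, struct umh_ns_context, ref);

	put_nsproxy(ctx->nsproxy);
	put_user_ns(ctx->user_ns);
	kfree(ctx);
}

/* Called from the owning filesystem's teardown path (kill_sb() or
 * similar), so cleanup is simply a function of freeing the object
 * the context is stored in, as Ian suggests. */
static void umh_ns_context_put(struct umh_ns_context *ctx)
{
	if (ctx)
		kref_put(&ctx->ref, umh_ns_context_release);
}
```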