Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace
From: Ian Kent
To: Benjamin Coddington
Cc: "J. Bruce Fields", "Eric W. Biederman", David Howells,
    Kernel Mailing List, Oleg Nesterov, Trond Myklebust, Al Viro,
    Jeff Layton
Date: Tue, 24 Feb 2015 16:01:48 +0800

On Mon, 2015-02-23 at 17:22 -0800, Benjamin Coddington wrote:
> On Tue, 24 Feb 2015, Ian Kent wrote:
>
> > On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> > > On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > > > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > > > "J. Bruce Fields" writes:
> > > > > >
> > > > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > > > > >
> > > > > > >> The case of nfsd state-recovery might be similar but you'll
> > > > > > >> need to help me out a bit with that too.
> > > > > > >
> > > > > > > Each network namespace can have its own virtual nfs server.
> > > > > > > Servers can be started and stopped independently per network
> > > > > > > namespace. We decide which server should handle an incoming
> > > > > > > rpc by looking at the network namespace associated with the
> > > > > > > socket that it arrived over.
> > > > > > >
> > > > > > > A server is started by the rpc.nfsd command writing a value
> > > > > > > into a magic file somewhere.
> > > > > >
> > > > > > nit. Unless I am completely turned around that file is on the
> > > > > > nfsd filesystem, which lives in fs/nfsd/nfsctl.c.
> > > > > >
> > > > > > So I believe this really is a case of figuring out what we want
> > > > > > the semantics to be for mount and propagating the information
> > > > > > down from mount to where we call the user mode helpers.
> > > > >
> > > > > Oops, I agree. So when I said:
> > > > >
> > > > >     The upcalls need to happen consistently in one context for a
> > > > >     given virtual nfs server, and that context should probably be
> > > > >     derived from rpc.nfsd's somehow.
> > > > >
> > > > > Instead of "rpc.nfsd's", I think I should have said "the mounter
> > > > > of the nfsd filesystem".
> > > > >
> > > > > Which is already how we choose a net namespace: nfsd_mount and
> > > > > nfsd_fill_super store the current net namespace in s_fs_info.
> > > > > (And then grep for "netns" to see the places where that's used.)
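A condensed sketch of the pattern Bruce refers to, loosely based on what
fs/nfsd/nfsctl.c did around this time. It is a simplification, not the
verbatim kernel source; the nfsd_files table is reduced to a stub here:

```c
#include <linux/fs.h>
#include <linux/nsproxy.h>
#include <linux/sched.h>
#include <net/net_namespace.h>

/* Stub standing in for the real table of nfsd control files. */
static struct tree_descr nfsd_files[] = { {""} };

/* The mounting task's net namespace arrives as 'data' from mount_ns()
 * and is pinned on the superblock for everything that comes later. */
static int nfsd_fill_super(struct super_block *sb, void *data, int silent)
{
	struct net *net = data;
	int ret;

	ret = simple_fill_super(sb, 0x6e667364 /* "nfsd" */, nfsd_files);
	if (ret)
		return ret;
	sb->s_fs_info = get_net(net);
	return 0;
}

static struct dentry *nfsd_mount(struct file_system_type *fs_type,
				 int flags, const char *dev_name, void *data)
{
	/* mount_ns() keys superblocks by namespace, so each network
	 * namespace gets its own nfsd superblock. */
	return mount_ns(fs_type, flags, current->nsproxy->net_ns,
			nfsd_fill_super);
}
```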
> > > >
> > > > This is going to be mostly a restatement of what's already been
> > > > said, partly for me to refer back to later and partly to clarify
> > > > and confirm what I need to do, so prepare to be bored.
> > > >
> > > > As a result of Oleg's recommendations and comments, the next
> > > > version of the series will take a reference to an nsproxy and a
> > > > user namespace (from the init process of the calling task, while
> > > > it's still a child of that task); it won't carry around task
> > > > structs. There are still a couple of questions with this so it's
> > > > not quite there yet.
> > > >
> > > > We'll have to wait and see if what I've done is enough to remedy
> > > > Oleg's concerns too. LOL, and then there's the question of how
> > > > much I'll need to do to get it to actually work.
> > > >
> > > > The other difference is that obtaining the context (now an
> > > > nsproxy and a user namespace) is done entirely within the
> > > > usermode helper. I think that's a good thing for the calling
> > > > process isolation requirement. That may need to change again
> > > > based on the discussion here.
> > > >
> > > > Now that we're starting to look at actual usage, it's worth
> > > > keeping in mind that how to execute within the required
> > > > namespaces has to be sound before we tackle use cases that have
> > > > requirements over this fundamental functionality.
> > > >
> > > > There are a couple of things to think about.
> > > >
> > > > One thing that's needed is how to work out if the UMH_USE_NS
> > > > flag is needed, and another is how to provide persistent usage
> > > > of particular namespaces across containers. The latter will
> > > > probably relate to the origin of the file system (which looks
> > > > like it will be identified at mount time).
> > > >
> > > > The first case is when the mount originates in the root init
> > > > namespace; most of the time (if not all the time) UMH_USE_NS
> > > > doesn't need to be set and the helper should run in the root
> > > > init namespace.
> > >
> > > The helper always runs in the original mount's container.
> > > Sometimes that container is the init container, yes, but I don't
> > > see what value there is in setting a flag in that one case.
> >
> > Yep, that's pretty much what I meant.
> >
> > > > That should work for mount propagation as well with mounts
> > > > bound into a container.
> > > >
> > > > Is this also true for automounted mounts at mount point
> > > > crossing? Or perhaps I should ask, should automounted NFS
> > > > mounts inherit the property from their parent mount?
> > >
> > > Yes. If we run separate helpers in each container, then the
> > > superblocks should also be separate (so that one container can't
> > > poison cached values used by another). So the containers would
> > > all end up with entirely separate superblocks for the submounts.
> >
> > That's almost what I was thinking.
> >
> > The question relates to a mount for which the namespace proxy would
> > have been set at mount time in a container and then bound into
> > another container (in Docker by using "--volumes-from <container>").
> > I believe the namespace information from the original mount should
> > always be used when calling a usermode helper. This might not be a
> > sensible question now but I think it needs to be considered.
> >
> > >
> > > That seems inefficient at least, and I don't think it's what an
> > > admin would expect as the default behavior.
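The revised series Ian mentions wasn't posted in this thread, so the
following is only a rough sketch of the capture step he describes:
take references on the nsproxy and the user namespace of the caller's
init process rather than holding task_structs. The names umh_ns_info
and umh_capture_ns() are invented for illustration:

```c
#include <linux/cred.h>
#include <linux/nsproxy.h>
#include <linux/pid_namespace.h>
#include <linux/sched.h>
#include <linux/user_namespace.h>

/* Invented for illustration: the namespace context a helper would
 * later be spawned in, pinned by reference counts instead of by
 * holding a task_struct. */
struct umh_ns_info {
	struct nsproxy *nsproxy;
	struct user_namespace *user_ns;
};

static int umh_capture_ns(struct umh_ns_info *info)
{
	struct task_struct *init_tsk;

	/* child_reaper is the init process of the caller's pid
	 * namespace - the "requesting task's init" of the subject. */
	rcu_read_lock();
	init_tsk = task_active_pid_ns(current)->child_reaper;
	get_task_struct(init_tsk);
	rcu_read_unlock();

	task_lock(init_tsk);
	info->nsproxy = init_tsk->nsproxy;
	if (info->nsproxy)
		get_nsproxy(info->nsproxy);
	task_unlock(init_tsk);

	if (!info->nsproxy) {
		/* init is already exiting */
		put_task_struct(init_tsk);
		return -ESRCH;
	}

	rcu_read_lock();
	info->user_ns = get_user_ns(__task_cred(init_tsk)->user_ns);
	rcu_read_unlock();

	put_task_struct(init_tsk);
	return 0;
}
```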
> >
> > LOL, but the best way to manage this is to set the namespace
> > information at mount time (as Eric mentioned long ago) and use that
> > everywhere. It's consistent and it provides a way for a process
> > with appropriate privilege to specify the namespace information.
> >
> > > > The second case is when the mount originates in another
> > > > namespace, possibly a container. TBH I haven't thought too much
> > > > about mounts that originate from namespaces created by
> > > > unshare(1) or some other source yet. I'm hoping that will just
> > > > work once this is done, ;)
> > >
> > > So, one container mounts and spawns a "subcontainer" which
> > > continues to use that filesystem? Yes, I think helpers should
> > > continue to run in the container of the original mount, I don't
> > > see any tricky exception here.
> >
> > That's what I think should happen too.
> >
> > > > The last time I tried binding NFS mounts from one container
> > > > into another it didn't work,
> > >
> > > I'm not sure what you mean by "binding NFS mounts from one
> > > container into another". What exactly didn't work?
> >
> > It's the volumes-from Docker option I'm thinking of.
> > I'm not sure now if my statement is accurate. I'll need to test it
> > again. I thought I had, but what didn't work with volumes-from
> > might have been autofs, not NFS mounts.
> >
> > Anyway, I'm going to need to provide a way for clients to say
> > "calculate the namespace information and give me an identifier so
> > it can be used everywhere for this mount", which amounts to
> > maintaining a list of the namespace objects.
>
> That sounds a lot closer to some of the work I've been doing to see
> if I can come up with a way to solve the "where's the namespace I
> need?" problem.
>
> I agree with Greg's very early comments that the easiest way to
> determine which namespace context a process should use is to keep a
> copy of it on the task -- and the place that copy should be made is
> fork(). The problem was where to keep that information and how to
> make it reusable.
>
> I've been hacking out a keyrings-based "key-agent" service that is
> basically a special type of key (like a keyring). A key_agent type
> roughly corresponds to a particular type of upcall user, such as the
> idmapper. A key_agent_type is registered, and that registration ties
> a particular key_type to that key_agent. When a process calls
> request_key() for that key_type, instead of using the helper to
> execute /sbin/request-key, the process' keyrings are searched for a
> key_agent. If a key_agent isn't found, the key_agent provider is then
> asked to provide an existing one based on some rules (is there an
> existing key_agent running in a different namespace that we might
> want to use for this purpose -- for example, is there one already
> running in the namespace where the mount occurred). If so, it is
> linked to the calling process' keyrings and then used for the upcall.
> If not, then the calling process itself is forked/execve-ed into a
> new persistent key_agent that is installed on the calling process'
> keyrings just like a key, and with the same lifetime and GC
> expectations as a key.
>
> A key_agent is a user-space process waiting for a realtime signal to
> process a particular key and provide the requested key information,
> which can be installed back onto the calling process' keyrings.
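No key_agent code has been posted, so the following is a purely
hypothetical shape for the registration Ben describes; no such API
exists upstream, and every name in it (key_agent_type,
register_key_agent_type(), and so on) is invented for illustration:

```c
#include <linux/key.h>
#include <linux/key-type.h>
#include <linux/list.h>

/* Hypothetical: one registration per key_type that wants agent-based
 * upcalls instead of the /sbin/request-key usermode helper. */
struct key_agent_type {
	struct list_head link;
	/* the key_type whose upcalls this agent class services,
	 * e.g. "id_resolver" for the NFS idmapper */
	const char *key_type_name;
	/* find (or create) an agent suitable for the caller; invoked
	 * when no key_agent is already linked on the caller's keyrings */
	struct key *(*provide)(struct key_type *type,
			       const char *description);
};

/* request_key() for a matching key_type would then:
 *   1. search the caller's keyrings for a linked key_agent;
 *   2. if none, call ->provide() to locate or spawn one (possibly in
 *      the namespace where the mount occurred);
 *   3. link the agent to the caller's keyrings and signal the agent
 *      process (a realtime signal, in Ben's prototype) to construct
 *      the key. */
int register_key_agent_type(struct key_agent_type *agent);
void unregister_key_agent_type(struct key_agent_type *agent);
```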
>
> Basically, this approach allows a particular user of a keyrings-based
> upcall to specify their own rules about how to provide a namespace
> context for a calling process. It does, however, require extra work
> to create a specific key_agent_type for each individual key_type that
> might want to use this mechanism.
>
> I've been waiting to have a bit more of a proof-of-concept before
> bringing this approach into the discussion. However, it looks like it
> may be important to allow particular users of the upcall their own
> rules about which namespace contexts they might want to use. This
> approach could provide that flexibility.

I was wondering if what you've been doing would help. This does sound
interesting, perhaps I should wait a little before doing much more in
case it can be generalized a little and used here too.

It's likely the current limited implementation I have will also be
useful for upcalls that need a straight "just execute me in the caller
namespace", so it's probably worth continuing it for that case.

> Ben
>
> >
> > I'm not sure yet if I should undo some of what I've done recently
> > or leave it for users who need a straight "execute me now within
> > the current namespace".
> >
> > >
> > > --b.
> > >
> > > > but if we assume that will work at some point then, as Bruce
> > > > points out, we need to provide the ability to record the
> > > > namespaces to be used for subsequent "in namespace" execution
> > > > while maintaining caller isolation (i.e. derived from the
> > > > caller's init process).
> > > >
> > > > I've been aware of the need for persistence for a while now and
> > > > I've been thinking about how to do it, but I don't have a clear
> > > > plan quite yet. Bruce, having noticed this, has described
> > > > details about the environment I have to work with, so that's a
> > > > start. I need the thoughts of others on this too.
> > > >
> > > > As a result I'm not sure yet if this persistence can be
> > > > integrated into the current implementation or if additional
> > > > calls will be needed to set and clear the namespace information
> > > > while maintaining the needed isolation.
> > > >
> > > > As Bruce says, perhaps the namespace information should be
> > > > saved as properties of a mount, or perhaps it should be a list
> > > > keyed by some handle, the handle being the saved property. I'm
> > > > not sure yet, but the latter might be unnecessary complication
> > > > and overhead. The cleanup of the namespace information upon
> > > > summary termination of processes could be a bit difficult, but
> > > > perhaps it will be as simple as making it a function of freeing
> > > > the object it's stored in (in the cases we have so far that
> > > > would be the mount).
> > > >
> > > > So, yes, I've still got a fair way to go yet, ;)
> > > >
> > > > Ian
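On the persistence question Ian raises above (namespace information
stored as a property of the mount, cleaned up when the owning object is
freed), a minimal refcounted container might look like the sketch
below. It assumes nothing from the actual patches: umh_ns_context and
its helpers are invented names, building on the umh_ns_info capture
sketched earlier:

```c
#include <linux/kref.h>
#include <linux/nsproxy.h>
#include <linux/slab.h>
#include <linux/user_namespace.h>

/* Illustrative only: namespace context pinned for the lifetime of
 * the mount that owns it. */
struct umh_ns_context {
	struct kref ref;
	struct nsproxy *nsproxy;
	struct user_namespace *user_ns;
};

static void umh_ns_context_release(struct kref *ref)
{
	struct umh_ns_context *ctx =
		container_of(ref, struct umh_ns_context, ref);

	put_nsproxy(ctx->nsproxy);
	put_user_ns(ctx->user_ns);
	kfree(ctx);
}

/* Called from the owning filesystem's teardown path (kill_sb() or
 * similar), so cleanup is simply a function of freeing the object
 * the context is stored in, as Ian suggests. */
static void umh_ns_context_put(struct umh_ns_context *ctx)
{
	if (ctx)
		kref_put(&ctx->ref, umh_ns_context_release);
}
```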