Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261793AbVAMWZ5 (ORCPT ); Thu, 13 Jan 2005 17:25:57 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261789AbVAMWXd (ORCPT ); Thu, 13 Jan 2005 17:23:33 -0500 Received: from parcelfarce.linux.theplanet.co.uk ([195.92.249.252]:40667 "EHLO parcelfarce.linux.theplanet.co.uk") by vger.kernel.org with ESMTP id S261741AbVAMWS5 (ORCPT ); Thu, 13 Jan 2005 17:18:57 -0500 Date: Thu, 13 Jan 2005 22:18:51 +0000 From: Al Viro To: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Subject: [RFC] shared subtrees Message-ID: <20050113221851.GI26051@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7110 Lines: 173 [apologies for delay - there'd been lots of unrelated crap lately] ====================================================================== NOTE: as far as I'm concerned, that's a beginning of VFS-2.7 branch. All that work will stay in a separate tree, with gradual merge back into 2.6 once the things start settling down. ====================================================================== OK, here comes the first draft of proposed semantics for subtree sharing. What we want is being able to propagate events between the parts of mount trees. Below is a description of what I think might be a workable semantics; it does *NOT* describe the data structures I would consider final and there are considerable areas where we still need to figure out the right behaviour. Let's start with introducing a notion of propagation node; I consider it only as a convenient way to describe the desired behaviour - it almost certainly won't be a data structure in the final variant. 1) each p-node corresponds to a group of 1 or more vfsmounts. 2) there is at most 1 p-node containing a given vfsmount. 3) each p-node owns a possibly empty set of p-nodes and vfsmounts 4) no p-node or vfsmount can be owned by more than one p-node 5) only vfsmounts that are not contained in any p-nodes might be owned. 6) no p-node can own (directly or via intermediates) itself (i.e. the graph of p-node ownership is a forest). These guys define propagation: a) if vfsmounts A and B are contained in the same p-node, events propagate from A to B b) if vfsmount A is contained in p-node p, vfsmount B is contained in p-node q and p owns q, events propagate from A to B c) if vfsmount A is contained in p-node p and vfsmount B is owned by p, events propagate from A to B d) propagation is transitive: if events propagate from A to B and from B to C, they propagate from A to C. In other words, members of the same p-node are equivalent and events anywhere in p-node are propagated to all its slaves. Note that not any transitive relation can be represented that way; it has to satisfy the following condition: * A->C and B->C => A->B or B->A All propagation setups we are going to deal with will satisfy that condition. How do we set them up? * we can mark a subtree sharable. Every vfsmount in the subtree that is not already in some p-node gets a single-element p-node of its own. * we can mark a subtree slave. That removes all vfsmounts in the subtree from their p-nodes and makes them owned by said p-nodes. p-nodes that became empty will disappear and everything they used to own will be repossessed by their owners (if any). * we can mark a subtree private. Same as above, but followed by taking all vfsmounts in our subtree and making them *not* owned by anybody. Of course, namespace operations (clone, mount, etc.) affect that structure and are affected by it (that's what it's for, after all). 1. CLONE_NS That one is simple - we copy vfsmounts as usual * if vfsmount A is contained in p-node p, then copy of A goes into the same p-node * if A is owned by p, then copy of A is also owned by p * no new p-nodes are created. 2. mount We have a new vfsmount A and want to attach it to mountpoint somewhere in vfsmount B. If B does not belong to any p-node, everything is as usual; A doesn't become a member or slave of any p-node and is simply attached to B. If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events propagated from B and all p-nodes p1,...,pk that contain them. * A gets cloned into n copies and these copies (A1,...,An) are attached to corresponding points in B1,...,Bn. * k new p-nodes (q1,...,qk) are created * Ai is contained in qj <=> Bi is contained in qj * qi owns qj <=> pi owns pj * qi owns Aj <=> pi owns Bj In other words, mount is propagated and propagation among the new vfsmounts mirrors the propagation between mountpoints. 3. bind bind works almost identically to mount; new vfsmount is created for every place that gets propagation from mountpoint and propagation is set up to mirror that between the mountpoints. However, there is a difference: unlike the case of mount, vfsmount we were going to attach (say it, A) has some history - it was created as a copy of some pre-existing vfsmount V. And that's where the things get interesting: * if V is contained in some p-node p, A is placed into the same p-node. That may require merging one of the p-nodes we'd just created with p (that will be the counterpart of the p-node containing the mountpoint). * if V is owned by some p-node p, then A (or p-node containing A) becomes owned by p. 4. rbind rbind is recursive bind, so we just do binds for everything we had in a subtree we are binding in obvious order; everything is described by previous case. 5. umount umount everything that gets propagation from victim. 6. mount --move prohibited if what we are moving is in some p-node, otherwise we move as usual to intended mountpoint and create copies for everything that gets propagation from there (as we would do for rbind). 7. pivot_root similar to --move How to use all that stuff? Example 1: mount --bind /floppy /floppy mount --make-shared /floppy mount --rbind / /jail mount --make-slave /jail/floppy and we get /floppy in chroot jail slave to /floppy outside - if somebody (u)mounts stuff on it, that will get propagated to jail. Example 2: same, but with the namespaces instead of chroots. Example 3: same subtree visible (and kept in sync) in several places - just mark it shared and rbind; it will stay in sync Example 4: have some daemon control the stuff in a subtree sharable with many namespaces, chroots, etc. without any magic: mark that subtree sharable clone with CLONE_NS parent marks that subtree slave child keeps working on the tree in its private namespace. There's a lot more applications of the same idea, of course - AFS and its ilk, autofs-like stuff (with proper handling of MNT_EXPIRE and traps - see below), etc., etc. Areas where we still have to figure things out: * MNT_EXPIRE handling done right; there are some fun ideas in that area, but they still need to be done in more details (basically, lazy expire - mount in a slave expiring into a trap that would clone a copy from master when stepped upon). * traps and their sharing. What we want is an ability to use the master/slave mechanisms for *all* cross-namespace/cross-chroot issues in autofs, so that daemon would only need to work with the namespace of its own and no nothing about other instances. * implementation ;-) It certainly looks reasonably easy to do; memory demands are linear by number of vfsmounts involved and locking appears to be solvable. * whatever issues that might come up from MVFS demands (and AFS, and...) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/