Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752940AbbGERjl (ORCPT ); Sun, 5 Jul 2015 13:39:41 -0400
Received: from zeniv.linux.org.uk ([195.92.253.2]:44427 "EHLO
        ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752185AbbGERje (ORCPT );
        Sun, 5 Jul 2015 13:39:34 -0400
Date: Sun, 5 Jul 2015 18:39:25 +0100
From: Al Viro
To: jon
Cc: Valdis.Kletnieks@vt.edu, coreutils@gnu.org,
        linux-kernel@vger.kernel.org
Subject: Re: Feature request, "create on mount" to create mount point
        directory on mount, implied remove on unmount
Message-ID: <20150705173925.GX17109@ZenIV.linux.org.uk>
References: <1435924919.6501.432.camel@jonspc>
        <172423.1436043394@turing-police.cc.vt.edu>
        <1436050108.6501.509.camel@jonspc>
        <20150705142936.GW17109@ZenIV.linux.org.uk>
        <1436111210.16546.29.camel@jonspc>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1436111210.16546.29.camel@jonspc>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4699
Lines: 81

On Sun, Jul 05, 2015 at 04:46:50PM +0100, jon wrote:

> I should have titled it "Feature request from a simple minded user"
>
> I have not the slightest idea what you are talking about.
>
> When I learnt *nix it did not have "name spaces" in reference to process
> tables. I understand the theory of VM a bit, the model in my mind each
> "machine", be that one kernel on a true processor or a VM instance has
> "a process table" and "a file descriptor table" etc - anything more is
> beyond my current level of knowledge.

File descriptor table isn't something system-wide - it belongs to a
process...

Containers are basically glorified process groups.

Anyway, the underlying model hasn't changed much since _way_ back; each
thread of execution is a virtual machine of its own, with actual CPUs
switched between those.
Each of them has memory, ports (== file descriptors) and traps
(== signal handlers). The main primitives are
        clone() (== rfork() in other branches; plain fork() is just the
most common case) - create a copy of the virtual machine, in a state
identical to that of the caller with the exception of different return
values given to child and parent.
        exit() - terminate the virtual machine
        execve() - load a new program

Parts of those virtual machines can be shared - e.g. you can have the
descriptor table not just identical to that of the parent at the time of
clone(), but actually shared with it, so e.g. open() in the child makes
the resulting descriptor visible to the parent as well. Or you can have
memory (address space) shared, so that something like mmap() in the
parent would affect the memory mappings of the child, etc.

Which components are to be shared and which are to be copied is selected
by the clone() argument.

unshare() allows switching to a private copy of chosen components - e.g.
you might say "from now on, I want my file descriptor table to be
private". In e.g. Plan 9 that's expressed via rfork() as well.

Less obvious components include current directory and root. Normally,
these are not shared; chdir() done in the child won't affect the parent
and vice versa. You could ask for them to be shared, though - for a
multithreaded program it could be convenient.

Different processes might have seen different parts of the mount tree
ever since v7 introduced chroot(2). Namespaces simply allow having a
*forest* - different groups of processes seeing different mount trees in
that forest. The same filesystem may be mounted in many places, and the
same directory might be a mountpoint in an instance visible to one
process and not a mountpoint in an instance visible to another (or a
mountpoint with something entirely different mounted in an instance
visible to somebody else).

Mount tree is yet another component; the difference is that normally it
*is* shared on clone(), rather than being copied. I.e.
mount() done by child affects the mount tree visible to parent. But you
still can ask for a new private copy of the mount tree via clone() or
unshare(). When the last process sharing that mount tree exits, it gets
dissolved, same as every file descriptor in a descriptor table gets
closed when the last thread sharing that descriptor table exits (or asks
for an unshared copy of the descriptor table, e.g. as a side effect of
execve()).

Just as with file descriptors, where close() does not necessarily close
the opened file the descriptor is connected to (that happens only when
all descriptors connected to the given opened file are closed), umount()
does not necessarily shut the filesystem down; that happens only if it's
not mounted elsewhere.

With something like Plan 9 that would be pretty much all you need for
isolating process groups into separate environments - just give each the
set of filesystems they should be seeing and be done with that. We,
unfortunately, can't drop certain FPOS APIs (starting with sockets, with
their "network interfaces are magical sets of named objects, names are
not expressed as pathnames, access control and visibility completely
ad-hoc, ditto for listing and renaming" shite), so we get more state
components ;-/

Which leads to e.g. "network namespace" and similar complications; that
crap should've been dealt with in _filesystem_ namespace, but Occam's
Razor be damned, we need to support every misdesigned interface that got
there, no matter how many entities it breeds and how convoluted the
result becomes...

In principle, though, it's still the same model - only with more
components to be possibly shared.
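
Two of the points above can be tried from userland without any
privileges. What follows is only a minimal sketch, assuming a POSIX
system and Python's os module; the function names are made up for
illustration. The first function shows that the current directory is
copied rather than shared on a plain fork(), the second that close() on
one descriptor leaves the opened file alive while another descriptor is
still connected to it.

```python
import os
import tempfile


def child_chdir_leaves_parent_alone():
    """Plain fork(): the child gets its own copy of the current directory."""
    before = os.getcwd()
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: chdir() only touches the child's private copy.
        try:
            os.close(r)
            os.chdir("/")
            os.write(w, os.getcwd().encode())
        finally:
            os._exit(0)
    # Parent: collect what the child saw, then look at our own cwd.
    os.close(w)
    child_cwd = os.read(r, 4096).decode()
    os.close(r)
    os.waitpid(pid, 0)
    return before, os.getcwd(), child_cwd


def file_survives_one_close():
    """close() drops one descriptor; the opened file lives on via the other."""
    fd, path = tempfile.mkstemp()
    try:
        dup_fd = os.dup(fd)               # second descriptor, same opened file
        os.write(fd, b"hello")
        os.close(fd)                      # one descriptor gone, file still open
        os.lseek(dup_fd, 0, os.SEEK_SET)  # the offset belongs to the opened file
        data = os.read(dup_fd, 5)
        os.close(dup_fd)                  # last descriptor: now it is really closed
        return data
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(child_chdir_leaves_parent_alone())
    print(file_survives_one_close())
```

The shared variants (descriptor table, current directory or mount tree
shared between tasks) need clone() with the matching flags, which this
sketch does not attempt.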