Date: Sun, 5 Jul 2015 18:39:25 +0100
From: Al Viro <viro@ZenIV.linux.org.uk>
To: jon <jon@jonshouse.co.uk>
Cc: Valdis.Kletnieks@vt.edu, coreutils@gnu.org, linux-kernel@vger.kernel.org
Subject: Re: Feature request, "create on mount" to create mount point
 directory on mount, implied remove on unmount
Message-ID: <20150705173925.GX17109@ZenIV.linux.org.uk>
References: <1435924919.6501.432.camel@jonspc>
 <172423.1436043394@turing-police.cc.vt.edu>
 <1436050108.6501.509.camel@jonspc>
 <20150705142936.GW17109@ZenIV.linux.org.uk>
 <1436111210.16546.29.camel@jonspc>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1436111210.16546.29.camel@jonspc>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4699
Lines: 81

On Sun, Jul 05, 2015 at 04:46:50PM +0100, jon wrote:

> I should have titled it "Feature request from a simple minded user"
> 
> I have not the slightest idea what you are talking about.  
> 
> When I learnt *nix it did not have "name spaces" in reference to process
> tables.  I understand the theory of VM a bit, the model in my mind each
> "machine", be that one kernel on a true processor or a VM instance has
> "a process table" and "a file descriptor table" etc - anything more is
> beyond my current level of knowledge.

File descriptor table isn't something system-wide - it belongs to a process...

Containers are basically glorified process groups.

Anyway, the underlying model hasn't changed much since _way_ back; each
thread of execution is a virtual machine of its own, with actual CPUs
switched between those.  Each of them has memory, ports (== file descriptors)
and traps (== signal handlers).  The main primitives are
	clone() (== rfork() in other branches; plain fork() is just the most
common case) - create a copy of the virtual machine, in the state identical
to that of caller with the exception of different return values given to
child and parent.
	exit() - terminate the virtual machine
	execve() - load a new program
Parts of those virtual machines can be shared - e.g. you can have descriptor
table not just identical to that of parent at the time of clone(), but
actually shared with it, so e.g. open() in child makes the resulting descriptor
visible to parent as well.  Or you can have memory (address space) shared,
so that something like mmap() in parent would affect the memory mappings of
child, etc.  Which components are shared and which are copied is selected
by the flags argument of clone().
	unshare() - switch to using a private copy of the chosen components,
e.g. you might say "from now on, I want my file descriptor table to be
private".  In e.g. Plan 9 that's expressed via rfork() as well.
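
To make the copied-vs-shared distinction concrete, here is a rough sketch,
not anything taken from the kernel tree.  The helper name
child_open_visible_in_parent() is made up for illustration.  It issues a raw
clone syscall with no new stack, so it behaves like fork() plus whatever
flags are passed in; the argument order assumed here (flags first, stack
second) holds on x86-64 and arm64, but not on every architecture.

```c
#include <fcntl.h>
#include <linux/sched.h>	/* CLONE_FILES */
#include <signal.h>		/* SIGCHLD */
#include <sys/syscall.h>	/* SYS_clone */
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: the child open()s a file and reports the descriptor
 * number through a pipe; the parent then checks whether that number is
 * valid in its own table.  Returns 1 if it is, 0 if it is not, -1 on error.
 */
int child_open_visible_in_parent(unsigned long flags)
{
	int p[2], fd = -1, status, visible;
	long pid;

	if (pipe(p) < 0)
		return -1;

	/* Assumption: flags, then stack.  A zero stack means the child
	 * keeps running on its copy of the caller's stack, as in fork(). */
	pid = syscall(SYS_clone, flags | SIGCHLD, 0UL, 0UL, 0UL, 0UL);
	if (pid == 0) {
		/* Child.  Do not close p[]: the table might be shared. */
		fd = open("/dev/null", O_RDONLY);
		_exit(write(p[1], &fd, sizeof(fd)) == (ssize_t)sizeof(fd) ? 0 : 1);
	}
	if (pid < 0 || waitpid(pid, &status, 0) < 0 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0 ||
	    read(p[0], &fd, sizeof(fd)) != (ssize_t)sizeof(fd))
		fd = -1;
	close(p[0]);
	close(p[1]);
	if (fd < 0)
		return -1;

	/* F_GETFD fails with EBADF unless fd is open in *our* table. */
	visible = fcntl(fd, F_GETFD) != -1;
	if (visible)
		close(fd);
	return visible;
}
```

With flags == 0 (plain fork() behaviour) the child gets a copy of the table
and the helper should return 0; with CLONE_FILES the table is shared and it
should return 1.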

Less obvious components include the current directory and root.  Normally,
these are not shared; chdir() done in child won't affect the parent and vice
versa.  You could ask for them to be shared, though - for a multithreaded
program it could be convenient.

Different processes have been able to see different parts of the mount tree
ever since v7 introduced chroot(2).  Namespaces simply allow a *forest* -
different groups of processes seeing different mount trees in that forest.
The same filesystem may be mounted in many places, and the same directory
might be a mountpoint in an instance visible to one process and not a
mountpoint in an instance visible to another (or a mountpoint with something
entirely different mounted in an instance visible to somebody else).
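
On Linux the tree a process belongs to can be told apart by the link
/proc/<pid>/ns/mnt.  The following is a sketch only: the helper name
child_gets_own_mount_tree() is made up, and whether an unprivileged process
may do this at all depends on the system (it assumes either sufficient
privilege or that unprivileged user namespaces are enabled).  The child
moves itself into a separate tree with unshare and the parent compares the
two links.

```c
#include <linux/sched.h>	/* CLONE_NEWNS, CLONE_NEWUSER */
#include <string.h>
#include <sys/syscall.h>	/* SYS_unshare */
#include <sys/wait.h>
#include <unistd.h>

#define NS_LINK "/proc/self/ns/mnt"

/*
 * Hypothetical helper.  Returns 1 if the child ended up in a mount tree
 * different from ours, 0 if that could not be shown here (no such link in
 * /proc, or the kernel refused the request), -1 on unexpected errors.
 */
int child_gets_own_mount_tree(void)
{
	char ours[64], theirs[64];
	int p[2], status;
	ssize_t n;
	pid_t pid;

	n = readlink(NS_LINK, ours, sizeof(ours) - 1);
	if (n < 0)
		return 0;
	ours[n] = '\0';

	if (pipe(p) < 0)
		return -1;
	pid = fork();
	if (pid == 0) {
		close(p[0]);
		/* Plain CLONE_NEWNS needs privilege; a new user namespace
		 * may supply it where the kernel allows that. */
		if (syscall(SYS_unshare, (unsigned long)CLONE_NEWNS) < 0 &&
		    syscall(SYS_unshare,
			    (unsigned long)(CLONE_NEWUSER | CLONE_NEWNS)) < 0)
			_exit(2);
		n = readlink(NS_LINK, theirs, sizeof(theirs) - 1);
		_exit(n > 0 && write(p[1], theirs, n) == n ? 0 : 1);
	}
	close(p[1]);
	if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
		close(p[0]);
		return -1;
	}
	n = read(p[0], theirs, sizeof(theirs) - 1);
	close(p[0]);
	if (WEXITSTATUS(status) == 2)
		return 0;	/* the kernel refused; nothing to show */
	if (WEXITSTATUS(status) != 0 || n <= 0)
		return -1;
	theirs[n] = '\0';
	return strcmp(ours, theirs) != 0 ? 1 : -1;
}
```

The parent's own tree is untouched either way; only the child, and whatever
it would go on to spawn, sees the new one.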

Mount tree is yet another component; the difference is that normally it *is*
shared on clone(), rather than being copied.  I.e. mount() done by child
affects the mount tree visible to parent.   But you still can ask for
a new private copy of mount tree via clone() or unshare().  When the
last process sharing that mount tree exits, it gets dissolved, same as
every file descriptor in a descriptor table gets closed when the last
thread sharing that descriptor table exits (or asks for an unshared copy of
the descriptor table, e.g. as a side effect of execve()).  Just as with file
descriptors close() does not necessarily close the opened file the descriptor
is connected to (that happens only when all descriptors connected to the
given opened file are closed), umount() does not necessarily shut the
filesystem down; that happens only if it's not mounted elsewhere.
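
The close() half of that analogy is easy to check from userland.  A sketch,
with a made-up helper name eof_only_after_last_close(): two descriptors are
connected to the write side of a pipe, and the reader sees end-of-file only
once both of them are gone.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper.  Returns 1 if closing one of two descriptors left
 * the opened file alive and closing the last one shut it down, 0 if not,
 * -1 on error.
 */
int eof_only_after_last_close(void)
{
	int p[2], w2, still_open, eof;
	char c;

	if (pipe(p) < 0)
		return -1;
	w2 = dup(p[1]);		/* second descriptor, same opened file */
	if (w2 < 0 || fcntl(p[0], F_SETFL, O_NONBLOCK) < 0) {
		if (w2 >= 0)
			close(w2);
		close(p[0]);
		close(p[1]);
		return -1;
	}

	close(p[1]);		/* one descriptor gone... */
	/* ...but the write side is still open, so an empty pipe gives
	 * EAGAIN to a non-blocking reader rather than end-of-file. */
	still_open = read(p[0], &c, 1) == -1 && errno == EAGAIN;

	close(w2);		/* last descriptor gone: now it is closed */
	eof = read(p[0], &c, 1) == 0;

	close(p[0]);
	return still_open && eof;
}
```

umount() versus a filesystem mounted in several places follows the same
pattern, just with mounts in place of descriptors.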

With something like Plan 9 that would be pretty much all you need for
isolating process groups into separate environments - just give each
the set of filesystems they should be seeing and be done with that.
We, unfortunately, can't drop certain FPOS APIs (starting with sockets,
with their "network interfaces are magical sets of named objects, names
are not expressed as pathnames, access control and visibility completely
ad-hoc, ditto for listing and renaming" shite), so we get more
state components ;-/  Which leads to e.g. "network namespace" and similar
complications; that crap should've been dealt with in _filesystem_ namespace,
but Occam's Razor be damned, we need to support every misdesigned interface
that got there, no matter how many entities it breeds and how convoluted
the result becomes...  In principle, though, it's still the same model -
only with more components to be possibly shared.