Here is a set of patches to define a container object for the kernel and
to provide some methods to create and manipulate such objects.
The reason I think this is necessary is that the kernel has no idea how to
direct upcalls to what userspace considers to be a container - current
Linux practice appears to make a "container" just an arbitrarily chosen
junction of namespaces, control groups and files, which may be changed
individually within the "container".
The kernel upcall mechanism then needs to decide in which set of namespaces,
etc., it must exec the appropriate upcall program. Examples of this
include:
(1) The DNS resolver. The DNS cache in the kernel should probably be
per-network namespace, but in userspace the program, its libraries and
its config data are associated with a mount tree and a user namespace
and it gets run in a particular pid namespace.
(2) NFS ID mapper. The NFS ID mapping cache should also probably be
per-network namespace.
(3) nfsdcltrack. A way for NFSD to access stable storage for tracking
of persistent state. Again, network-namespace dependent, but also
perhaps mount-namespace dependent.
(4) General request-key upcalls. Not particularly namespace dependent,
apart from keyrings being somewhat governed by the user namespace and
the upcall being configured by the mount namespace.
These patches are built on top of the mount context patchset so that
namespaces can be properly propagated over submounts/automounts.
These patches implement a container object (sketched below) that holds the
following things:
(1) Namespaces.
(2) A root directory.
(3) A set of processes, including a designated 'init' process.
(4) The creator's credentials, including ownership.
(5) A place to hang security for the container, allowing policies to be
set per-container.
I also want to add:
(6) Control groups.
(7) A per-container keyring that can be added to from outside of the
container, even once the container is live, for the provision of
filesystem authentication/encryption keys in advance of the container
being started.
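For reference, the container object that these patches implement looks
roughly like this (a sketch reconstructed from the include/linux/container.h
changes later in the series; exact field types and layout are illustrative
only):

	struct container {
		char			name[24];
		refcount_t		usage;
		unsigned long		flags;
		const struct cred	*cred;		/* Creator's creds, incl. ownership */
		struct nsproxy		*ns;		/* This container's namespaces */
		struct path		root;		/* Root of the container's fs namespace */
		struct task_struct	*init;		/* The 'init' task for this container */
		struct container	*parent;	/* Parent of this container */
		void			*security;	/* LSM data */
		struct list_head	members;	/* Member processes, guarded with ->lock */
		struct list_head	children;	/* Child containers */
		struct list_head	child_link;	/* Link in parent->children */
		spinlock_t		lock;
		seqcount_t		seq;		/* Protects ->root */
	};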
You can get a list of containers by examining /proc/containers - but I'm
not sure how much value this gets you. Note that the container in which
you are running is called "<current>" and you can only see other containers
that were started from within yours. Containers are therefore effectively
hierarchical and an init_container is set up when the system boots.
Some management operations are provided (a combined usage sketch follows the
list):
(1) int fd = container_create(const char *name, unsigned int flags);
Create a container of the given name and return a handle to it as a
file descriptor. flags indicates which namespaces should be inherited
from the caller and which should be created anew. It is possible to
set up a container with a null root filesystem that can be mounted
later.
(2) int fsfd = fsopen(const char *fsname, int container_fd,
unsigned int flags);
Prepare a mount context inside the container. This uses all of the
container's namespaces instead of the caller's.
(3) fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
unsigned int flags);
Mount a prepared superblock. dfd can be given container_fd to use the
container to which it refers as the root of the pathwalk.
If path is "/" and at_flags is AT_FSMOUNT_CONTAINER_ROOT, then this
will attempt to mount the root of the container and create a mount
namespace for it. The container must've been created with
CONTAINER_NEW_EMPTY_FS_NS.
(4) pid_t pid = fork_into_container(int container_fd);
Create the init process in a container. The process uses that
container's namespaces instead of the caller's.
(5) int sfd = container_socket(int container_fd,
int domain, int type, int protocol);
Create a socket inside a container. The socket gets the container's
namespaces. This allows netlink operations to be called within that
container to set it up from outside (at least in theory).
(6) mkdirat(int dfd, ...);
mknodat(int dfd, ...);
openat(int dfd, ...);
Supplying a container fd as dfd makes the pathwalk happen relative to
the root of the container. Note that the path must be *relative*.
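To illustrate how these operations fit together, here is a minimal usage
sketch (error handling and the syscall wrappers are elided; see the sample
program at the end of the series for a fuller version - the flag combination
and netlink details here are illustrative only):

	int cfd, sfd, mfd;
	pid_t pid;

	/* Create a container with some fresh namespaces. */
	cfd = container_create("foo-test",
			       CONTAINER_NEW_UTS_NS | CONTAINER_NEW_IPC_NS |
			       CONTAINER_NEW_PID_NS | CONTAINER_KILL_ON_CLOSE);

	/* Prepare and mount procfs inside the container. */
	mfd = fsopen("proc", cfd, 0);
	write(mfd, "x create", 8);
	fsmount(mfd, cfd, "proc", 0, 0);	/* Note the relative path */
	close(mfd);

	/* Create a directory relative to the container's root. */
	mkdirat(cfd, "tmp", 01777);

	/* Create a netlink socket inside the container's namespaces. */
	sfd = container_socket(cfd, AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE);

	/* Spawn the container's 'init' process. */
	pid = fork_into_container(cfd);
	if (pid == 0)
		execl("/sbin/init", "init", NULL);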
And some that need to be or could be added:
(7) Directly set a container's namespaces to allow cross-container
sharing.
(8) Adjust the control group membership of a container.
(9) Add a key inside a container keyring.
(10) Kill/suspend/freeze/reboot container, both from inside and out.
(11) Set container's root dir.
(12) Set the container's security policy.
(13) Allow overlayfs to access filesystems outside of the container in
which it is being created.
Kernel upcalls are invoked in the root of the container that incurs them
rather than in the init namespace context. There's still some awkwardness
here if you, say, share a network namespace between containers. Either the
upcall binaries and configuration must be duplicated between sharing
containers or a container must be elected as the one in which such upcalls
will be done.
Some further thoughts:
(*) Should there be an AT_IN_CONTAINER flag to provide to syscalls that
take a container fd in lieu of AT_FDCWD or a directory fd? The problem
is that syscalls such as mkdirat() and openat() don't have an at_flags
argument.
(*) Should there be a container hierarchy at all? It seems that this is
only really necessary for /proc/containers. Do we want to allow
containers-within-containers?
(*) Should each container automatically have its own pid namespace such
that its 'init' process always appears as pid 1?
(*) Does this allow kernel upcalls to be accounted against the correct
control group?
(*) Should each container have a 'list' of accessible device numbers such
that certain device files can be made usable within a container? And
can devtmpfs/udev be made to show the correct file set for each
container?
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container
Note that this is dependent on the mount-context branch.
David
---
David Howells (9):
containers: Rename linux/container.h to linux/container_dev.h
Implement containers as kernel objects
Provide /proc/containers
Allow processes to be forked and upcalled into a container
Open a socket inside a container
Allow fs syscall dfd arguments to take a container fd
Make fsopen() able to initiate mounting into a container
Honour CONTAINER_NEW_EMPTY_FS_NS
Sample program for driving container objects
arch/x86/entry/syscalls/syscall_32.tbl | 3
arch/x86/entry/syscalls/syscall_64.tbl | 3
drivers/acpi/container.c | 2
drivers/base/container.c | 2
fs/fsopen.c | 33 +-
fs/libfs.c | 3
fs/namei.c | 52 ++-
fs/namespace.c | 108 +++++-
fs/nfs/namespace.c | 2
fs/nfs/nfs4namespace.c | 4
fs/proc/root.c | 13 +
fs/sb_config.c | 29 +-
include/linux/container.h | 91 ++++-
include/linux/container_dev.h | 25 +
include/linux/cred.h | 3
include/linux/init_task.h | 4
include/linux/kmod.h | 1
include/linux/lsm_hooks.h | 25 +
include/linux/mount.h | 5
include/linux/nsproxy.h | 7
include/linux/pid.h | 5
include/linux/proc_ns.h | 3
include/linux/sb_config.h | 5
include/linux/sched.h | 3
include/linux/sched/task.h | 4
include/linux/security.h | 20 +
include/linux/syscalls.h | 6
include/uapi/linux/container.h | 28 ++
include/uapi/linux/fcntl.h | 2
include/uapi/linux/magic.h | 1
init/Kconfig | 7
init/main.c | 4
kernel/Makefile | 2
kernel/container.c | 576 ++++++++++++++++++++++++++++++++
kernel/cred.c | 45 ++-
kernel/exit.c | 1
kernel/fork.c | 117 ++++++-
kernel/kmod.c | 13 +
kernel/kthread.c | 3
kernel/namespaces.h | 15 +
kernel/nsproxy.c | 34 +-
kernel/pid.c | 4
kernel/sys_ni.c | 5
net/socket.c | 37 ++
samples/containers/test-container.c | 162 +++++++++
security/security.c | 18 +
security/selinux/hooks.c | 5
47 files changed, 1408 insertions(+), 132 deletions(-)
create mode 100644 include/linux/container_dev.h
create mode 100644 include/uapi/linux/container.h
create mode 100644 kernel/container.c
create mode 100644 kernel/namespaces.h
create mode 100644 samples/containers/test-container.c
Allow a single process to be forked directly into a container using a new
syscall:
pid_t pid = fork_into_container(int container_fd);
Further attempts to fork into the container will be rejected.
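For example (mirroring the sample program later in the series; error
checking elided):

	int ws;
	int cfd = container_create("foo", CONTAINER_NEW_PID_NS);
	pid_t pid = fork_into_container(cfd);

	switch (pid) {
	case -1:
		perror("fork_into_container");
		break;
	case 0:
		/* Child: running as the container's 'init' process. */
		execl("/bin/sh", "sh", NULL);
		_exit(1);
	default:
		/* Parent: a second fork_into_container(cfd) would fail. */
		waitpid(pid, &ws, 0);
	}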
Kernel upcalls will happen in the context of current's container, using
that container's namespaces.
Signed-off-by: David Howells <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
include/linux/cred.h | 3 +
include/linux/kmod.h | 1
include/linux/lsm_hooks.h | 4 +
include/linux/nsproxy.h | 7 ++
include/linux/sched/task.h | 4 +
include/linux/security.h | 5 +
include/linux/syscalls.h | 1
init/main.c | 4 +
kernel/cred.c | 45 ++++++++++++
kernel/fork.c | 117 ++++++++++++++++++++++++++------
kernel/kmod.c | 13 +++-
kernel/kthread.c | 3 +
kernel/nsproxy.c | 13 +++-
kernel/sys_ni.c | 2 -
security/security.c | 5 +
security/selinux/hooks.c | 3 +
18 files changed, 188 insertions(+), 44 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 9ccd0f52f874..0d5a9875ead2 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -394,3 +394,4 @@
385 i386 fsopen sys_fsopen
386 i386 fsmount sys_fsmount
387 i386 container_create sys_container_create
+388 i386 fork_into_container sys_fork_into_container
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index dab92591511e..e4005cc579b6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -342,6 +342,7 @@
333 common fsopen sys_fsopen
334 common fsmount sys_fsmount
335 common container_create sys_container_create
+336 common fork_into_container sys_fork_into_container
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/cred.h b/include/linux/cred.h
index b03e7d049a64..834f10962014 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -23,6 +23,7 @@
struct cred;
struct inode;
+struct container;
/*
* COW Supplementary groups list
@@ -149,7 +150,7 @@ struct cred {
extern void __put_cred(struct cred *);
extern void exit_creds(struct task_struct *);
-extern int copy_creds(struct task_struct *, unsigned long);
+extern int copy_creds(struct task_struct *, unsigned long, struct container *);
extern const struct cred *get_task_cred(struct task_struct *);
extern struct cred *cred_alloc_blank(void);
extern struct cred *prepare_creds(void);
diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index c4e441e00db5..7f004a261a1c 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -56,6 +56,7 @@ struct file;
struct subprocess_info {
struct work_struct work;
struct completion *complete;
+ struct container *container;
const char *path;
char **argv;
char **envp;
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7b0d484a6a25..37ac19645cca 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -564,6 +564,7 @@
* Check permission before creating a child process. See the clone(2)
* manual page for definitions of the @clone_flags.
* @clone_flags contains the flags indicating what should be shared.
+ * @container indicates the container the task is being created in (or NULL)
* Return 0 if permission is granted.
* @task_alloc:
* @task task being allocated.
@@ -1535,7 +1536,8 @@ union security_list_options {
int (*file_receive)(struct file *file);
int (*file_open)(struct file *file, const struct cred *cred);
- int (*task_create)(unsigned long clone_flags);
+ int (*task_create)(unsigned long clone_flags,
+ struct container *container);
int (*task_alloc)(struct task_struct *task, unsigned long clone_flags);
void (*task_free)(struct task_struct *task);
int (*cred_alloc_blank)(struct cred *cred, gfp_t gfp);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index ac0d65bef5d0..40478a65ab0a 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -10,6 +10,7 @@ struct ipc_namespace;
struct pid_namespace;
struct cgroup_namespace;
struct fs_struct;
+struct container;
/*
* A structure to contain pointers to all per-process
@@ -62,9 +63,13 @@ extern struct nsproxy init_nsproxy;
* * /
* task_unlock(task);
*
+ * 4. Container namespaces are set at container creation and cannot be
+ * changed.
+ *
*/
-int copy_namespaces(unsigned long flags, struct task_struct *tsk);
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+ struct container *container);
void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
void free_nsproxy(struct nsproxy *ns);
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a978d7189cfd..025193fd0260 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -70,10 +70,10 @@ extern void do_group_exit(int);
extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long, struct container *);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
-extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
+extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags, struct container *);
extern void free_task(struct task_struct *tsk);
diff --git a/include/linux/security.h b/include/linux/security.h
index 01bdf7637ec6..ac8625b72d0e 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -314,7 +314,7 @@ int security_file_send_sigiotask(struct task_struct *tsk,
struct fown_struct *fown, int sig);
int security_file_receive(struct file *file);
int security_file_open(struct file *file, const struct cred *cred);
-int security_task_create(unsigned long clone_flags);
+int security_task_create(unsigned long clone_flags, struct container *container);
int security_task_alloc(struct task_struct *task, unsigned long clone_flags);
void security_task_free(struct task_struct *task);
int security_cred_alloc_blank(struct cred *cred, gfp_t gfp);
@@ -885,7 +885,8 @@ static inline int security_file_open(struct file *file,
return 0;
}
-static inline int security_task_create(unsigned long clone_flags)
+static inline int security_task_create(unsigned long clone_flags,
+ struct container *container)
{
return 0;
}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 5a0324dd024c..7ca6c287ce84 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -911,5 +911,6 @@ asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at
asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
unsigned long spare3, unsigned long spare4,
unsigned long spare5);
+asmlinkage long sys_fork_into_container(int containerfd);
#endif
diff --git a/init/main.c b/init/main.c
index f866510472d7..f638cb44826a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -397,9 +397,9 @@ static noinline void __ref rest_init(void)
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
- kernel_thread(kernel_init, NULL, CLONE_FS);
+ kernel_thread(kernel_init, NULL, CLONE_FS, NULL);
numa_default_policy();
- pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
+ pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES, NULL);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
diff --git a/kernel/cred.c b/kernel/cred.c
index 2bc66075740f..363ccd333267 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -312,6 +312,43 @@ struct cred *prepare_exec_creds(void)
}
/*
+ * Handle forking a process into a container.
+ */
+static struct cred *copy_container_creds(struct container *container)
+{
+ struct cred *new;
+
+ validate_process_creds();
+
+ new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
+ if (!new)
+ return NULL;
+
+ kdebug("prepare_creds() alloc %p", new);
+
+ memcpy(new, container->cred, sizeof(struct cred));
+
+ atomic_set(&new->usage, 1);
+ set_cred_subscribers(new, 0);
+ get_group_info(new->group_info);
+ get_uid(new->user);
+ get_user_ns(new->user_ns);
+
+#ifdef CONFIG_SECURITY
+ new->security = NULL;
+#endif
+
+ if (security_prepare_creds(new, container->cred, GFP_KERNEL) < 0)
+ goto error;
+ validate_creds(new);
+ return new;
+
+error:
+ abort_creds(new);
+ return NULL;
+}
+
+/*
* Copy credentials for the new process created by fork()
*
* We share if we can, but under some circumstances we have to generate a new
@@ -320,7 +357,8 @@ struct cred *prepare_exec_creds(void)
* The new process gets the current process's subjective credentials as its
* objective and subjective credentials
*/
-int copy_creds(struct task_struct *p, unsigned long clone_flags)
+int copy_creds(struct task_struct *p, unsigned long clone_flags,
+ struct container *container)
{
struct cred *new;
int ret;
@@ -341,7 +379,10 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
return 0;
}
- new = prepare_creds();
+ if (container)
+ new = copy_container_creds(container);
+ else
+ new = prepare_creds();
if (!new)
return -ENOMEM;
diff --git a/kernel/fork.c b/kernel/fork.c
index ff2779426fe9..d185c13820d7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1241,9 +1241,33 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
return retval;
}
-static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_fs(unsigned long clone_flags, struct task_struct *tsk,
+ struct container *container)
{
struct fs_struct *fs = current->fs;
+
+#ifdef CONFIG_CONTAINERS
+ if (container) {
+ fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);
+ if (!fs)
+ return -ENOMEM;
+
+ fs->users = 1;
+ fs->in_exec = 0;
+ spin_lock_init(&fs->lock);
+ seqcount_init(&fs->seq);
+ fs->umask = 0022;
+
+ spin_lock(&container->lock);
+ fs->pwd = fs->root = container->root;
+ path_get(&fs->root);
+ path_get(&fs->pwd);
+ spin_unlock(&container->lock);
+ tsk->fs = fs;
+ return 0;
+ }
+#endif
+
if (clone_flags & CLONE_FS) {
/* tsk->fs is already what we want */
spin_lock(&fs->lock);
@@ -1521,7 +1545,8 @@ static __latent_entropy struct task_struct *copy_process(
struct pid *pid,
int trace,
unsigned long tls,
- int node)
+ int node,
+ struct container *container)
{
int retval;
struct task_struct *p;
@@ -1568,7 +1593,7 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}
- retval = security_task_create(clone_flags);
+ retval = security_task_create(clone_flags, container);
if (retval)
goto fork_out;
@@ -1594,7 +1619,7 @@ static __latent_entropy struct task_struct *copy_process(
}
current->flags &= ~PF_NPROC_EXCEEDED;
- retval = copy_creds(p, clone_flags);
+ retval = copy_creds(p, clone_flags, container);
if (retval < 0)
goto bad_fork_free;
@@ -1713,7 +1738,7 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_files(clone_flags, p);
if (retval)
goto bad_fork_cleanup_semundo;
- retval = copy_fs(clone_flags, p);
+ retval = copy_fs(clone_flags, p, container);
if (retval)
goto bad_fork_cleanup_files;
retval = copy_sighand(clone_flags, p);
@@ -1725,15 +1750,15 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_mm(clone_flags, p);
if (retval)
goto bad_fork_cleanup_signal;
- retval = copy_namespaces(clone_flags, p);
+ retval = copy_container(clone_flags, p, container);
if (retval)
goto bad_fork_cleanup_mm;
- retval = copy_container(clone_flags, p, NULL);
+ retval = copy_namespaces(clone_flags, p, container);
if (retval)
- goto bad_fork_cleanup_namespaces;
+ goto bad_fork_cleanup_container;
retval = copy_io(clone_flags, p);
if (retval)
- goto bad_fork_cleanup_container;
+ goto bad_fork_cleanup_namespaces;
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
goto bad_fork_cleanup_io;
@@ -1921,10 +1946,10 @@ static __latent_entropy struct task_struct *copy_process(
bad_fork_cleanup_io:
if (p->io_context)
exit_io_context(p);
-bad_fork_cleanup_container:
- exit_container(p);
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
+bad_fork_cleanup_container:
+ exit_container(p);
bad_fork_cleanup_mm:
if (p->mm)
mmput(p->mm);
@@ -1976,7 +2001,7 @@ struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
- cpu_to_node(cpu));
+ cpu_to_node(cpu), NULL);
if (!IS_ERR(task)) {
init_idle_pids(task->pids);
init_idle(task, cpu);
@@ -1988,15 +2013,16 @@ struct task_struct *fork_idle(int cpu)
/*
* Ok, this is the main fork-routine.
*
- * It copies the process, and if successful kick-starts
- * it and waits for it to finish using the VM if required.
+ * It copies the process into the specified container, and if successful
+ * kick-starts it and waits for it to finish using the VM if required.
*/
long _do_fork(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
- unsigned long tls)
+ unsigned long tls,
+ struct container *container)
{
struct task_struct *p;
int trace = 0;
@@ -2020,8 +2046,32 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}
+ if (container) {
+ /* A process spawned into a container doesn't share anything
+ * with the parent other than namespaces.
+ */
+ if (clone_flags & (CLONE_CHILD_CLEARTID |
+ CLONE_CHILD_SETTID |
+ CLONE_FILES |
+ CLONE_FS |
+ CLONE_IO |
+ CLONE_PARENT |
+ CLONE_PARENT_SETTID |
+ CLONE_PTRACE |
+ CLONE_SETTLS |
+ CLONE_SIGHAND |
+ CLONE_SYSVSEM |
+ CLONE_THREAD))
+ return -EINVAL;
+
+ /* However, we do have to let kernel threads borrow a VM. */
+ if ((clone_flags & CLONE_VM) && current->mm)
+ return -EINVAL;
+ }
+
p = copy_process(clone_flags, stack_start, stack_size,
- child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+ child_tidptr, NULL, trace, tls, NUMA_NO_NODE,
+ container);
add_latent_entropy();
/*
* Do this prior waking up the new thread - the thread pointer
@@ -2073,24 +2123,25 @@ long do_fork(unsigned long clone_flags,
int __user *child_tidptr)
{
return _do_fork(clone_flags, stack_start, stack_size,
- parent_tidptr, child_tidptr, 0);
+ parent_tidptr, child_tidptr, 0, NULL);
}
#endif
/*
* Create a kernel thread.
*/
-pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
+pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags,
+ struct container *container)
{
return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
- (unsigned long)arg, NULL, NULL, 0);
+ (unsigned long)arg, NULL, NULL, 0, container);
}
#ifdef __ARCH_WANT_SYS_FORK
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
- return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+ return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, NULL);
#else
/* can not support in nommu mode */
return -EINVAL;
@@ -2102,10 +2153,31 @@ SYSCALL_DEFINE0(fork)
SYSCALL_DEFINE0(vfork)
{
return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
- 0, NULL, NULL, 0);
+ 0, NULL, NULL, 0, NULL);
}
#endif
+SYSCALL_DEFINE1(fork_into_container, int, containerfd)
+{
+#ifdef CONFIG_CONTAINERS
+ struct fd f = fdget(containerfd);
+ int ret;
+
+ if (!f.file)
+ return -EBADF;
+ ret = -EINVAL;
+ if (is_container_file(f.file)) {
+ struct container *c = f.file->private_data;
+
+ ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, c);
+ }
+ fdput(f);
+ return ret;
+#else
+ return -ENOSYS;
+#endif
+}
+
#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
@@ -2130,7 +2202,8 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
unsigned long, tls)
#endif
{
- return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+ return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls,
+ NULL);
}
#endif
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 563f97e2be36..1857a3bb9e61 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -42,6 +42,7 @@
#include <linux/ptrace.h>
#include <linux/async.h>
#include <linux/uaccess.h>
+#include <linux/container.h>
#include <trace/events/module.h>
@@ -160,7 +161,7 @@ int __request_module(bool wait, const char *fmt, ...)
* would be to run the parents of this process, counting how many times
* kmod was invoked. That would mean accessing the internals of the
* process tables to get the command line, proc_pid_cmdline is static
- * and it is not worth changing the proc code just to handle this case.
+ * and it is not worth changing the proc code just to handle this case.
* KAO.
*
* "trace the ppid" is simple, but will fail if someone's
@@ -194,6 +195,7 @@ static void call_usermodehelper_freeinfo(struct subprocess_info *info)
{
if (info->cleanup)
(*info->cleanup)(info);
+ put_container(info->container);
kfree(info);
}
@@ -274,7 +276,8 @@ static void call_usermodehelper_exec_sync(struct subprocess_info *sub_info)
/* If SIGCLD is ignored sys_wait4 won't populate the status. */
kernel_sigaction(SIGCHLD, SIG_DFL);
- pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
+ pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD,
+ sub_info->container);
if (pid < 0) {
sub_info->retval = pid;
} else {
@@ -335,7 +338,7 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
* that always ignores SIGCHLD to ensure auto-reaping.
*/
pid = kernel_thread(call_usermodehelper_exec_async, sub_info,
- CLONE_PARENT | SIGCHLD);
+ CLONE_PARENT | SIGCHLD, sub_info->container);
if (pid < 0) {
sub_info->retval = pid;
umh_complete(sub_info);
@@ -531,6 +534,8 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
+ sub_info->container = current->container;
+
#ifdef CONFIG_STATIC_USERMODEHELPER
sub_info->path = CONFIG_STATIC_USERMODEHELPER_PATH;
#else
@@ -564,6 +569,8 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
DECLARE_COMPLETION_ONSTACK(done);
int retval = 0;
+ get_container(sub_info->container);
+
if (!sub_info->path) {
call_usermodehelper_freeinfo(sub_info);
return -EINVAL;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528c1d88..ca0090f90645 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -251,7 +251,8 @@ static void create_kthread(struct kthread_create_info *create)
current->pref_node_fork = create->node;
#endif
/* We want our own signal handler (we take no signals by default). */
- pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
+ pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD,
+ NULL);
if (pid < 0) {
/* If user was SIGKILLed, I release the structure. */
struct completion *done = xchg(&create->done, NULL);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4bb5184b3a80..9743cf23df93 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -136,12 +136,19 @@ struct nsproxy *create_new_namespaces(unsigned long flags,
* called from clone. This now handles copy for nsproxy and all
* namespaces therein.
*/
-int copy_namespaces(unsigned long flags, struct task_struct *tsk)
+int copy_namespaces(unsigned long flags, struct task_struct *tsk,
+ struct container *container)
{
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
+ if (container) {
+ get_nsproxy(container->ns);
+ tsk->nsproxy = container->ns;
+ return 0;
+ }
+
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWCGROUP)))) {
@@ -151,7 +158,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
-
+
/*
* CLONE_NEWIPC must detach from the undolist: after switching
* to a new ipc namespace, the semaphore arrays from the old
@@ -163,7 +170,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
(CLONE_NEWIPC | CLONE_SYSVSEM))
return -EINVAL;
- new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
+ new_ns = create_new_namespaces(flags, old_ns, user_ns, tsk->fs);
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 99b1e1f58d05..b685ffe3591f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -265,4 +265,4 @@ cond_syscall(sys_fsmount);
/* Containers */
cond_syscall(sys_container_create);
-
+cond_syscall(sys_fork_into_container);
diff --git a/security/security.c b/security/security.c
index b5c5b5ae1266..21e14aa26cd3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -961,9 +961,10 @@ int security_file_open(struct file *file, const struct cred *cred)
return fsnotify_perm(file, MAY_OPEN);
}
-int security_task_create(unsigned long clone_flags)
+int security_task_create(unsigned long clone_flags,
+ struct container *container)
{
- return call_int_hook(task_create, 0, clone_flags);
+ return call_int_hook(task_create, 0, clone_flags, container);
}
int security_task_alloc(struct task_struct *task, unsigned long clone_flags)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 877b7e7bd2d5..23bdbb0c2de5 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3865,7 +3865,8 @@ static int selinux_file_open(struct file *file, const struct cred *cred)
/* task security operations */
-static int selinux_task_create(unsigned long clone_flags)
+static int selinux_task_create(unsigned long clone_flags,
+ struct container *container)
{
u32 sid = current_sid();
Provide /proc/containers to view the current container and all the
containers created within it:
# ./foo-container
NAME                     USE FL OWNER GROUP
<current>                141 01     0     0
foo-test                   1 04     0     0
I'm not sure whether this is really desirable, though.
Signed-off-by: David Howells <[email protected]>
---
kernel/container.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 104 insertions(+)
diff --git a/kernel/container.c b/kernel/container.c
index eef1566835eb..d5849c07a76b 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -22,6 +22,7 @@
#include <linux/syscalls.h>
#include <linux/printk.h>
#include <linux/security.h>
+#include <linux/proc_fs.h>
#include "namespaces.h"
struct container init_container = {
@@ -70,6 +71,108 @@ void put_container(struct container *c)
}
}
+static void *container_proc_start(struct seq_file *m, loff_t *_pos)
+{
+ struct container *c = m->private;
+ struct list_head *p;
+ loff_t pos = *_pos;
+
+ spin_lock(&c->lock);
+
+ if (pos <= 1) {
+ *_pos = 1;
+ return (void *)1UL; /* Banner on first line */
+ }
+
+ if (pos == 2)
+ return m->private; /* Current container on second line */
+
+ /* Subordinate containers thereafter */
+ p = c->children.next;
+	pos -= 3;
+	for (; pos > 0 && p != &c->children; pos--) {
+ p = p->next;
+ }
+
+ if (p == &c->children)
+ return NULL;
+ return container_of(p, struct container, child_link);
+}
+
+static void *container_proc_next(struct seq_file *m, void *v, loff_t *_pos)
+{
+ struct container *c = m->private, *vc = v;
+ struct list_head *p;
+ loff_t pos = *_pos;
+
+ pos++;
+ *_pos = pos;
+ if (pos == 2)
+ return c; /* Current container on second line */
+
+ if (pos == 3)
+ p = &c->children;
+ else
+ p = &vc->child_link;
+ p = p->next;
+ if (p == &c->children)
+ return NULL;
+ return container_of(p, struct container, child_link);
+}
+
+static void container_proc_stop(struct seq_file *m, void *v)
+{
+ struct container *c = m->private;
+
+ spin_unlock(&c->lock);
+}
+
+static int container_proc_show(struct seq_file *m, void *v)
+{
+ struct user_namespace *uns = current_user_ns();
+ struct container *c = v;
+ const char *name;
+
+ if (v == (void *)1UL) {
+	seq_puts(m, "NAME                     USE FL OWNER GROUP\n");
+ return 0;
+ }
+
+ name = (c == m->private) ? "<current>" : c->name;
+	seq_printf(m, "%-24s %3u %02lx %5d %5d\n",
+ name, refcount_read(&c->usage), c->flags,
+ from_kuid_munged(uns, c->cred->uid),
+ from_kgid_munged(uns, c->cred->gid));
+
+ return 0;
+}
+
+static const struct seq_operations container_proc_ops = {
+ .start = container_proc_start,
+ .next = container_proc_next,
+ .stop = container_proc_stop,
+ .show = container_proc_show,
+};
+
+static int container_proc_open(struct inode *inode, struct file *file)
+{
+ struct seq_file *m;
+ int ret = seq_open(file, &container_proc_ops);
+
+ if (ret == 0) {
+ m = file->private_data;
+ m->private = current->container;
+ }
+ return ret;
+}
+
+static const struct file_operations container_proc_fops = {
+ .open = container_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
/*
* Allow the user to poll for the container dying.
*/
@@ -230,6 +333,7 @@ static int __init init_container_fs(void)
panic("Cannot mount containerfs: %ld\n",
PTR_ERR(containerfs_mnt));
+ proc_create("containers", 0, NULL, &container_proc_fops);
return 0;
}
Allow a container to be created with an empty mount namespace, as specified
by passing CONTAINER_NEW_EMPTY_FS_NS to container_create(), and allow a
root filesystem to be mounted into the container:
cfd = container_create("foo", CONTAINER_NEW_EMPTY_FS_NS);
fd = fsopen("ext3", cfd, 0);
write(fd, "o foo");
...
fsmount(fd, -1, "/", AT_FSMOUNT_CONTAINER_ROOT, 0);
close(fd);
fd = fsopen("proc", cfd, 0);
fsmount(fd, cfd, "proc", 0, 0);
close(fd);
---
fs/namespace.c | 84 ++++++++++++++++++++++++++++++++++++--------
include/linux/mount.h | 3 +-
include/uapi/linux/fcntl.h | 2 +
kernel/container.c | 6 +++
kernel/fork.c | 5 ++-
security/selinux/hooks.c | 2 +
6 files changed, 85 insertions(+), 17 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 9ca8b9f49f80..a365a7cba3ad 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2458,6 +2458,38 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags,
}
static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);
+static struct mnt_namespace *create_mnt_ns(struct vfsmount *m);
+
+/*
+ * Create a mount namespace for a container and set the root mount in it.
+ */
+static int set_container_root(struct sb_config *sc, struct vfsmount *mnt)
+{
+ struct container *container = sc->container;
+ struct mnt_namespace *mnt_ns;
+ int ret = -EBUSY;
+
+ mnt_ns = create_mnt_ns(mnt);
+ if (IS_ERR(mnt_ns))
+ return PTR_ERR(mnt_ns);
+
+ spin_lock(&container->lock);
+ if (!container->ns->mnt_ns) {
+ container->ns->mnt_ns = mnt_ns;
+ write_seqcount_begin(&container->seq);
+ container->root.mnt = mnt;
+ container->root.dentry = mnt->mnt_root;
+ write_seqcount_end(&container->seq);
+ path_get(&container->root);
+ mnt_ns = NULL;
+ ret = 0;
+ }
+ spin_unlock(&container->lock);
+
+ if (ret < 0)
+ put_mnt_ns(mnt_ns);
+ return ret;
+}
/*
* Create a new mount using a superblock configuration and request it
@@ -2479,8 +2511,12 @@ static int do_new_mount_sc(struct sb_config *sc, struct path *mountpoint,
goto err_mnt;
}
- ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags,
- sc->container ? sc->container->ns->mnt_ns : NULL);
+ if (mnt_flags & MNT_CONTAINER_ROOT)
+ ret = set_container_root(sc, mnt);
+ else
+ ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags,
+ sc->container ? sc->container->ns->mnt_ns : NULL);
+
if (ret < 0) {
errorf("VFS: Failed to add mount");
goto err_mnt;
@@ -3262,10 +3298,17 @@ SYSCALL_DEFINE5(fsmount, int, fs_fd, int, dfd, const char __user *, dir_name,
struct fd f;
unsigned int lookup_flags, mnt_flags = 0;
long ret;
+ char buf[2];
if ((at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
- AT_EMPTY_PATH)) != 0)
+ AT_EMPTY_PATH | AT_FSMOUNT_CONTAINER_ROOT)) != 0)
return -EINVAL;
+ if (at_flags & AT_FSMOUNT_CONTAINER_ROOT) {
+ if (strncpy_from_user(buf, dir_name, 2) < 0)
+ return -EFAULT;
+ if (buf[0] != '/' || buf[1] != '\0')
+ return -EINVAL;
+ }
if (flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME))
@@ -3317,18 +3360,29 @@ SYSCALL_DEFINE5(fsmount, int, fs_fd, int, dfd, const char __user *, dir_name,
if (ret < 0)
goto err_fsfd;
- /* Find the mountpoint. A container can be specified in dfd. */
- lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
- if (at_flags & AT_SYMLINK_NOFOLLOW)
- lookup_flags &= ~LOOKUP_FOLLOW;
- if (at_flags & AT_NO_AUTOMOUNT)
- lookup_flags &= ~LOOKUP_AUTOMOUNT;
- if (at_flags & AT_EMPTY_PATH)
- lookup_flags |= LOOKUP_EMPTY;
- ret = user_path_at(dfd, dir_name, lookup_flags, &mountpoint);
- if (ret < 0) {
- errorf("VFS: Mountpoint lookup failed");
- goto err_fsfd;
+ if (at_flags & AT_FSMOUNT_CONTAINER_ROOT) {
+ /* We're mounting the root of the container that was specified
+ * to sys_fsopen(). The dir_name should be specified as "/"
+ * and dfd is ignored.
+ */
+ mountpoint.mnt = NULL;
+ mountpoint.dentry = NULL;
+ mnt_flags |= MNT_CONTAINER_ROOT;
+ } else {
+ /* Find the mountpoint. A container can be specified in dfd. */
+ lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+
+ if (at_flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+ ret = user_path_at(dfd, dir_name, lookup_flags, &mountpoint);
+ if (ret < 0) {
+ errorf("VFS: Mountpoint lookup failed");
+ goto err_fsfd;
+ }
}
ret = security_sb_mountpoint(sc, &mountpoint);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 265e9aa2ab0b..480c6b4061e0 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -51,7 +51,8 @@ struct sb_config;
#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
-#define MNT_INTERNAL 0x4000
+#define MNT_INTERNAL 0x4000
+#define MNT_CONTAINER_ROOT 0x8000 /* Mounting a container root */
#define MNT_LOCK_ATIME 0x040000
#define MNT_LOCK_NOEXEC 0x080000
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 813afd6eee71..747af8704bbf 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -68,5 +68,7 @@
#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */
#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */
+#define AT_FSMOUNT_CONTAINER_ROOT 0x2000
+
#endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/kernel/container.c b/kernel/container.c
index 5ebbf548f01a..68276603d255 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -23,6 +23,7 @@
#include <linux/printk.h>
#include <linux/security.h>
#include <linux/proc_fs.h>
+#include <linux/mnt_namespace.h>
#include "namespaces.h"
struct container init_container = {
@@ -500,6 +501,11 @@ static struct container *create_container(const char *name, unsigned int flags)
fs->root.mnt = NULL;
fs->root.dentry = NULL;
+ if (flags & CONTAINER_NEW_EMPTY_FS_NS) {
+ put_mnt_ns(ns->mnt_ns);
+ ns->mnt_ns = NULL;
+ }
+
ret = security_container_alloc(c, flags);
if (ret < 0)
goto err_fs;
diff --git a/kernel/fork.c b/kernel/fork.c
index 68cd7367fcd5..e5111d4bcc1c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2169,7 +2169,10 @@ SYSCALL_DEFINE1(fork_into_container, int, containerfd)
if (is_container_file(f.file)) {
struct container *c = f.file->private_data;
- ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, c);
+ if (!c->ns->mnt_ns)
+ ret = -ENOENT;
+ else
+ ret = _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, c);
}
fdput(f);
return ret;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 23bdbb0c2de5..f6b994b15a4d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2975,6 +2975,8 @@ static int selinux_sb_mountpoint(struct sb_config *sc, struct path *mountpoint)
const struct cred *cred = current_cred();
int ret;
+ if (!mountpoint->mnt)
+ return 0; /* This is the root in an empty namespace */
ret = path_has_perm(cred, mountpoint, FILE__MOUNTON);
if (ret < 0)
errorf("SELinux: Mount on mountpoint not permitted");
Some filesystem system calls, such as mkdirat(), take a 'directory fd' to
specify the pathwalk origin. This takes either AT_FDCWD or a file
descriptor that refers to an open directory.
Make it possible to supply a container fd, as obtained from
container_create(), instead, thereby specifying the container's root as the
origin. This performs the filesystem operation into the container's mount
namespace. For example:
int cfd = container_create("fred", CONTAINER_NEW_FS_NS);
mkdirat(cfd, "/fred", 0755);
A better way to do this might be to temporarily override current->fs and
current->nsproxy, but this requires splitting those fields so that procfs
doesn't see the override.
A sequence number and lock are available to protect the root pointer in
case container_chroot() and/or container_pivot_root() are implemented.
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 52 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 38 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 0d35760fee00..2f0310a39e60 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2208,23 +2208,47 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (!f.file)
return ERR_PTR(-EBADF);
- dentry = f.file->f_path.dentry;
+ if (is_container_file(f.file)) {
+ struct container *c = f.file->private_data;
+ unsigned seq;
- if (*s) {
- if (!d_can_lookup(dentry)) {
- fdput(f);
- return ERR_PTR(-ENOTDIR);
+		if (!*s) {
+			fdput(f);
+			return ERR_PTR(-EINVAL);
+		}
+
+ if (flags & LOOKUP_RCU) {
+ rcu_read_lock();
+ do {
+ seq = read_seqcount_begin(&c->seq);
+ nd->path = c->root;
+ nd->inode = nd->path.dentry->d_inode;
+ nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+ } while (read_seqcount_retry(&c->seq, seq));
+ } else {
+ spin_lock(&c->lock);
+ nd->path = c->root;
+ path_get(&nd->path);
+ spin_unlock(&c->lock);
+ nd->inode = nd->path.dentry->d_inode;
}
- }
-
- nd->path = f.file->f_path;
- if (flags & LOOKUP_RCU) {
- rcu_read_lock();
- nd->inode = nd->path.dentry->d_inode;
- nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
} else {
- path_get(&nd->path);
- nd->inode = nd->path.dentry->d_inode;
+ dentry = f.file->f_path.dentry;
+
+ if (*s) {
+ if (!d_can_lookup(dentry)) {
+ fdput(f);
+ return ERR_PTR(-ENOTDIR);
+ }
+ }
+
+ nd->path = f.file->f_path;
+ if (flags & LOOKUP_RCU) {
+ rcu_read_lock();
+ nd->inode = nd->path.dentry->d_inode;
+ nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+ } else {
+ path_get(&nd->path);
+ nd->inode = nd->path.dentry->d_inode;
+ }
}
fdput(f);
return s;
---
samples/containers/test-container.c | 162 +++++++++++++++++++++++++++++++++++
1 file changed, 162 insertions(+)
create mode 100644 samples/containers/test-container.c
diff --git a/samples/containers/test-container.c b/samples/containers/test-container.c
new file mode 100644
index 000000000000..c467b447c63d
--- /dev/null
+++ b/samples/containers/test-container.c
@@ -0,0 +1,162 @@
+/* Container test.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+
+#define PR_ERRMSG_ENABLE 48
+#define PR_ERRMSG_READ 49
+
+#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
+
+static __attribute__((noreturn))
+void display_error(const char *s)
+{
+ char buf[4096];
+ int err, n, perr;
+
+ do {
+ err = errno;
+ errno = 0;
+ n = prctl(PR_ERRMSG_READ, buf, sizeof(buf));
+ perr = errno;
+ errno = err;
+ if (n > 0) {
+ fprintf(stderr, "Error: '%s': %*.*s: %m\n", s, n, n, buf);
+ } else {
+ fprintf(stderr, "%s: %m\n", s);
+ }
+ } while (perr == 0);
+
+ exit(1);
+}
+
+#define E_write(fd, s) \
+ do { \
+ if (write(fd, s, sizeof(s) - 1) == -1) \
+ display_error(s); \
+ } while (0)
+
+#define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
+#define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
+#define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace [priv] */
+#define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
+#define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
+#define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
+#define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
+#define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
+#define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
+#define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
+#define CONTAINER__FLAG_MASK 0x000003ff
+
+#define AT_FSMOUNT_CONTAINER_ROOT 0x2000
+
+static inline int fsopen(const char *fs_name, int containerfd, int flags)
+{
+ return syscall(333, fs_name, containerfd, flags);
+}
+
+static inline int fsmount(int fsfd, int dfd, const char *path,
+ unsigned int at_flags, unsigned int flags)
+{
+ return syscall(334, fsfd, dfd, path, at_flags, flags);
+}
+
+static inline int container_create(const char *name, unsigned int mask)
+{
+ return syscall(335, name, mask, 0, 0, 0);
+}
+
+static inline int fork_into_container(int containerfd)
+{
+ return syscall(336, containerfd);
+}
+
+int main()
+{
+ pid_t pid;
+ int mfd, cfd, ws;
+
+ if (prctl(PR_ERRMSG_ENABLE, 1) < 0) {
+ perror("prctl/en");
+ exit(1);
+ }
+
+ cfd = container_create("foo-test",
+ CONTAINER_NEW_EMPTY_FS_NS |
+ CONTAINER_NEW_UTS_NS |
+ CONTAINER_NEW_IPC_NS |
+ CONTAINER_NEW_USER_NS |
+ CONTAINER_NEW_PID_NS |
+ CONTAINER_KILL_ON_CLOSE |
+ CONTAINER_FD_CLOEXEC);
+ if (cfd == -1) {
+ perror("container_create");
+ exit(1);
+ }
+
+ system("cat /proc/containers");
+
+ /* Open the filesystem that's going to form the container root. */
+ mfd = fsopen("nfs4", cfd, 0);
+ if (mfd == -1) {
+ perror("fsopen/root");
+ exit(1);
+ }
+
+ E_write(mfd, "s foo:/bar");
+ E_write(mfd, "o fsc");
+ E_write(mfd, "o sync");
+ E_write(mfd, "o intr");
+ E_write(mfd, "o vers=4.2");
+ E_write(mfd, "o addr=192.168.1.1");
+ E_write(mfd, "o clientaddr=192.168.1.1");
+ E_write(mfd, "x create");
+
+ /* Mount the container root */
+ if (fsmount(mfd, cfd, "/", AT_FSMOUNT_CONTAINER_ROOT, 0) < 0)
+ display_error("fsmount");
+ E(close(mfd));
+
+ /* Mount procfs within the container */
+ mfd = fsopen("proc", cfd, 0);
+ if (mfd == -1) {
+ perror("fsopen/proc");
+ exit(1);
+ }
+ E_write(mfd, "x create");
+ if (fsmount(mfd, cfd, "proc", 0, 0) < 0)
+ display_error("fsmount");
+ E(close(mfd));
+
+ switch ((pid = fork_into_container(cfd))) {
+ case -1:
+ perror("fork_into_container");
+ exit(1);
+ case 0:
+ close(cfd);
+ setenv("PS1", "container>", 1);
+ execl("/bin/bash", "bash", NULL);
+ perror("execl");
+ exit(1);
+ default:
+ if (waitpid(pid, &ws, 0) < 0) {
+ perror("waitpid");
+ exit(1);
+ }
+ }
+ E(close(cfd));
+ exit(0);
+}
Make it possible for fsopen() to mount into a specified container, using
the namespaces associated with that container to cover UID translation,
networking and filesystem content. This involves modifying the fsopen()
syscall to use the reserved parameter:
int mfd = fsopen(const char *fsname, int containerfd,
int open_flags);
where containerfd can be -1 to use the current process's namespaces (as
before) or a file descriptor created by container_create() to mount into
that container.
For example:
containerfd = container_create("fred", CONTAINER_NEW_FS_NS);
mfd = fsopen("nfs4", containerfd, 0);
write(mfd, "d warthog:/data", ...);
write(mfd, "o fsc", ...);
write(mfd, "o sync", ...);
write(mfd, "o intr", ...);
write(mfd, "o vers=4.2", ...);
write(mfd, "o addr=192.168.1.1", ...);
write(mfd, "o clientaddr=192.168.1.2", ...);
fsmount(mfd, containerfd, "/mnt", AT_SYMLINK_NOFOLLOW, 0);
Any upcalls the mount makes, say to access DNS services, will be made
inside the container.
Signed-off-by: David Howells <[email protected]>
---
fs/fsopen.c | 33 ++++++++++++++++++++++++++-------
fs/libfs.c | 3 ++-
fs/namespace.c | 23 ++++++++++++++++-------
fs/nfs/namespace.c | 2 +-
fs/nfs/nfs4namespace.c | 4 ++--
fs/proc/root.c | 13 ++++++++++---
fs/sb_config.c | 29 ++++++++++++++++++++++-------
include/linux/container.h | 1 +
include/linux/mount.h | 2 +-
include/linux/pid.h | 5 ++++-
include/linux/proc_ns.h | 3 ++-
include/linux/sb_config.h | 5 ++++-
kernel/container.c | 4 ++++
kernel/fork.c | 2 +-
kernel/pid.c | 4 ++--
15 files changed, 98 insertions(+), 35 deletions(-)
diff --git a/fs/fsopen.c b/fs/fsopen.c
index cbede77158ba..65278b7f5a45 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -13,6 +13,8 @@
#include <linux/mount.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
+#include <linux/fs.h>
+#include <linux/container.h>
#include <linux/file.h>
#include <linux/magic.h>
#include <linux/syscalls.h>
@@ -219,30 +221,44 @@ fs_initcall(init_fs_fs);
* opened, thereby indicating which namespaces will be used (notably, which
* network namespace will be used for network filesystems).
*/
-SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved,
+SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, containerfd,
unsigned int, flags)
{
+ struct container *container = NULL;
struct sb_config *sc;
struct file *file;
const char *fs_name;
int fd, ret;
- if (flags & ~O_CLOEXEC || reserved != -1)
+ if (flags & ~O_CLOEXEC)
return -EINVAL;
fs_name = strndup_user(_fs_name, PAGE_SIZE);
if (IS_ERR(fs_name))
return PTR_ERR(fs_name);
- sc = vfs_new_sb_config(fs_name);
+ if (containerfd != -1) {
+ struct fd f = fdget(containerfd);
+
+ ret = -EBADF;
+ if (!f.file)
+ goto err_fs_name;
+ ret = -EINVAL;
+ if (is_container_file(f.file)) {
+ container = get_container(f.file->private_data);
+ ret = 0;
+ }
+ fdput(f);
+ if (ret < 0)
+ goto err_fs_name;
+ }
+
+ sc = vfs_new_sb_config(fs_name, container);
kfree(fs_name);
+ put_container(container);
if (IS_ERR(sc))
return PTR_ERR(sc);
- ret = -ENOTSUPP;
- if (!sc->ops)
- goto err_sc;
-
file = create_fs_file(sc);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
@@ -264,4 +280,7 @@ SYSCALL_DEFINE3(fsopen, const char __user *, _fs_name, int, reserved,
err_sc:
put_sb_config(sc);
return ret;
+err_fs_name:
+ kfree(fs_name);
+ return ret;
}
diff --git a/fs/libfs.c b/fs/libfs.c
index e8787adf0363..d59dae7a9bd0 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -583,7 +583,8 @@ int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *c
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- sc = __vfs_new_sb_config(type, NULL, MS_KERNMOUNT, SB_CONFIG_FOR_NEW);
+ sc = __vfs_new_sb_config(type, NULL, NULL, MS_KERNMOUNT,
+ SB_CONFIG_FOR_NEW);
if (IS_ERR(sc))
return PTR_ERR(sc);
diff --git a/fs/namespace.c b/fs/namespace.c
index 7e2d5fe5728b..9ca8b9f49f80 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -783,9 +783,16 @@ static void put_mountpoint(struct mountpoint *mp)
}
}
+static inline int __check_mnt(struct mount *mnt, struct mnt_namespace *mnt_ns)
+{
+ if (!mnt_ns)
+ mnt_ns = current->nsproxy->mnt_ns;
+ return mnt->mnt_ns == mnt_ns;
+}
+
static inline int check_mnt(struct mount *mnt)
{
- return mnt->mnt_ns == current->nsproxy->mnt_ns;
+ return __check_mnt(mnt, NULL);
}
/*
@@ -2408,7 +2415,8 @@ static int do_move_mount(struct path *path, const char *old_name)
/*
* add a mount into a namespace's mount tree
*/
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags,
+ struct mnt_namespace *mnt_ns)
{
struct mountpoint *mp;
struct mount *parent;
@@ -2422,7 +2430,7 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
parent = real_mount(path->mnt);
err = -EINVAL;
- if (unlikely(!check_mnt(parent))) {
+ if (unlikely(!__check_mnt(parent, mnt_ns))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
goto unlock;
@@ -2471,7 +2479,8 @@ static int do_new_mount_sc(struct sb_config *sc, struct path *mountpoint,
goto err_mnt;
}
- ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ ret = do_add_mount(real_mount(mnt), mountpoint, mnt_flags,
+ sc->container ? sc->container->ns->mnt_ns : NULL);
if (ret < 0) {
errorf("VFS: Failed to add mount");
goto err_mnt;
@@ -2496,7 +2505,7 @@ static int do_new_mount(struct path *mountpoint, const char *fstype, int flags,
if (!fstype)
return -EINVAL;
- sc = vfs_new_sb_config(fstype);
+ sc = vfs_new_sb_config(fstype, NULL);
if (IS_ERR(sc)) {
err = PTR_ERR(sc);
goto err;
@@ -2544,7 +2553,7 @@ int finish_automount(struct vfsmount *m, struct path *path)
goto fail;
}
- err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE, NULL);
if (!err)
return 0;
fail:
@@ -3175,7 +3184,7 @@ struct vfsmount *vfs_kern_mount(struct file_system_type *type,
if (!type)
return ERR_PTR(-EINVAL);
- sc = __vfs_new_sb_config(type, NULL, flags, SB_CONFIG_FOR_NEW);
+ sc = __vfs_new_sb_config(type, NULL, NULL, flags, SB_CONFIG_FOR_NEW);
if (IS_ERR(sc))
return ERR_CAST(sc);
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index e95e669e4db8..2dcb0c3b4cbb 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -239,7 +239,7 @@ struct vfsmount *nfs_do_submount(struct dentry *dentry, struct nfs_fh *fh,
/* Open a new mount context, transferring parameters from the parent
* superblock, including the network namespace.
*/
- sc = __vfs_new_sb_config(&nfs_fs_type, dentry->d_sb, 0,
+ sc = __vfs_new_sb_config(&nfs_fs_type, dentry->d_sb, NULL, 0,
SB_CONFIG_FOR_SUBMOUNT);
if (IS_ERR(sc))
return ERR_CAST(sc);
diff --git a/fs/nfs/nfs4namespace.c b/fs/nfs/nfs4namespace.c
index 60b711aa0618..5e49684faf79 100644
--- a/fs/nfs/nfs4namespace.c
+++ b/fs/nfs/nfs4namespace.c
@@ -346,8 +346,8 @@ static struct vfsmount *nfs_follow_referral(struct dentry *dentry,
if (locations == NULL || locations->nlocations <= 0)
goto out;
-
- sc = __vfs_new_sb_config(&nfs4_fs_type, dentry->d_sb, 0,
+
+ sc = __vfs_new_sb_config(&nfs4_fs_type, dentry->d_sb, NULL, 0,
SB_CONFIG_FOR_SUBMOUNT);
if (IS_ERR(sc)) {
mnt = ERR_CAST(sc);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 9878b62e874c..70e52b060873 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -17,6 +17,7 @@
#include <linux/sched/stat.h>
#include <linux/module.h>
#include <linux/bitops.h>
+#include <linux/container.h>
#include <linux/user_namespace.h>
#include <linux/sb_config.h>
#include <linux/pid_namespace.h>
@@ -171,8 +172,14 @@ static const struct sb_config_operations proc_sb_config_ops = {
static int proc_init_sb_config(struct sb_config *sc, struct super_block *src_sb)
{
struct proc_sb_config *cfg = container_of(sc, struct proc_sb_config, sc);
+ struct pid_namespace *pid_ns;
- cfg->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ if (cfg->sc.container)
+ pid_ns = cfg->sc.container->pid_ns;
+ else
+ pid_ns = task_active_pid_ns(current);
+
+ cfg->pid_ns = get_pid_ns(pid_ns);
cfg->sc.ops = &proc_sb_config_ops;
return 0;
}
@@ -292,14 +299,14 @@ struct proc_dir_entry proc_root = {
.name = "/proc",
};
-int pid_ns_prepare_proc(struct pid_namespace *ns)
+int pid_ns_prepare_proc(struct pid_namespace *ns, struct container *container)
{
struct proc_sb_config *cfg;
struct sb_config *sc;
struct vfsmount *mnt;
int ret;
- sc = __vfs_new_sb_config(&proc_fs_type, NULL, 0, SB_CONFIG_FOR_NEW);
+ sc = __vfs_new_sb_config(&proc_fs_type, NULL, container, 0, SB_CONFIG_FOR_NEW);
if (IS_ERR(sc))
return PTR_ERR(sc);
diff --git a/fs/sb_config.c b/fs/sb_config.c
index 4d9bfb982d41..c1ea2a98bd8d 100644
--- a/fs/sb_config.c
+++ b/fs/sb_config.c
@@ -19,6 +19,7 @@
#include <linux/magic.h>
#include <linux/security.h>
#include <linux/parser.h>
+#include <linux/container.h>
#include <linux/mnt_namespace.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
@@ -108,7 +109,7 @@ static int vfs_parse_ms_mount_option(struct sb_config *sc, char *data)
/**
* vfs_parse_mount_option - Add a single mount option to a superblock config
- * @mc: The superblock configuration to modify
+ * @sc: The superblock configuration to modify
* @p: The option to apply.
*
* A single mount option in string form is applied to the superblock
@@ -148,7 +149,7 @@ EXPORT_SYMBOL(vfs_parse_mount_option);
/**
* generic_monolithic_mount_data - Parse key[=val][,key[=val]]* mount data
- * @mc: The superblock configuration to fill in.
+ * @sc: The superblock configuration to fill in.
* @data: The data to parse
*
* Parse a blob of data that's in key[=val][,key[=val]]* form. This can be
@@ -181,6 +182,7 @@ EXPORT_SYMBOL(generic_monolithic_mount_data);
* __vfs_new_sb_config - Create a superblock config.
* @fs_type: The filesystem type.
* @src_sb: A superblock from which this one derives (or NULL)
+ * @c: The container in which the superblock will be opened (or NULL)
* @ms_flags: Superblock flags and op flags (such as MS_REMOUNT)
* @purpose: The purpose that this configuration shall be used for.
*
@@ -191,6 +193,7 @@ EXPORT_SYMBOL(generic_monolithic_mount_data);
*/
struct sb_config *__vfs_new_sb_config(struct file_system_type *fs_type,
struct super_block *src_sb,
+ struct container *c,
unsigned int ms_flags,
enum sb_config_purpose purpose)
{
@@ -210,10 +213,17 @@ struct sb_config *__vfs_new_sb_config(struct file_system_type *fs_type,
sc->purpose = purpose;
sc->ms_flags = ms_flags;
sc->fs_type = get_filesystem(fs_type);
- sc->net_ns = get_net(current->nsproxy->net_ns);
- sc->user_ns = get_user_ns(current_user_ns());
sc->cred = get_current_cred();
+ if (!c) {
+ sc->net_ns = get_net(current->nsproxy->net_ns);
+ sc->user_ns = get_user_ns(current_user_ns());
+ } else {
+ sc->container = get_container(c);
+ sc->net_ns = get_net(c->ns->net_ns);
+ sc->user_ns = get_user_ns(c->cred->user_ns);
+ }
+
/* TODO: Make all filesystems support this unconditionally */
if (sc->fs_type->init_sb_config) {
ret = sc->fs_type->init_sb_config(sc, src_sb);
@@ -241,6 +251,7 @@ EXPORT_SYMBOL(__vfs_new_sb_config);
/**
* vfs_new_sb_config - Create a superblock config for a new mount.
* @fs_name: The name of the filesystem
+ * @container: The container to create in (or NULL)
*
* Open a filesystem and create a superblock config context for a new mount
* that will hold the mount options, device name, security details, etc.. Note
@@ -248,7 +259,8 @@ EXPORT_SYMBOL(__vfs_new_sb_config);
* determine whether the filesystem actually supports the superblock context
* itself.
*/
-struct sb_config *vfs_new_sb_config(const char *fs_name)
+struct sb_config *vfs_new_sb_config(const char *fs_name,
+ struct container *c)
{
struct file_system_type *fs_type;
struct sb_config *sc;
@@ -257,7 +269,7 @@ struct sb_config *vfs_new_sb_config(const char *fs_name)
if (!fs_type)
return ERR_PTR(-ENODEV);
- sc = __vfs_new_sb_config(fs_type, NULL, 0, SB_CONFIG_FOR_NEW);
+ sc = __vfs_new_sb_config(fs_type, NULL, c, 0, SB_CONFIG_FOR_NEW);
put_filesystem(fs_type);
return sc;
}
@@ -275,7 +287,7 @@ struct sb_config *vfs_sb_reconfig(struct vfsmount *mnt,
unsigned int ms_flags)
{
return __vfs_new_sb_config(mnt->mnt_sb->s_type, mnt->mnt_sb,
- ms_flags, SB_CONFIG_FOR_REMOUNT);
+ NULL, ms_flags, SB_CONFIG_FOR_REMOUNT);
}
/**
@@ -302,6 +314,8 @@ struct sb_config *vfs_dup_sb_config(struct sb_config *src_sc)
sc->device = NULL;
sc->security = NULL;
get_filesystem(sc->fs_type);
+ if (sc->container)
+ get_container(sc->container);
get_net(sc->net_ns);
get_user_ns(sc->user_ns);
get_cred(sc->cred);
@@ -347,6 +361,7 @@ void put_sb_config(struct sb_config *sc)
if (sc->cred)
put_cred(sc->cred);
kfree(sc->subtype);
+ put_container(sc->container);
put_filesystem(sc->fs_type);
kfree(sc->device);
kfree(sc);
diff --git a/include/linux/container.h b/include/linux/container.h
index 084ea9982fe6..073674fab160 100644
--- a/include/linux/container.h
+++ b/include/linux/container.h
@@ -36,6 +36,7 @@ struct container {
struct path root; /* The root of the container's fs namespace */
struct task_struct *init; /* The 'init' task for this container */
struct container *parent; /* Parent of this container. */
+ struct pid_namespace *pid_ns; /* The process ID namespace for this container */
void *security; /* LSM data */
struct list_head members; /* Member processes, guarded with ->lock */
struct list_head child_link; /* Link in parent->children */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index a5dca6abc4d5..265e9aa2ab0b 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -70,7 +70,7 @@ struct vfsmount {
int mnt_flags;
};
-struct file; /* forward dec */
+ struct file; /* forward dec */
struct path;
extern int mnt_want_write(struct vfsmount *mnt);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 4d179316e431..ac429dea2f84 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -79,6 +79,8 @@ static inline struct pid *get_pid(struct pid *pid)
return pid;
}
+struct container;
+
extern void put_pid(struct pid *pid);
extern struct task_struct *pid_task(struct pid *pid, enum pid_type);
extern struct task_struct *get_pid_task(struct pid *pid, enum pid_type);
@@ -117,7 +119,8 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns,
+ struct container *container);
extern void free_pid(struct pid *pid);
extern void disable_pid_allocation(struct pid_namespace *ns);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 58ab28d81fc2..52f0b2db5dda 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -46,7 +46,8 @@ enum {
#ifdef CONFIG_PROC_FS
-extern int pid_ns_prepare_proc(struct pid_namespace *ns);
+extern int pid_ns_prepare_proc(struct pid_namespace *ns,
+ struct container *container);
extern void pid_ns_release_proc(struct pid_namespace *ns);
extern int proc_alloc_inum(unsigned int *pino);
extern void proc_free_inum(unsigned int inum);
diff --git a/include/linux/sb_config.h b/include/linux/sb_config.h
index 144258d82fa1..8bc7ac70b11a 100644
--- a/include/linux/sb_config.h
+++ b/include/linux/sb_config.h
@@ -46,6 +46,7 @@ enum sb_config_purpose {
struct sb_config {
const struct sb_config_operations *ops;
struct file_system_type *fs_type;
+ struct container *container; /* The container in which the mount will exist */
struct dentry *root; /* The root and superblock */
struct user_namespace *user_ns; /* The user namespace for this mount */
struct net *net_ns; /* The network namespace for this mount */
@@ -69,9 +70,11 @@ struct sb_config_operations {
int (*get_tree)(struct sb_config *sc);
};
-extern struct sb_config *vfs_new_sb_config(const char *fs_name);
+extern struct sb_config *vfs_new_sb_config(const char *fs_name,
+ struct container *c);
extern struct sb_config *__vfs_new_sb_config(struct file_system_type *fs_type,
struct super_block *src_sb,
+ struct container *c,
unsigned int ms_flags,
enum sb_config_purpose purpose);
extern struct sb_config *vfs_sb_reconfig(struct vfsmount *mnt,
diff --git a/kernel/container.c b/kernel/container.c
index d5849c07a76b..5ebbf548f01a 100644
--- a/kernel/container.c
+++ b/kernel/container.c
@@ -31,6 +31,7 @@ struct container init_container = {
.cred = &init_cred,
.ns = &init_nsproxy,
.init = &init_task,
+ .pid_ns = &init_pid_ns,
.members.next = &init_task.container_link,
.members.prev = &init_task.container_link,
.children = LIST_HEAD_INIT(init_container.children),
@@ -52,6 +53,8 @@ void put_container(struct container *c)
while (c && refcount_dec_and_test(&c->usage)) {
BUG_ON(!list_empty(&c->members));
+ if (c->pid_ns)
+ put_pid_ns(c->pid_ns);
if (c->ns)
put_nsproxy(c->ns);
path_put(&c->root);
@@ -491,6 +494,7 @@ static struct container *create_container(const char *name, unsigned int flags)
}
c->ns = ns;
+ c->pid_ns = get_pid_ns(c->ns->pid_ns_for_children);
c->root = fs->root;
c->seq = fs->seq;
fs->root.mnt = NULL;
diff --git a/kernel/fork.c b/kernel/fork.c
index d185c13820d7..68cd7367fcd5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1764,7 +1764,7 @@ static __latent_entropy struct task_struct *copy_process(
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+ pid = alloc_pid(p->nsproxy->pid_ns_for_children, container);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index fd1cde1e4576..adc65cdc2613 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -293,7 +293,7 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, struct container *container)
{
struct pid *pid;
enum pid_type type;
@@ -321,7 +321,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}
if (unlikely(is_child_reaper(pid))) {
- if (pid_ns_prepare_proc(ns)) {
+ if (pid_ns_prepare_proc(ns, container)) {
disable_pid_allocation(ns);
goto out_free;
}
Provide a system call to open a socket inside a container, using that
container's network namespace. This allows netlink to be used to manage
the container.
fd = container_socket(int container_fd,
int domain, int type, int protocol);
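As a rough userspace sketch (assuming syscall number 337 from the x86_64
table below and a container fd previously obtained from container_create();
there is no libc wrapper, so the wrapper here is illustrative):

#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

static inline int container_socket(int cfd, int domain, int type,
                                   int protocol)
{
        /* 337 = container_socket in syscall_64.tbl below */
        return syscall(337, cfd, domain, type, protocol);
}

/* The rtnetlink socket created here belongs to the container's network
 * namespace, so rtnetlink messages sent on it configure the container's
 * interfaces from outside.
 */
static int open_container_rtnetlink(int cfd)
{
        return container_socket(cfd, AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
}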
Signed-off-by: David Howells <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 ++
kernel/sys_ni.c | 1 +
net/socket.c | 37 +++++++++++++++++++++++++++++---
5 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0d5a9875ead2..04a2f6b4799b 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -395,3 +395,4 @@
386 i386 fsmount sys_fsmount
387 i386 container_create sys_container_create
388 i386 fork_into_container sys_fork_into_container
+389 i386 container_socket sys_container_socket
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index e4005cc579b6..825c05462245 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
334 common fsmount sys_fsmount
335 common container_create sys_container_create
336 common fork_into_container sys_fork_into_container
+337 common container_socket sys_container_socket
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7ca6c287ce84..af4c0bbd2f10 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -912,5 +912,7 @@ asmlinkage long sys_container_create(const char __user *name, unsigned int flags
unsigned long spare3, unsigned long spare4,
unsigned long spare5);
asmlinkage long sys_fork_into_container(int containerfd);
+asmlinkage long sys_container_socket(int containerfd,
+ int domain, int type, int protocol);
#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index b685ffe3591f..1f2fe4720df5 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -266,3 +266,4 @@ cond_syscall(sys_fsmount);
/* Containers */
cond_syscall(sys_container_create);
cond_syscall(sys_fork_into_container);
+cond_syscall(sys_container_socket);
diff --git a/net/socket.c b/net/socket.c
index c2564eb25c6b..69f0f72995fc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -89,6 +89,7 @@
#include <linux/magic.h>
#include <linux/slab.h>
#include <linux/xattr.h>
+#include <linux/container.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>
@@ -1255,9 +1256,9 @@ int sock_create_kern(struct net *net, int family, int type, int protocol, struct
}
EXPORT_SYMBOL(sock_create_kern);
-SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
+static long __sys_socket(struct net *net, int family, int type, int protocol)
{
- int retval;
+ long retval;
struct socket *sock;
int flags;
@@ -1275,7 +1276,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
- retval = sock_create(family, type, protocol, &sock);
+ retval = __sock_create(net, family, type, protocol, &sock, 0);
if (retval < 0)
goto out;
@@ -1292,6 +1293,36 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
return retval;
}
+SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
+{
+ return __sys_socket(current->nsproxy->net_ns, family, type, protocol);
+}
+
+/*
+ * Create a socket inside a container.
+ */
+SYSCALL_DEFINE4(container_socket,
+ int, containerfd, int, family, int, type, int, protocol)
+{
+#ifdef CONFIG_CONTAINERS
+ struct fd f = fdget(containerfd);
+ long ret;
+
+ if (!f.file)
+ return -EBADF;
+ ret = -EINVAL;
+ if (is_container_file(f.file)) {
+ struct container *c = f.file->private_data;
+
+ ret = __sys_socket(c->ns->net_ns, family, type, protocol);
+ }
+ fdput(f);
+ return ret;
+#else
+ return -ENOSYS;
+#endif
+}
+
/*
* Create a pair of connected sockets.
*/
A container is then a kernel object that contains the following things:
(1) Namespaces.
(2) A root directory.
(3) A set of processes, including one designated as the 'init' process.
A container is created and attached to a file descriptor by:
int cfd = container_create(const char *name, unsigned int flags);
This inherits all the namespaces of the parent container unless the flags
mask calls for new ones to be created:
CONTAINER_NEW_FS_NS
CONTAINER_NEW_EMPTY_FS_NS
CONTAINER_NEW_CGROUP_NS [root only]
CONTAINER_NEW_UTS_NS
CONTAINER_NEW_IPC_NS
CONTAINER_NEW_USER_NS
CONTAINER_NEW_PID_NS
CONTAINER_NEW_NET_NS
Other flags include:
CONTAINER_KILL_ON_CLOSE
CONTAINER_FD_CLOEXEC
Note that I've added a pointer to the current container to task_struct.
This doesn't make the nsproxy pointer redundant as you can still make new
namespaces with clone().
I've also added a list_head to task_struct to form a list in the container
of its member processes. This is convenient, but redundant since the code
could iterate over all the tasks looking for ones that have a matching
task->container.
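For instance, a minimal sketch of walking a container's members, using the
->members/->lock fields from the patch below in the same way that
exit_container() does (sketch only, not part of the patches):

/* Iterate a container's member processes; ->members is guarded by
 * ->lock, as in exit_container() in the patch below.
 */
static void for_each_container_task(struct container *c,
                                    void (*fn)(struct task_struct *))
{
        struct task_struct *p;

        spin_lock(&c->lock);
        list_for_each_entry(p, &c->members, container_link)
                fn(p);
        spin_unlock(&c->lock);
}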
==================
FUTURE DEVELOPMENT
==================
(1) Setting up the container.
It should then be possible for the supervising process to modify the
new container by:
container_mount(int cfd,
const char *source,
const char *target, /* NULL -> root */
const char *filesystemtype,
unsigned long mountflags,
const void *data);
container_chroot(int cfd, const char *path);
container_bind_mount_across(int cfd,
const char *source,
const char *target); /* NULL -> root */
mkdirat(int cfd, const char *path, mode_t mode);
mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
int fd = openat(int cfd, const char *path,
unsigned int flags, mode_t mode);
int fd = container_socket(int cfd, int domain, int type,
int protocol);
Opening a netlink socket inside the container should allow management
of the container's network namespace.
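To make the intended flow concrete, a purely hypothetical sketch of
populating a container created with CONTAINER_NEW_EMPTY_FS_NS; none of the
container_* setup calls above exist yet, and the device path is made up:

#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Hypothetical wrapper; container_mount() is only proposed above and
 * has no syscall number yet.
 */
extern int container_mount(int cfd, const char *source, const char *target,
                           const char *filesystemtype,
                           unsigned long mountflags, const void *data);

/* cfd comes from container_create(); paths given to the *at() calls
 * with a container fd are relative to the container's root.
 */
static void setup_container_fs(int cfd)
{
        container_mount(cfd, "/dev/vg0/ct-root", NULL, "ext4", 0, NULL);
        mkdirat(cfd, "dev", 0755);
        mknodat(cfd, "dev/null", S_IFCHR | 0666, makedev(1, 3));
}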
(2) Starting the container.
Once all modifications are complete, the container's 'init' process
can be started by:
fork_into_container(int cfd);
This precludes further external modification of the mount tree within
the container. Before this point, the container is simply destroyed
if the container fd is closed.
(3) Waiting for the container to complete.
The container fd can then be polled to wait for the init process
therein to complete, and the exit code can be collected by:
container_wait(int container_fd, int *_wstatus, unsigned int wait,
struct rusage *rusage);
The container and everything in it can be terminated or killed off:
container_kill(int container_fd, int initonly, int signal);
If 'init' dies, all other processes in the container are preemptively
SIGKILL'd by the kernel.
By default, if the container is active and its fd is closed, the
container is left running and will be cleaned up when its 'init' exits.
The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
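Polling works already in this series: containerfs_poll() (in the
container_create patch below) raises POLLHUP once 'init' has died. A
minimal supervisor sketch (container_wait() itself is still only
proposed):

#include <poll.h>

/* Block until the container's 'init' exits; the container fd signals
 * POLLHUP when CONTAINER_FLAG_DEAD is set.
 */
static int wait_for_container_death(int cfd)
{
        struct pollfd pfd = { .fd = cfd, .events = POLLHUP };

        return poll(&pfd, 1, -1);
}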
(4) Supervising the container.
Given that we have an fd attached to the container, we could make it
such that the supervising process could monitor and override EPERM
returns for mount and other privileged operations within the
container.
(5) Device restriction.
Containers could come with a list of device IDs that the container is
allowed to open. Perhaps a list of major numbers, each with a bitmap of
permitted minor numbers.
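One hypothetical shape for such a whitelist (nothing like this exists in
the patches; all names here are illustrative only):

/* Hypothetical per-container device whitelist entry: one device major,
 * with a bitmap of the minors the container may open.
 */
struct container_dev_range {
        struct list_head link;          /* in a container's whitelist */
        unsigned int major;
        unsigned long *minor_map;       /* bitmap of permitted minors */
};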
(6) Per-container keyring.
Each container could be given a per-container keyring to hold
integrity keys and filesystem keys. This keyring would be modifiable
only by the container's 'root' user and the supervisor process:
container_add_key(const char *type, const char *description,
const void *payload, size_t plen,
int container_fd);
The keys on the keyring would, however, be accessible/usable by all
processes within the container.
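A purely hypothetical sketch of the proposed call (container_add_key() is
not implemented; the key type and description are illustrative choices):

#include <stddef.h>

/* Hypothetical wrapper; signature from the proposal above. */
extern long container_add_key(const char *type, const char *description,
                              const void *payload, size_t plen,
                              int container_fd);

/* Provision a filesystem encryption key before the container's 'init'
 * is started; "logon"/"fscrypt:..." is just an illustrative choice.
 */
static long provision_fs_key(int cfd, const void *payload, size_t plen)
{
        return container_add_key("logon", "fscrypt:0123456789abcdef",
                                 payload, plen, cfd);
}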
===============
EXAMPLE PROGRAM
===============
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
#define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
#define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace [priv] */
#define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
#define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
#define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
#define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
#define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
#define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
#define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
#define CONTAINER__FLAG_MASK 0x000003ff
static inline int container_create(const char *name, unsigned int mask)
{
        /* 335 = container_create in syscall_64.tbl in this series */
        return syscall(335, name, mask, 0, 0, 0);
}

static inline int fork_into_container(int containerfd)
{
        /* 336 = fork_into_container in syscall_64.tbl in this series */
        return syscall(336, containerfd);
}
int main()
{
        pid_t pid;
        int fd, ws;

        fd = container_create("foo-test",
                              CONTAINER__FLAG_MASK & ~(
                                      CONTAINER_NEW_EMPTY_FS_NS |
                                      CONTAINER_NEW_CGROUP_NS));
        if (fd == -1) {
                perror("container_create");
                exit(1);
        }

        system("cat /proc/containers");

        switch ((pid = fork_into_container(fd))) {
        case -1:
                perror("fork_into_container");
                exit(1);
        case 0:
                close(fd);
                setenv("PS1", "container>", 1);
                execl("/bin/bash", "bash", NULL);
                perror("execl");
                exit(1);
        default:
                if (waitpid(pid, &ws, 0) < 0) {
                        perror("waitpid");
                        exit(1);
                }
        }

        close(fd);
        exit(0);
}
Signed-off-by: David Howells <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 5
include/linux/container.h | 85 ++++++
include/linux/init_task.h | 4
include/linux/lsm_hooks.h | 21 +
include/linux/sched.h | 3
include/linux/security.h | 15 +
include/linux/syscalls.h | 3
include/uapi/linux/container.h | 28 ++
include/uapi/linux/magic.h | 1
init/Kconfig | 7
kernel/Makefile | 2
kernel/container.c | 462 ++++++++++++++++++++++++++++++++
kernel/exit.c | 1
kernel/fork.c | 7
kernel/namespaces.h | 15 +
kernel/nsproxy.c | 23 +-
kernel/sys_ni.c | 4
security/security.c | 13 +
20 files changed, 688 insertions(+), 13 deletions(-)
create mode 100644 include/linux/container.h
create mode 100644 include/uapi/linux/container.h
create mode 100644 kernel/container.c
create mode 100644 kernel/namespaces.h
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index abe6ea95e0e6..9ccd0f52f874 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -393,3 +393,4 @@
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
385 i386 fsopen sys_fsopen
386 i386 fsmount sys_fsmount
+387 i386 container_create sys_container_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0977c5079831..dab92591511e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -341,6 +341,7 @@
332 common statx sys_statx
333 common fsopen sys_fsopen
334 common fsmount sys_fsmount
+335 common container_create sys_container_create
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index 4e9ad16db79c..7e2d5fe5728b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -28,6 +28,7 @@
#include <linux/file.h>
#include <linux/sched/task.h>
#include <linux/sb_config.h>
+#include <linux/container.h>
#include "pnode.h"
#include "internal.h"
@@ -3510,6 +3511,10 @@ static void __init init_mount_tree(void)
set_fs_pwd(current->fs, &root);
set_fs_root(current->fs, &root);
+#ifdef CONFIG_CONTAINERS
+ path_get(&root);
+ init_container.root = root;
+#endif
}
void __init mnt_init(void)
diff --git a/include/linux/container.h b/include/linux/container.h
new file mode 100644
index 000000000000..084ea9982fe6
--- /dev/null
+++ b/include/linux/container.h
@@ -0,0 +1,85 @@
+/* Container objects
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CONTAINER_H
+#define _LINUX_CONTAINER_H
+
+#include <uapi/linux/container.h>
+#include <linux/refcount.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/path.h>
+#include <linux/seqlock.h>
+
+struct fs_struct;
+struct nsproxy;
+struct task_struct;
+
+/*
+ * The container object.
+ */
+struct container {
+ char name[24];
+ refcount_t usage;
+ int exit_code; /* The exit code of 'init' */
+ const struct cred *cred; /* Creds for this container, including userns */
+ struct nsproxy *ns; /* This container's namespaces */
+ struct path root; /* The root of the container's fs namespace */
+ struct task_struct *init; /* The 'init' task for this container */
+ struct container *parent; /* Parent of this container. */
+ void *security; /* LSM data */
+ struct list_head members; /* Member processes, guarded with ->lock */
+ struct list_head child_link; /* Link in parent->children */
+ struct list_head children; /* Child containers */
+ wait_queue_head_t waitq; /* Someone waiting for init to exit waits here */
+ unsigned long flags;
+#define CONTAINER_FLAG_INIT_STARTED 0 /* Init is started - certain ops now prohibited */
+#define CONTAINER_FLAG_DEAD 1 /* Init has died */
+#define CONTAINER_FLAG_KILL_ON_CLOSE 2 /* Kill init if container handle closed */
+ spinlock_t lock;
+ seqcount_t seq; /* Track changes in ->root */
+};
+
+extern struct container init_container;
+
+#ifdef CONFIG_CONTAINERS
+extern const struct file_operations containerfs_fops;
+
+extern int copy_container(unsigned long flags, struct task_struct *tsk,
+ struct container *container);
+extern void exit_container(struct task_struct *tsk);
+extern void put_container(struct container *c);
+
+static inline struct container *get_container(struct container *c)
+{
+ refcount_inc(&c->usage);
+ return c;
+}
+
+static inline bool is_container_file(struct file *file)
+{
+ return file->f_op == &containerfs_fops;
+}
+
+#else
+
+static inline int copy_container(unsigned long flags, struct task_struct *tsk,
+ struct container *container)
+{ return 0; }
+static inline void exit_container(struct task_struct *tsk) { }
+static inline void put_container(struct container *c) {}
+static inline struct container *get_container(struct container *c) { return NULL; }
+static inline bool is_container_file(struct file *file) { return false; }
+
+#endif /* CONFIG_CONTAINERS */
+
+#endif /* _LINUX_CONTAINER_H */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e049526bc188..488385ad79db 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -9,6 +9,7 @@
#include <linux/ipc.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/container.h>
#include <linux/securebits.h>
#include <linux/seqlock.h>
#include <linux/rbtree.h>
@@ -273,6 +274,9 @@ extern struct cred init_cred;
.signal = &init_signals, \
.sighand = &init_sighand, \
.nsproxy = &init_nsproxy, \
+ .container = &init_container, \
+ .container_link.next = &init_container.members, \
+ .container_link.prev = &init_container.members, \
.pending = { \
.list = LIST_HEAD_INIT(tsk.pending.list), \
.signal = {{0}}}, \
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7064c0c15386..7b0d484a6a25 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1368,6 +1368,17 @@
* @inode we wish to get the security context of.
* @ctx is a pointer in which to place the allocated security context.
* @ctxlen points to the place to put the length of @ctx.
+ *
+ * Security hooks for containers:
+ *
+ * @container_alloc:
+ * Permit creation of a new container and assign security data.
+ * @container: The new container.
+ *
+ * @container_free:
+ * Free security data attached to a container.
+ * @container: The container.
+ *
* This is the main security structure.
*/
@@ -1699,6 +1710,12 @@ union security_list_options {
struct audit_context *actx);
void (*audit_rule_free)(void *lsmrule);
#endif /* CONFIG_AUDIT */
+
+ /* Container management security hooks */
+#ifdef CONFIG_CONTAINERS
+ int (*container_alloc)(struct container *container, unsigned int flags);
+ void (*container_free)(struct container *container);
+#endif
};
struct security_hook_heads {
@@ -1919,6 +1936,10 @@ struct security_hook_heads {
struct list_head audit_rule_match;
struct list_head audit_rule_free;
#endif /* CONFIG_AUDIT */
+#ifdef CONFIG_CONTAINERS
+ struct list_head container_alloc;
+ struct list_head container_free;
+#endif /* CONFIG_CONTAINERS */
};
/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eba196521562..d9b92a98f99f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -33,6 +33,7 @@ struct backing_dev_info;
struct bio_list;
struct blk_plug;
struct cfs_rq;
+struct container;
struct fs_struct;
struct futex_pi_state;
struct io_context;
@@ -741,6 +742,8 @@ struct task_struct {
/* Namespaces: */
struct nsproxy *nsproxy;
+ struct container *container;
+ struct list_head container_link;
/* Signal handlers: */
struct signal_struct *signal;
diff --git a/include/linux/security.h b/include/linux/security.h
index 8c06e158c195..01bdf7637ec6 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -68,6 +68,7 @@ struct ctl_table;
struct audit_krule;
struct user_namespace;
struct timezone;
+struct container;
/* These functions are in security/commoncap.c */
extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
@@ -1672,6 +1673,20 @@ static inline void security_audit_rule_free(void *lsmrule)
#endif /* CONFIG_SECURITY */
#endif /* CONFIG_AUDIT */
+#ifdef CONFIG_CONTAINERS
+#ifdef CONFIG_SECURITY
+int security_container_alloc(struct container *container, unsigned int flags);
+void security_container_free(struct container *container);
+#else
+static inline int security_container_alloc(struct container *container,
+ unsigned int flags)
+{
+ return 0;
+}
+static inline void security_container_free(struct container *container) {}
+#endif
+#endif /* CONFIG_CONTAINERS */
+
#ifdef CONFIG_SECURITYFS
extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 07e4f775f24d..5a0324dd024c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -908,5 +908,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags);
asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
unsigned int flags);
+asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
+ unsigned long spare3, unsigned long spare4,
+ unsigned long spare5);
#endif
diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
new file mode 100644
index 000000000000..43748099b28d
--- /dev/null
+++ b/include/uapi/linux/container.h
@@ -0,0 +1,28 @@
+/* Container UAPI
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_CONTAINER_H
+#define _UAPI_LINUX_CONTAINER_H
+
+
+#define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
+#define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
+#define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace */
+#define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
+#define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
+#define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
+#define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
+#define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
+#define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
+#define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
+#define CONTAINER__FLAG_MASK 0x000003ff
+
+#endif /* _UAPI_LINUX_CONTAINER_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 88ae83492f7c..758705412b44 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -85,5 +85,6 @@
#define BALLOON_KVM_MAGIC 0x13661366
#define ZSMALLOC_MAGIC 0x58295829
#define FS_FS_MAGIC 0x66736673
+#define CONTAINERFS_MAGIC 0x636f6e74
#endif /* __LINUX_MAGIC_H__ */
diff --git a/init/Kconfig b/init/Kconfig
index 1d3475fc9496..3a0ee88df6c8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1288,6 +1288,13 @@ config NET_NS
Allow user space to create what appear to be multiple instances
of the network stack.
+config CONTAINERS
+ bool "Container support"
+ default y
+ help
+ Allow userspace to create and manipulate containers as objects that
+ have namespaces and hold a set of processes.
+
endif # NAMESPACES
config SCHED_AUTOGROUP
diff --git a/kernel/Makefile b/kernel/Makefile
index 72aa080f91f0..117479b05fb1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o task_work.o \
extable.o params.o \
- kthread.o sys_ni.o nsproxy.o \
+ kthread.o sys_ni.o nsproxy.o container.o \
notifier.o ksysfs.o cred.o reboot.o \
async.o range.o smpboot.o ucount.o
diff --git a/kernel/container.c b/kernel/container.c
new file mode 100644
index 000000000000..eef1566835eb
--- /dev/null
+++ b/kernel/container.c
@@ -0,0 +1,462 @@
+/* Implement container objects.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/init_task.h>
+#include <linux/fs.h>
+#include <linux/fs_struct.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/container.h>
+#include <linux/magic.h>
+#include <linux/syscalls.h>
+#include <linux/printk.h>
+#include <linux/security.h>
+#include "namespaces.h"
+
+struct container init_container = {
+ .name = ".init",
+ .usage = REFCOUNT_INIT(2),
+ .cred = &init_cred,
+ .ns = &init_nsproxy,
+ .init = &init_task,
+ .members.next = &init_task.container_link,
+ .members.prev = &init_task.container_link,
+ .children = LIST_HEAD_INIT(init_container.children),
+ .flags = (1 << CONTAINER_FLAG_INIT_STARTED),
+ .lock = __SPIN_LOCK_UNLOCKED(init_container.lock),
+ .seq = SEQCNT_ZERO(init_fs.seq),
+};
+
+#ifdef CONFIG_CONTAINERS
+
+static struct vfsmount *containerfs_mnt __read_mostly;
+
+/*
+ * Drop a ref on a container and clear it if no longer in use.
+ */
+void put_container(struct container *c)
+{
+ struct container *parent;
+
+ while (c && refcount_dec_and_test(&c->usage)) {
+ BUG_ON(!list_empty(&c->members));
+ if (c->ns)
+ put_nsproxy(c->ns);
+ path_put(&c->root);
+
+ parent = c->parent;
+ if (parent) {
+ spin_lock(&parent->lock);
+ list_del(&c->child_link);
+ spin_unlock(&parent->lock);
+ }
+
+ if (c->cred)
+ put_cred(c->cred);
+ security_container_free(c);
+ kfree(c);
+ c = parent;
+ }
+}
+
+/*
+ * Allow the user to poll for the container dying.
+ */
+static unsigned int containerfs_poll(struct file *file, poll_table *wait)
+{
+ struct container *container = file->private_data;
+ unsigned int mask = 0;
+
+ poll_wait(file, &container->waitq, wait);
+
+ if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
+ mask |= POLLHUP;
+
+ return mask;
+}
+
+static int containerfs_release(struct inode *inode, struct file *file)
+{
+ struct container *container = file->private_data;
+
+ put_container(container);
+ return 0;
+}
+
+const struct file_operations containerfs_fops = {
+ .poll = containerfs_poll,
+ .release = containerfs_release,
+};
+
+/*
+ * Indicate the name we want to display the container file as.
+ */
+static char *containerfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+ return dynamic_dname(dentry, buffer, buflen, "container:[%lu]",
+ d_inode(dentry)->i_ino);
+}
+
+static const struct dentry_operations containerfs_dentry_operations = {
+ .d_dname = containerfs_dname,
+};
+
+/*
+ * Allocate a container.
+ */
+static struct container *alloc_container(const char __user *name)
+{
+ struct container *c;
+ long len;
+ int ret;
+
+ c = kzalloc(sizeof(struct container), GFP_KERNEL);
+ if (!c)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&c->members);
+ INIT_LIST_HEAD(&c->children);
+ init_waitqueue_head(&c->waitq);
+ spin_lock_init(&c->lock);
+ refcount_set(&c->usage, 1);
+
+ ret = -EFAULT;
+ len = strncpy_from_user(c->name, name, sizeof(c->name));
+ if (len < 0)
+ goto err;
+ ret = -ENAMETOOLONG;
+ if (len >= sizeof(c->name))
+ goto err;
+ ret = -EINVAL;
+ if (strchr(c->name, '/'))
+ goto err;
+
+ c->name[len] = 0;
+ return c;
+
+err:
+ kfree(c);
+ return ERR_PTR(ret);
+}
+
+/*
+ * Create a supervisory file for a new container
+ */
+static struct file *create_container_file(struct container *c)
+{
+ struct inode *inode;
+ struct file *f;
+ struct path path;
+ int ret;
+
+ inode = alloc_anon_inode(containerfs_mnt->mnt_sb);
+ if (!inode)
+ return ERR_PTR(-ENFILE);
+ inode->i_fop = &containerfs_fops;
+
+ ret = -ENOMEM;
+ path.dentry = d_alloc_pseudo(containerfs_mnt->mnt_sb, &empty_name);
+ if (!path.dentry)
+ goto err_inode;
+ path.mnt = mntget(containerfs_mnt);
+
+ d_instantiate(path.dentry, inode);
+
+ f = alloc_file(&path, 0, &containerfs_fops);
+ if (IS_ERR(f)) {
+ ret = PTR_ERR(f);
+ goto err_file;
+ }
+
+ f->private_data = c;
+ return f;
+
+err_file:
+ path_put(&path);
+ return ERR_PTR(ret);
+
+err_inode:
+ iput(inode);
+ return ERR_PTR(ret);
+}
+
+static const struct super_operations containerfs_ops = {
+ .drop_inode = generic_delete_inode,
+ .destroy_inode = free_inode_nonrcu,
+ .statfs = simple_statfs,
+};
+
+/*
+ * containerfs should _never_ be mounted by userland - too much of security
+ * hassle, no real gain from having the whole whorehouse mounted. So we don't
+ * need any operations on the root directory. However, we need a non-trivial
+ * d_name - container: will go nicely and kill the special-casing in procfs.
+ */
+static struct dentry *containerfs_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name,
+ void *data)
+{
+ return mount_pseudo(fs_type, "container:", &containerfs_ops,
+ &containerfs_dentry_operations, CONTAINERFS_MAGIC);
+}
+
+static struct file_system_type container_fs_type = {
+ .name = "containerfs",
+ .mount = containerfs_mount,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init init_container_fs(void)
+{
+ int ret;
+
+ ret = register_filesystem(&container_fs_type);
+ if (ret < 0)
+ panic("Cannot register containerfs\n");
+
+ containerfs_mnt = kern_mount(&container_fs_type);
+ if (IS_ERR(containerfs_mnt))
+ panic("Cannot mount containerfs: %ld\n",
+ PTR_ERR(containerfs_mnt));
+
+ return 0;
+}
+
+fs_initcall(init_container_fs);
+
+/*
+ * Handle fork/clone.
+ *
+ * A process inherits its parent's container. The first process into the
+ * container is its 'init' process and the life of everything else in there is
+ * dependent upon that.
+ */
+int copy_container(unsigned long flags, struct task_struct *tsk,
+ struct container *container)
+{
+ struct container *c = container ?: tsk->container;
+ int ret = -ECANCELED;
+
+ spin_lock(&c->lock);
+
+ if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
+ list_add_tail(&tsk->container_link, &c->members);
+ get_container(c);
+ tsk->container = c;
+ if (!c->init) {
+ set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
+ c->init = tsk;
+ }
+ ret = 0;
+ }
+
+ spin_unlock(&c->lock);
+ return ret;
+}
+
+/*
+ * Remove a dead process from a container.
+ *
+ * If the 'init' process in a container dies, we kill off all the other
+ * processes in the container.
+ */
+void exit_container(struct task_struct *tsk)
+{
+ struct task_struct *p;
+ struct container *c = tsk->container;
+ struct siginfo si = {
+ .si_signo = SIGKILL,
+ .si_code = SI_KERNEL,
+ };
+
+ spin_lock(&c->lock);
+
+ list_del(&tsk->container_link);
+
+ if (c->init == tsk) {
+ c->init = NULL;
+ c->exit_code = tsk->exit_code;
+ smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
+ set_bit(CONTAINER_FLAG_DEAD, &c->flags);
+ wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
+
+ list_for_each_entry(p, &c->members, container_link) {
+ si.si_pid = task_tgid_vnr(p);
+ send_sig_info(SIGKILL, &si, p);
+ }
+ }
+
+ spin_unlock(&c->lock);
+ put_container(c);
+}
+
+/*
+ * Create some creds for the container. We don't want to pin things we don't
+ * have to, so drop all keyrings from the new cred. The LSM gets to audit the
+ * cred struct when security_container_alloc() is invoked.
+ */
+static const struct cred *create_container_creds(unsigned int flags)
+{
+ struct cred *new;
+ int ret;
+
+ new = prepare_creds();
+ if (!new)
+ return ERR_PTR(-ENOMEM);
+
+#ifdef CONFIG_KEYS
+ key_put(new->thread_keyring);
+ new->thread_keyring = NULL;
+ key_put(new->process_keyring);
+ new->process_keyring = NULL;
+ key_put(new->session_keyring);
+ new->session_keyring = NULL;
+ key_put(new->request_key_auth);
+ new->request_key_auth = NULL;
+#endif
+
+ if (flags & CONTAINER_NEW_USER_NS) {
+ ret = create_user_ns(new);
+ if (ret < 0)
+ goto err;
+ new->euid = new->user_ns->owner;
+ new->egid = new->user_ns->group;
+ }
+
+ new->fsuid = new->suid = new->uid = new->euid;
+ new->fsgid = new->sgid = new->gid = new->egid;
+ return new;
+
+err:
+ abort_creds(new);
+ return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container.
+ */
+static struct container *create_container(const char *name, unsigned int flags)
+{
+ struct container *parent, *c;
+ struct fs_struct *fs;
+ struct nsproxy *ns;
+ const struct cred *cred;
+ int ret;
+
+ c = alloc_container(name);
+ if (IS_ERR(c))
+ return c;
+
+ if (flags & CONTAINER_KILL_ON_CLOSE)
+ __set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
+
+ cred = create_container_creds(flags);
+ if (IS_ERR(cred)) {
+ ret = PTR_ERR(cred);
+ goto err_cont;
+ }
+ c->cred = cred;
+
+ ret = -ENOMEM;
+ fs = copy_fs_struct(current->fs);
+ if (!fs)
+ goto err_cont;
+
+ ns = create_new_namespaces(
+ (flags & CONTAINER_NEW_FS_NS ? CLONE_NEWNS : 0) |
+ (flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
+ (flags & CONTAINER_NEW_UTS_NS ? CLONE_NEWUTS : 0) |
+ (flags & CONTAINER_NEW_IPC_NS ? CLONE_NEWIPC : 0) |
+ (flags & CONTAINER_NEW_PID_NS ? CLONE_NEWPID : 0) |
+ (flags & CONTAINER_NEW_NET_NS ? CLONE_NEWNET : 0),
+ current->nsproxy, cred->user_ns, fs);
+ if (IS_ERR(ns)) {
+ ret = PTR_ERR(ns);
+ goto err_fs;
+ }
+
+ c->ns = ns;
+ c->root = fs->root;
+ c->seq = fs->seq;
+ fs->root.mnt = NULL;
+ fs->root.dentry = NULL;
+
+ ret = security_container_alloc(c, flags);
+ if (ret < 0)
+ goto err_fs;
+
+ parent = current->container;
+ get_container(parent);
+ c->parent = parent;
+ spin_lock(&parent->lock);
+ list_add_tail(&c->child_link, &parent->children);
+ spin_unlock(&parent->lock);
+ return c;
+
+err_fs:
+ free_fs_struct(fs);
+err_cont:
+ put_container(c);
+ return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container object.
+ */
+SYSCALL_DEFINE5(container_create,
+ const char __user *, name,
+ unsigned int, flags,
+ unsigned long, spare3,
+ unsigned long, spare4,
+ unsigned long, spare5)
+{
+ struct container *c;
+ struct file *f;
+ int ret, fd;
+
+ if (!name ||
+ flags & ~CONTAINER__FLAG_MASK ||
+ spare3 != 0 || spare4 != 0 || spare5 != 0)
+ return -EINVAL;
+ if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
+ (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
+ return -EINVAL;
+
+ c = create_container(name, flags);
+ if (IS_ERR(c))
+ return PTR_ERR(c);
+
+ f = create_container_file(c);
+ if (IS_ERR(f)) {
+ ret = PTR_ERR(f);
+ goto err_cont;
+ }
+
+ ret = get_unused_fd_flags(flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0);
+ if (ret < 0)
+ goto err_file;
+
+ fd = ret;
+ fd_install(fd, f);
+ return fd;
+
+err_file:
+ fput(f);
+ return ret;
+err_cont:
+ put_container(c);
+ return ret;
+}
+
+#endif /* CONFIG_CONTAINERS */
diff --git a/kernel/exit.c b/kernel/exit.c
index 31b8617aee04..1ff87f7e40a2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -875,6 +875,7 @@ void __noreturn do_exit(long code)
if (group_dead)
disassociate_ctty(1);
exit_task_namespaces(tsk);
+ exit_container(tsk);
exit_task_work(tsk);
exit_thread(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index aec6672d3f0e..ff2779426fe9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1728,9 +1728,12 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_namespaces(clone_flags, p);
if (retval)
goto bad_fork_cleanup_mm;
- retval = copy_io(clone_flags, p);
+ retval = copy_container(clone_flags, p, NULL);
if (retval)
goto bad_fork_cleanup_namespaces;
+ retval = copy_io(clone_flags, p);
+ if (retval)
+ goto bad_fork_cleanup_container;
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
goto bad_fork_cleanup_io;
@@ -1918,6 +1921,8 @@ static __latent_entropy struct task_struct *copy_process(
bad_fork_cleanup_io:
if (p->io_context)
exit_io_context(p);
+bad_fork_cleanup_container:
+ exit_container(p);
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_mm:
diff --git a/kernel/namespaces.h b/kernel/namespaces.h
new file mode 100644
index 000000000000..c44e3cf0e254
--- /dev/null
+++ b/kernel/namespaces.h
@@ -0,0 +1,15 @@
+/* Local namespaces defs
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+extern struct nsproxy *create_new_namespaces(unsigned long flags,
+ struct nsproxy *nsproxy,
+ struct user_namespace *user_ns,
+ struct fs_struct *new_fs);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..4bb5184b3a80 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -27,6 +27,7 @@
#include <linux/syscalls.h>
#include <linux/cgroup.h>
#include <linux/perf_event.h>
+#include "namespaces.h"
static struct kmem_cache *nsproxy_cachep;
@@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
* Return the newly created nsproxy. Do not attach this to the task,
* leave it to the caller to do proper locking and attach it to task.
*/
-static struct nsproxy *create_new_namespaces(unsigned long flags,
- struct task_struct *tsk, struct user_namespace *user_ns,
+struct nsproxy *create_new_namespaces(unsigned long flags,
+ struct nsproxy *nsproxy, struct user_namespace *user_ns,
struct fs_struct *new_fs)
{
struct nsproxy *new_nsp;
@@ -72,39 +73,39 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
if (!new_nsp)
return ERR_PTR(-ENOMEM);
- new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
+ new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
if (IS_ERR(new_nsp->mnt_ns)) {
err = PTR_ERR(new_nsp->mnt_ns);
goto out_ns;
}
- new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
+ new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
if (IS_ERR(new_nsp->uts_ns)) {
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
- new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
+ new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
err = PTR_ERR(new_nsp->ipc_ns);
goto out_ipc;
}
new_nsp->pid_ns_for_children =
- copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
+ copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
if (IS_ERR(new_nsp->pid_ns_for_children)) {
err = PTR_ERR(new_nsp->pid_ns_for_children);
goto out_pid;
}
new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
- tsk->nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns);
if (IS_ERR(new_nsp->cgroup_ns)) {
err = PTR_ERR(new_nsp->cgroup_ns);
goto out_cgroup;
}
- new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
+ new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
goto out_net;
@@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
(CLONE_NEWIPC | CLONE_SYSVSEM))
return -EINVAL;
- new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
+ new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);
@@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
- *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
+ *new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
new_fs ? new_fs : current->fs);
if (IS_ERR(*new_nsp)) {
err = PTR_ERR(*new_nsp);
@@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
if (nstype && (ns->ops->type != nstype))
goto out;
- new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+ new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
if (IS_ERR(new_nsproxy)) {
err = PTR_ERR(new_nsproxy);
goto out;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a0fe764bd5dd..99b1e1f58d05 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -262,3 +262,7 @@ cond_syscall(sys_pkey_free);
/* fd-based mount */
cond_syscall(sys_fsopen);
cond_syscall(sys_fsmount);
+
+/* Containers */
+cond_syscall(sys_container_create);
+
diff --git a/security/security.c b/security/security.c
index f4136ca5cb1b..b5c5b5ae1266 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1668,3 +1668,16 @@ int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule,
actx);
}
#endif /* CONFIG_AUDIT */
+
+#ifdef CONFIG_CONTAINERS
+
+int security_container_alloc(struct container *container, unsigned int flags)
+{
+ return call_int_hook(container_alloc, 0, container, flags);
+}
+
+void security_container_free(struct container *container)
+{
+ call_void_hook(container_free, container);
+}
+#endif /* CONFIG_CONTAINERS */
Rename linux/container.h to linux/container_dev.h so that linux/container.h
can be used for containers.
Signed-off-by: David Howells <[email protected]>
---
drivers/acpi/container.c | 2 +-
drivers/base/container.c | 2 +-
include/linux/container.h | 25 -------------------------
include/linux/container_dev.h | 25 +++++++++++++++++++++++++
4 files changed, 27 insertions(+), 27 deletions(-)
delete mode 100644 include/linux/container.h
create mode 100644 include/linux/container_dev.h
diff --git a/drivers/acpi/container.c b/drivers/acpi/container.c
index 12c240903c18..435db0694405 100644
--- a/drivers/acpi/container.c
+++ b/drivers/acpi/container.c
@@ -23,7 +23,7 @@
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
#include <linux/acpi.h>
-#include <linux/container.h>
+#include <linux/container_dev.h>
#include "internal.h"
diff --git a/drivers/base/container.c b/drivers/base/container.c
index ecbfbe2e908f..003bf5634f6b 100644
--- a/drivers/base/container.c
+++ b/drivers/base/container.c
@@ -9,7 +9,7 @@
* published by the Free Software Foundation.
*/
-#include <linux/container.h>
+#include <linux/container_dev.h>
#include "base.h"
diff --git a/include/linux/container.h b/include/linux/container.h
deleted file mode 100644
index 3c03e6fd2035..000000000000
--- a/include/linux/container.h
+++ /dev/null
@@ -1,25 +0,0 @@
-/*
- * Definitions for container bus type.
- *
- * Copyright (C) 2013, Intel Corporation
- * Author: Rafael J. Wysocki <[email protected]>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#include <linux/device.h>
-
-/* drivers/base/power/container.c */
-extern struct bus_type container_subsys;
-
-struct container_dev {
- struct device dev;
- int (*offline)(struct container_dev *cdev);
-};
-
-static inline struct container_dev *to_container_dev(struct device *dev)
-{
- return container_of(dev, struct container_dev, dev);
-}
diff --git a/include/linux/container_dev.h b/include/linux/container_dev.h
new file mode 100644
index 000000000000..3c03e6fd2035
--- /dev/null
+++ b/include/linux/container_dev.h
@@ -0,0 +1,25 @@
+/*
+ * Definitions for container bus type.
+ *
+ * Copyright (C) 2013, Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+
+/* drivers/base/power/container.c */
+extern struct bus_type container_subsys;
+
+struct container_dev {
+ struct device dev;
+ int (*offline)(struct container_dev *cdev);
+};
+
+static inline struct container_dev *to_container_dev(struct device *dev)
+{
+ return container_of(dev, struct container_dev, dev);
+}
[Added missing cc to containers list]
On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
> Here are a set of patches to define a container object for the kernel
> and to provide some methods to create and manipulate them.
>
> The reason I think this is necessary is that the kernel has no idea
> how to direct upcalls to what userspace considers to be a container -
> current Linux practice appears to make a "container" just an
> arbitrarily chosen junction of namespaces, control groups and files,
> which may be changed individually within the "container".
This sounds like a step in the wrong direction: the strength of the
current container interfaces in Linux is that people who set up
containers don't have to agree what they look like. So I can set up a
user namespace without a mount namespace or an architecture emulation
container with only a mount namespace.
But ignoring my fun foibles with containers and to give a concrete
example in terms of a popular orchestration system: in kubernetes,
where certain namespaces are shared across the containers of a pod, do
you imagine the
kernel's view of the "container" to be the pod or what kubernetes
thinks of as the container? This is important, because half the
examples you give below are network related and usually pods share a
network namespace.
> The kernel upcall mechanism then needs to decide which set of
> namespaces, etc., it must exec the appropriate upcall program.
> Examples of this include:
>
> (1) The DNS resolver. The DNS cache in the kernel should probably
> be per-network namespace, but in userspace the program, its
> libraries and its config data are associated with a mount tree and a
> user namespace and it gets run in a particular pid namespace.
All persistent data (data written to the fs) has to be mount-ns
associated; there are no ifs, ands or buts about that. I agree this
implies that if
you want to run a separate network namespace, you either take DNS from
the parent (a lot of containers do) or you set up a daemon to run
within the mount namespace. I agree the latter is a slightly fiddly
operation you have to get right, but that's why we have orchestration
systems.
What is it we could do with the above that we cannot do today?
> (2) NFS ID mapper. The NFS ID mapping cache should also probably be
> per-network namespace.
I think this is a view but not the only one: right at the moment, NFS
ID mapping is used as one of the ways to fix the problems with user
namespace ID mappings on writes to files ... that makes it a property
of the mount namespace for a lot of containers. There are many other
instances where they do exactly as you say, but what I'm saying is that
we don't want to lose the flexibility we currently have.
> (3) nfsdcltrack. A way for NFSD to access stable storage for
> tracking of persistent state. Again, network-namespace dependent,
> but also perhaps mount-namespace dependent.
So again, given we can set this up to work today, this sounds more like
a restriction that will bite us than an enhancement that gives us extra
features.
> (4) General request-key upcalls. Not particularly namespace
> dependent, apart from keyrings being somewhat governed by the user
> namespace and the upcall being configured by the mount namespace.
All mount namespaces have an owning user namespace, so the data
relations are already there in the kernel; is the problem simply
finding them?
> These patches are built on top of the mount context patchset so that
> namespaces can be properly propagated over submounts/automounts.
I'll stop here ... you get the idea that I think this is imposing a set
of restrictions that will come back to bite us later. If this is just
for the sake of figuring out how to get keyring upcalls to work, then
I'm sure we can come up with something.
James
This is interesting...
Adding a container object seems a bit odd to me because there are so
many different ways to make containers: not all namespaces are always
used, nor all cgroups; various LSM objects sometimes apply; mounts,
and so on. The OCI spec was made to cover all these things, so why a
kernel object? I don't exactly see a future where the container
runtimes convert to this unless it covers all the same modifications
as the OCI spec - not saying it needs to abide by the spec, just
saying it should allow all the same things. Which really just seems,
IMO, like a pain for the kernel to have to maintain.
On Mon, May 22, 2017 at 5:22 PM, David Howells <[email protected]> wrote:
>
> Here are a set of patches to define a container object for the kernel and
> to provide some methods to create and manipulate them.
>
> The reason I think this is necessary is that the kernel has no idea how to
> direct upcalls to what userspace considers to be a container - current
> Linux practice appears to make a "container" just an arbitrarily chosen
> junction of namespaces, control groups and files, which may be changed
> individually within the "container".
>
> The kernel upcall mechanism then needs to decide which set of namespaces,
> etc., it must exec the appropriate upcall program. Examples of this
> include:
>
> (1) The DNS resolver. The DNS cache in the kernel should probably be
> per-network namespace, but in userspace the program, its libraries and
> its config data are associated with a mount tree and a user namespace
> and it gets run in a particular pid namespace.
>
> (2) NFS ID mapper. The NFS ID mapping cache should also probably be
> per-network namespace.
>
> (3) nfsdcltrack. A way for NFSD to access stable storage for tracking
> of persistent state. Again, network-namespace dependent, but also
> perhaps mount-namespace dependent.
>
> (4) General request-key upcalls. Not particularly namespace dependent,
> apart from keyrings being somewhat governed by the user namespace and
> the upcall being configured by the mount namespace.
Can't these all become namespace-aware without adding the notion of a
"container" to the kernel.
>
> These patches are built on top of the mount context patchset so that
> namespaces can be properly propagated over submounts/automounts.
>
> These patches implement a container object that holds the following things:
>
> (1) Namespaces.
>
> (2) A root directory.
>
> (3) A set of processes, including a designated 'init' process.
>
> (4) The creator's credentials, including ownership.
>
> (5) A place to hang security for the container, allowing policies to be
> set per-container.
>
> I also want to add:
>
> (6) Control groups.
>
> (7) A per-container keyring that can be added to from outside of the
> container, even once the container is live, for the provision of
> filesystem authentication/encryption keys in advance of the container
> being started.
>
> You can get a list of containers by examining /proc/containers - but I'm
> not sure how much value this gets you. Note that the container in which
> you are running is called "<current>" and you can only see other containers
> that were started from within yours. Containers are therefore effectively
> hierarchical and an init_container is set up when the system boots.
>
>
> Some management operations are provided:
>
> (1) int fd = container_create(const char *name, unsigned int flags);
>
> Create a container of the given name and return a handle to it as a
> file descriptor. flags indicates what namespaces should be inherited
> from the caller and what should be replaced new. It is possible to
> set up a container with a null root filesystem that can be mounted
> later.
>
> (2) int fsfd = fsopen(const char *fsname, int container_fd,
> unsigned int flags);
>
> Prepare a mount context inside the container. This uses all the
> containers namespaces instead of the caller's.
>
> (3) fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
> unsigned int flags);
>
> Mount a prepared superblock. dfd can be given container_fd to use the
> container to which it refers as the root of the pathwalk.
>
> If path is "/" and at_flags is AT_FSMOUNT_CONTAINER_ROOT, then this
> will attempt to mount the root of the container and create a mount
> namespace for it. The container must've been created with
> CONTAINER_NEW_EMPTY_FS_NS.
>
> (4) pid_t pid = fork_into_container(int container_fd);
>
> Create the init process in a container. The process uses that
> container's namespaces instead of the caller's.
>
> (5) int sfd = container_socket(int container_fd,
> int domain, int type, int protocol);
>
> Create a socket inside a container. The socket gets the container's
> namespaces. This allows netlink operations to be called within that
> container to set it up from outside (at least in theory).
>
> (6) mkdirat(int dfd, ...);
> mknodat(int dfd, ...);
> openat(int dfd, ...);
>
> Supplying a container fd as dfd makes the pathwalk happen relative to
> the root of the container. Note that the path must be *relative*.
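To make the two quoted operations above concrete, here is a hypothetical
userspace sketch. There are no glibc wrappers for the proposed syscalls,
so assume container_socket() resolves to the new call and that cfd is a
container fd returned by container_create(); the paths are illustrative:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    static int setup_container(int cfd)
    {
        int sfd, fd;

        /* Item (5): a socket bound to the container's namespaces, so
         * rtnetlink can configure the container's network from outside. */
        sfd = container_socket(cfd, AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        if (sfd < 0)
            return -1;

        /* Item (6): pathwalk relative to the container's root; note
         * that the paths must be relative. */
        if (mkdirat(cfd, "etc", 0755) < 0)
            return -1;
        fd = openat(cfd, "etc/hostname", O_CREAT | O_WRONLY, 0644);
        return fd;
    }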
>
> And some need to be/could be added:
>
> (7) Directly set a container's namespaces to allow cross-container
> sharing.
>
> (8) Adjust the control group membership of a container.
>
> (9) Add a key inside a container keyring.
>
> (10) Kill/suspend/freeze/reboot container, both from inside and out.
>
> (11) Set container's root dir.
>
> (12) Set the container's security policy.
>
> (13) Allow overlayfs to access filesystems outside of the container in
> which it is being created.
>
>
> Kernel upcalls are invoked in the root of the container that incurs them
> rather than in the init namespace context. There's still some awkwardness
> here if you, say, share a network namespace between containers. Either the
> upcall binaries and configuration must be duplicated between sharing
> containers or a container must be elected as the one in which such upcalls
> will be done.
>
>
> Some further thoughts:
>
> (*) Should there be an AT_IN_CONTAINER flag to provide to syscalls that
> take a container in lieu of AT_FDCWD or a directory fd? The problem
> is that calls such as mkdirat() and openat() don't have an at_flags
> argument.
>
> (*) Should there be a container hierarchy at all? It seems that this is
> only really necessary for /proc/containers. Do we want to allow
> containers-within-containers?
>
> (*) Should each container automatically have its own pid namespace such
> that its 'init' process always appears as pid 1?
>
> (*) Does this allow kernel upcalls to be accounted against the correct
> control group?
>
> (*) Should each container have a 'list' of accessible device numbers such
> that certain device files can be made usable within a container? And
> can devtmpfs/udev be made to show the correct file set for each
> container?
>
>
> The patches can be found here also:
>
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=container
>
> Note that this is dependent on the mount-context branch.
>
> David
> ---
> David Howells (9):
> containers: Rename linux/container.h to linux/container_dev.h
> Implement containers as kernel objects
> Provide /proc/containers
> Allow processes to be forked and upcalled into a container
> Open a socket inside a container
> Allow fs syscall dfd arguments to take a container fd
> Make fsopen() able to initiate mounting into a container
> Honour CONTAINER_NEW_EMPTY_FS_NS
> Sample program for driving container objects
>
>
> arch/x86/entry/syscalls/syscall_32.tbl | 3
> arch/x86/entry/syscalls/syscall_64.tbl | 3
> drivers/acpi/container.c | 2
> drivers/base/container.c | 2
> fs/fsopen.c | 33 +-
> fs/libfs.c | 3
> fs/namei.c | 52 ++-
> fs/namespace.c | 108 +++++-
> fs/nfs/namespace.c | 2
> fs/nfs/nfs4namespace.c | 4
> fs/proc/root.c | 13 +
> fs/sb_config.c | 29 +-
> include/linux/container.h | 91 ++++-
> include/linux/container_dev.h | 25 +
> include/linux/cred.h | 3
> include/linux/init_task.h | 4
> include/linux/kmod.h | 1
> include/linux/lsm_hooks.h | 25 +
> include/linux/mount.h | 5
> include/linux/nsproxy.h | 7
> include/linux/pid.h | 5
> include/linux/proc_ns.h | 3
> include/linux/sb_config.h | 5
> include/linux/sched.h | 3
> include/linux/sched/task.h | 4
> include/linux/security.h | 20 +
> include/linux/syscalls.h | 6
> include/uapi/linux/container.h | 28 ++
> include/uapi/linux/fcntl.h | 2
> include/uapi/linux/magic.h | 1
> init/Kconfig | 7
> init/main.c | 4
> kernel/Makefile | 2
> kernel/container.c | 576 ++++++++++++++++++++++++++++++++
> kernel/cred.c | 45 ++-
> kernel/exit.c | 1
> kernel/fork.c | 117 ++++++-
> kernel/kmod.c | 13 +
> kernel/kthread.c | 3
> kernel/namespaces.h | 15 +
> kernel/nsproxy.c | 34 +-
> kernel/pid.c | 4
> kernel/sys_ni.c | 5
> net/socket.c | 37 ++
> samples/containers/test-container.c | 162 +++++++++
> security/security.c | 18 +
> security/selinux/hooks.c | 5
> 47 files changed, 1408 insertions(+), 132 deletions(-)
> create mode 100644 include/linux/container_dev.h
> create mode 100644 include/uapi/linux/container.h
> create mode 100644 kernel/container.c
> create mode 100644 kernel/namespaces.h
> create mode 100644 samples/containers/test-container.c
>
--
Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3
pgp.mit.edu
>> The reason I think this is necessary is that the kernel has no idea
>> how to direct upcalls to what userspace considers to be a container -
>> current Linux practice appears to make a "container" just an
>> arbitrarily chosen junction of namespaces, control groups and files,
>> which may be changed individually within the "container".
Just want to point out that if the kernel APIs for containers massively
change, then the OCI will have to completely rework how we describe
containers (and so will all existing runtimes).
Not to mention that while I don't like how hard it is (from a runtime
perspective) to actually set up a container securely, there are
undoubtedly benefits to having namespaces split out. Because the network
namespace is separate, in certain contexts you can choose not to create
a new network namespace when creating a container.
I had some ideas about how you could implement bridging in userspace (as
an unprivileged user, for rootless containers) but if you can't join
namespaces individually then such a setup is not practically possible.
--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
I had replied but not to the thread with the containers mailing list.
See https://marc.info/?l=linux-cgroups&m=149547317006676&w=2
On Mon, May 22, 2017 at 5:53 PM, James Bottomley
<[email protected]> wrote:
> [Added missing cc to containers list]
> On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
>> Here are a set of patches to define a container object for the kernel
>> and to provide some methods to create and manipulate them.
>>
>> The reason I think this is necessary is that the kernel has no idea
>> how to direct upcalls to what userspace considers to be a container -
>> current Linux practice appears to make a "container" just an
>> arbitrarily chosen junction of namespaces, control groups and files,
>> which may be changed individually within the "container".
>
> This sounds like a step in the wrong direction: the strength of the
> current container interfaces in Linux is that people who set up
> containers don't have to agree what they look like. So I can set up a
> user namespace without a mount namespace or an architecture emulation
> container with only a mount namespace.
>
> But ignoring my fun foibles with containers and to give a concrete
> example in terms of a popular orchestration system: in kubernetes,
> where certain namespaces are shared across pods, do you imagine the
> kernel's view of the "container" to be the pod or what kubernetes
> thinks of as the container? This is important, because half the
> examples you give below are network related and usually pods share a
> network namespace.
I am glad you pointed this out because I was trying to make the same
point: various definitions of containers differ, and who is to say
whether the various container runtimes (runc, rkt, systemd-nspawn) or
consumers of containers (kubernetes) won't modify their definitions in
the future? How will this scale as new LSMs like Landlock or new
namespaces are added in the future? Will they be included in the
container kernel object as well...
Seems like a lot more maintenance for something that is really just
making the keyring namespace-aware... unless there are other things I
missed.
>
>> The kernel upcall mechanism then needs to decide which set of
>> namespaces, etc., it must exec the appropriate upcall program.
>> Examples of this include:
>>
>> (1) The DNS resolver. The DNS cache in the kernel should probably
>> be per-network namespace, but in userspace the program, its
>> libraries and its config data are associated with a mount tree and a
>> user namespace and it gets run in a particular pid namespace.
>
> All persistent (written to fs data) has to be mount ns associated;
> there are no ifs, ands and buts to that. I agree this implies that if
> you want to run a separate network namespace, you either take DNS from
> the parent (a lot of containers do) or you set up a daemon to run
> within the mount namespace. I agree the latter is a slightly fiddly
> operation you have to get right, but that's why we have orchestration
> systems.
>
> What is it we could do with the above that we cannot do today?
>
>> (2) NFS ID mapper. The NFS ID mapping cache should also probably be
>> per-network namespace.
>
> I think this is a view but not the only one: Right at the moment, NFS
> ID mapping is used as the one of the ways we can get the user namespace
> ID mapping writes to file problems fixed ... that makes it a property
> of the mount namespace for a lot of containers. There are many other
> instances where they do exactly as you say, but what I'm saying is that
> we don't want to lose the flexibility we currently have.
>
>> (3) nfsdcltrack. A way for NFSD to access stable storage for
>> tracking of persistent state. Again, network-namespace dependent,
>> but also perhaps mount-namespace dependent.
>
> So again, given we can set this up to work today, this sounds like more
> a restriction that will bite us than an enhancement that gives us extra
> features.
>
>> (4) General request-key upcalls. Not particularly namespace
>> dependent, apart from keyrings being somewhat governed by the user
>> namespace and the upcall being configured by the mount namespace.
>
> All mount namespaces have an owning user namespace, so the data
> relations are already there in the kernel, is the problem simply
> finding them?
>
>> These patches are built on top of the mount context patchset so that
>> namespaces can be properly propagated over submounts/automounts.
>
> I'll stop here ... you get the idea that I think this is imposing a set
> of restrictions that will come back to bite us later. If this is just
> for the sake of figuring out how to get keyring upcalls to work, then
> I'm sure we can come up with something.
>
> James
>
--
Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3
pgp.mit.edu
On Mon, 2017-05-22 at 09:53 -0700, James Bottomley wrote:
> [Added missing cc to containers list]
> On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
> > Here are a set of patches to define a container object for the kernel
> > and to provide some methods to create and manipulate them.
> >
> > The reason I think this is necessary is that the kernel has no idea
> > how to direct upcalls to what userspace considers to be a container -
> > current Linux practice appears to make a "container" just an
> > arbitrarily chosen junction of namespaces, control groups and files,
> > which may be changed individually within the "container".
>
> This sounds like a step in the wrong direction: the strength of the
> current container interfaces in Linux is that people who set up
> containers don't have to agree what they look like. So I can set up a
> user namespace without a mount namespace or an architecture emulation
> container with only a mount namespace.
>
Does this really mandate what they look like though? AFAICT, you can
still spawn disconnected namespaces to your heart's content. What this
does is provide a container for several different namespaces so that the
kernel can actually be aware of the association between them. The way
you populate the different namespaces looks to be pretty flexible.
> But ignoring my fun foibles with containers and to give a concrete
> example in terms of a popular orchestration system: in kubernetes,
> where certain namespaces are shared across pods, do you imagine the
> kernel's view of the "container" to be the pod or what kubernetes
> thinks of as the container? This is important, because half the
> examples you give below are network related and usually pods share a
> network namespace.
>
> > The kernel upcall mechanism then needs to decide which set of
> > namespaces, etc., it must exec the appropriate upcall program.
> > Examples of this include:
> >
> > (1) The DNS resolver. The DNS cache in the kernel should probably
> > be per-network namespace, but in userspace the program, its
> > libraries and its config data are associated with a mount tree and a
> > user namespace and it gets run in a particular pid namespace.
>
> All persistent (written to fs data) has to be mount ns associated;
> there are no ifs, ands and buts to that. I agree this implies that if
> you want to run a separate network namespace, you either take DNS from
> the parent (a lot of containers do) or you set up a daemon to run
> within the mount namespace. I agree the latter is a slightly fiddly
> operation you have to get right, but that's why we have orchestration
> systems.
>
> What is it we could do with the above that we cannot do today?
>
Spawn a task directly from the kernel, already set up in the correct
namespaces, a'la call_usermodehelper. So far there is no way to do that,
and it is something we'd very much desire. Ian Kent has made several
passes at it recently.
> > (2) NFS ID mapper. The NFS ID mapping cache should also probably be
> > per-network namespace.
>
> I think this is a view but not the only one: Right at the moment, NFS
> ID mapping is used as the one of the ways we can get the user namespace
> ID mapping writes to file problems fixed ... that makes it a property
> of the mount namespace for a lot of containers. There are many other
> instances where they do exactly as you say, but what I'm saying is that
> we don't want to lose the flexibility we currently have.
>
> > (3) nfsdcltrack. A way for NFSD to access stable storage for
> > tracking of persistent state. Again, network-namespace dependent,
> > but also perhaps mount-namespace dependent.
Definitely mount-namespace dependent.
>
> So again, given we can set this up to work today, this sounds like more
> a restriction that will bite us than an enhancement that gives us extra
> features.
>
How do you set this up to work today?
AFAIK, if you want to run knfsd in a container today, you're out of luck
for any non-trivial configuration. The main reason is that most of knfsd
is namespace-ized in the network namespace, but there is no clear way to
associate that with a mount namespace, which is what we need to do this
properly inside a container. I think David's patches would get us there.
> > (4) General request-key upcalls. Not particularly namespace
> > dependent, apart from keyrings being somewhat governed by the user
> > namespace and the upcall being configured by the mount namespace.
>
> All mount namespaces have an owning user namespace, so the data
> relations are already there in the kernel, is the problem simply
> finding them?
>
> > These patches are built on top of the mount context patchset so that
> > namespaces can be properly propagated over submounts/automounts.
>
> I'll stop here ... you get the idea that I think this is imposing a set
> of restrictions that will come back to bite us later. If this is just
> for the sake of figuring out how to get keyring upcalls to work, then
> I'm sure we can come up with something.
>
--
Jeff Layton <[email protected]>
David Howells <[email protected]> writes:
> Here are a set of patches to define a container object for the kernel and
> to provide some methods to create and manipulate them.
>
> The reason I think this is necessary is that the kernel has no idea how to
> direct upcalls to what userspace considers to be a container - current
> Linux practice appears to make a "container" just an arbitrarily chosen
> junction of namespaces, control groups and files, which may be changed
> individually within the "container".
>
I think this might possibly be a useful abstraction for solving the
keyring upcalls if it was something created implicitly.
fork_into_container for use by keyring upcalls is currently a security
vulnerability as it allows escaping all of a container's cgroups. But
you have that on your list of things to fix. However you don't have
seccomp and a few other things.
Before we had kthreadd in the kernel, upcalls always had issues because
the code to reset all of the userspace bits and make the forked
task suitable for running upcalls was always missing some detail. It is
a very bug-prone kind of idiom that you are talking about. It is doubly
bug-prone because the wrongness is visible to userspace and as such
might become a frozen KABI guarantee.
Let me suggest a concrete alternative:
- At the time of mount, observe the mounter's user namespace.
- Find the mounter's pid namespace.
- If the mounter's pid namespace is owned by the mounter's user namespace,
  walk up the pid namespace tree to the first pid namespace owned by
  that user namespace.
- If the mounter's pid namespace is not owned by the mounter's user
  namespace, fail the mount; it is going to need to make upcalls and that
  will not be possible.
- Hold a reference to the pid namespace that was found.
- Hold a reference to the pid namespace that was found.
Then when an upcall needs to be made fork a child of the init process
of the specified pid namespace. Or fail if the init process of the
pid namespace has died.
That should always work and it does not require keeping expensive state
where we did not have it previously. Further, because the semantics are
"fork a child of a particular pid namespace's init", as features get added
to the kernel this code remains well defined.
For ordinary request-key upcalls we should be able to use the same rules
and just not save/restore things in the kernel.
A huge advantage of my alternative (other than not being a bit-rot
magnet) is that it should drop into existing container infrastructure
without problems. The rule for container implementors is simple: to use
security key infrastructure you need to have created a pid namespace in
your user namespace.
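A minimal kernel-side sketch of that walk (my illustration, not code
from the patch set; field and helper names follow current kernels):

    #include <linux/cred.h>
    #include <linux/err.h>
    #include <linux/pid_namespace.h>
    #include <linux/sched.h>
    #include <linux/user_namespace.h>

    /* Find the pid namespace whose init should parent upcalls, failing
     * if the mounter's pid ns is not owned by its user ns. */
    static struct pid_namespace *find_upcall_pidns(void)
    {
            struct user_namespace *user_ns = current_user_ns();
            struct pid_namespace *pid_ns = task_active_pid_ns(current);

            if (pid_ns->user_ns != user_ns)
                    return ERR_PTR(-EPERM); /* upcalls impossible */

            /* Walk up to the first pid ns owned by this user ns. */
            while (pid_ns->parent && pid_ns->parent->user_ns == user_ns)
                    pid_ns = pid_ns->parent;

            return get_pid_ns(pid_ns);      /* hold a reference */
    }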
Eric
On Mon, 2017-05-22 at 14:34 -0400, Jeff Layton wrote:
> On Mon, 2017-05-22 at 09:53 -0700, James Bottomley wrote:
> > [Added missing cc to containers list]
> > On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
> > > Here are a set of patches to define a container object for the
> > > kernel and to provide some methods to create and manipulate them.
> > >
> > > The reason I think this is necessary is that the kernel has no
> > > idea how to direct upcalls to what userspace considers to be a
> > > container - current Linux practice appears to make a "container"
> > > just an arbitrarily chosen junction of namespaces, control groups
> > > and files, which may be changed individually within the
> > > "container".
> >
> > This sounds like a step in the wrong direction: the strength of the
> > current container interfaces in Linux is that people who set up
> > containers don't have to agree what they look like. So I can set
> > up a user namespace without a mount namespace or an architecture
> > emulation container with only a mount namespace.
> >
>
> Does this really mandate what they look like though? AFAICT, you can
> still spawn disconnected namespaces to your heart's content. What
> this does is provide a container for several different namespaces so
> that the kernel can actually be aware of the association between
> them.
Yes, because it imposes a view of what is in a container. As several
replies have pointed out (and indeed as I pointed out below for
kubernetes), this isn't something the orchestration systems would find
usable.
> The way you populate the different namespaces looks to be pretty
> flexible.
OK, but look at it another way: if we provide a container API that no
actual consumer of container technologies wants to use, just because we
think it makes certain tasks easy, is it really a good API?
Containers are multi-layered and complex. If you're not ready for this
as a user, then you should use an orchestration system that prevents
you from screwing up.
> > But ignoring my fun foibles with containers and to give a concrete
> > example in terms of a popular orchestration system: in kubernetes,
> > where certain namespaces are shared across pods, do you imagine the
> > kernel's view of the "container" to be the pod or what kubernetes
> > thinks of as the container? This is important, because half the
> > examples you give below are network related and usually pods share
> > a network namespace.
> >
> > > The kernel upcall mechanism then needs to decide which set of
> > > namespaces, etc., it must exec the appropriate upcall program.
> > > Examples of this include:
> > >
> > > (1) The DNS resolver. The DNS cache in the kernel should
> > > probably be per-network namespace, but in userspace the program,
> > > its libraries and its config data are associated with a mount
> > > tree and a user namespace and it gets run in a particular pid
> > > namespace.
> >
> > All persistent (written to fs data) has to be mount ns associated;
> > there are no ifs, ands and buts to that. I agree this implies that
> > if you want to run a separate network namespace, you either take
> > DNS from the parent (a lot of containers do) or you set up a daemon
> > to run within the mount namespace. I agree the latter is a
> > slightly fiddly operation you have to get right, but that's why we
> > have orchestration systems.
> >
> > What is it we could do with the above that we cannot do today?
> >
>
> Spawn a task directly from the kernel, already set up in the correct
> namespaces, a'la call_usermodehelper. So far there is no way to do
> that,
Today the usermode helper has to be namespace aware. We spawn it into
the root namespace and it jumps into the correct namespace/cgroup
combination and re-executes itself or simply performs the requisite
task on behalf of the container. Is this simple? No. Does it work?
Yes, provided the host OS is aware of what the container orchestration
system wants it to do.
> and it is something we'd very much desire. Ian Kent has made several
> passes at it recently.
Well, every time we try to remove some of the complexity from
userspace, we end up wrapping around the axle of what exactly we're
trying to achieve, yes.
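To illustrate the "spawn into the root and jump" pattern just
described, here is a rough userspace sketch of the helper side; the
target init pid and the namespace list are hypothetical:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Join one namespace of the container's init before re-executing;
     * ns is e.g. "mnt", "net", "pid". */
    static int jump_into(pid_t init_pid, const char *ns)
    {
            char path[64];
            int fd, ret;

            snprintf(path, sizeof(path), "/proc/%d/ns/%s", init_pid, ns);
            fd = open(path, O_RDONLY);
            if (fd < 0)
                    return -1;
            ret = setns(fd, 0);     /* 0 = don't check the ns type */
            close(fd);
            return ret;
    }

The helper calls this for each namespace it needs, then re-executes
itself (or performs the task directly in the joined context).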
> > > (2) NFS ID mapper. The NFS ID mapping cache should also
> > > probably be per-network namespace.
> >
> > I think this is a view but not the only one: Right at the moment,
> > NFS ID mapping is used as the one of the ways we can get the user
> > namespace ID mapping writes to file problems fixed ... that makes
> > it a property of the mount namespace for a lot of containers.
> > There are many other instances where they do exactly as you say,
> > but what I'm saying is that we don't want to lose the flexibility
> > we currently have.
> >
> > > (3) nfsdcltrack. A way for NFSD to access stable storage for
> > > tracking of persistent state. Again, network-namespace
> > > dependent, but also perhaps mount-namespace dependent.
>
> Definitely mount-namespace dependent.
>
> >
> > So again, given we can set this up to work today, this sounds like
> > more a restriction that will bite us than an enhancement that gives
> > us extra features.
> >
>
> How do you set this up to work today?
Well, as above, it spawns into the root, you jump it to where it should
be and re-execute or simply handle in the host.
> AFAIK, if you want to run knfsd in a container today, you're out of
> luck for any non-trivial configuration.
Well "running knfsd in a container" is actually different from having a
containerised nfs export. My understanding was that thanks to the work
of Stas Kinsbursky, the latter has mostly worked since the 3.9 kernel
for v3 and below. I assume the current issue is that there's a problem
with v4?
James
> The main reason is that most of knfsd is namespace-ized in the
> network namespace, but there is no clear way to associate that with a
> mount namespace, which is what we need to do this properly inside a
> container. I think David's patches would get us there.
>
> > > (4) General request-key upcalls. Not particularly namespace
> > > dependent, apart from keyrings being somewhat governed by the
> > > user namespace and the upcall being configured by the mount
> > > namespace.
> >
> > All mount namespaces have an owning user namespace, so the data
> > relations are already there in the kernel, is the problem simply
> > finding them?
> >
> > > These patches are built on top of the mount context patchset so
> > > that namespaces can be properly propagated over
> > > submounts/automounts.
> >
> > I'll stop here ... you get the idea that I think this is imposing a
> > set of restrictions that will come back to bite us later. If this
> > is just for the sake of figuring out how to get keyring upcalls to
> > work, then I'm sure we can come up with something.
> >
>
On Mon, 2017-05-22 at 12:21 -0700, James Bottomley wrote:
> On Mon, 2017-05-22 at 14:34 -0400, Jeff Layton wrote:
> > On Mon, 2017-05-22 at 09:53 -0700, James Bottomley wrote:
> > > [Added missing cc to containers list]
> > > On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
> > > > Here are a set of patches to define a container object for the
> > > > kernel and to provide some methods to create and manipulate them.
> > > >
> > > > The reason I think this is necessary is that the kernel has no
> > > > idea how to direct upcalls to what userspace considers to be a
> > > > container - current Linux practice appears to make a "container"
> > > > just an arbitrarily chosen junction of namespaces, control groups
> > > > and files, which may be changed individually within the
> > > > "container".
> > >
> > > This sounds like a step in the wrong direction: the strength of the
> > > current container interfaces in Linux is that people who set up
> > > containers don't have to agree what they look like. So I can set
> > > up a user namespace without a mount namespace or an architecture
> > > emulation container with only a mount namespace.
> > >
> >
> > Does this really mandate what they look like though? AFAICT, you can
> > still spawn disconnected namespaces to your heart's content. What
> > this does is provide a container for several different namespaces so
> > that the kernel can actually be aware of the association between
> > them.
>
> Yes, because it imposes a view of what is in a container. As several
> replies have pointed out (and indeed as I pointed out below for
> kubernetes), this isn't something the orchestration systems would find
> usable.
>
> > The way you populate the different namespaces looks to be pretty
> > flexible.
>
> OK, but look at it another way: if we provide a container API that no
> actual consumer of container technologies wants to use, just because we
> think it makes certain tasks easy, is it really a good API?
>
> Containers are multi-layered and complex. If you're not ready for this
> as a user, then you should use an orchestration system that prevents
> you from screwing up.
>
> > > But ignoring my fun foibles with containers and to give a concrete
> > > example in terms of a popular orchestration system: in kubernetes,
> > > where certain namespaces are shared across pods, do you imagine the
> > > kernel's view of the "container" to be the pod or what kubernetes
> > > thinks of as the container? This is important, because half the
> > > examples you give below are network related and usually pods share
> > > a network namespace.
> > >
> > > > The kernel upcall mechanism then needs to decide which set of
> > > > namespaces, etc., it must exec the appropriate upcall program.
> > > > Examples of this include:
> > > >
> > > > (1) The DNS resolver. The DNS cache in the kernel should
> > > > probably be per-network namespace, but in userspace the program,
> > > > its libraries and its config data are associated with a mount
> > > > tree and a user namespace and it gets run in a particular pid
> > > > namespace.
> > >
> > > All persistent (written to fs data) has to be mount ns associated;
> > > there are no ifs, ands and buts to that. I agree this implies that
> > > if you want to run a separate network namespace, you either take
> > > DNS from the parent (a lot of containers do) or you set up a daemon
> > > to run within the mount namespace. I agree the latter is a
> > > slightly fiddly operation you have to get right, but that's why we
> > > have orchestration systems.
> > >
> > > What is it we could do with the above that we cannot do today?
> > >
> >
> > Spawn a task directly from the kernel, already set up in the correct
> > namespaces, a'la call_usermodehelper. So far there is no way to do
> > that,
>
> Today the usermode helper has to be namespace aware. We spawn it into
> the root namespace and it jumps into the correct namespace/cgroup
> combination and re-executes itself or simply performs the requisite
> task on behalf of the container. Is this simple? No. Does it work?
> Yes, provided the host OS is aware of what the container orchestration
> system wants it to do.
>
> > and it is something we'd very much desire. Ian Kent has made several
> > passes at it recently.
>
> Well, every time we try to remove some of the complexity from
> userspace, we end up wrapping around the axle of what exactly we're
> trying to achieve, yes.
>
> > > > (2) NFS ID mapper. The NFS ID mapping cache should also
> > > > probably be per-network namespace.
> > >
> > > I think this is a view but not the only one: Right at the moment,
> > > NFS ID mapping is used as the one of the ways we can get the user
> > > namespace ID mapping writes to file problems fixed ... that makes
> > > it a property of the mount namespace for a lot of containers.
> > > There are many other instances where they do exactly as you say,
> > > but what I'm saying is that we don't want to lose the flexibility
> > > we currently have.
> > >
> > > > (3) nfsdcltrack. A way for NFSD to access stable storage for
> > > > tracking of persistent state. Again, network-namespace
> > > > dependent, but also perhaps mount-namespace dependent.
> >
> > Definitely mount-namespace dependent.
> >
> > >
> > > So again, given we can set this up to work today, this sounds like
> > > more a restriction that will bite us than an enhancement that gives
> > > us extra features.
> > >
> >
> > How do you set this up to work today?
>
> Well, as above, it spawns into the root, you jump it to where it should
> be and re-execute or simply handle in the host.
>
> > AFAIK, if you want to run knfsd in a container today, you're out of
> > luck for any non-trivial configuration.
>
> Well "running knfsd in a container" is actually different from having a
> containerised nfs export. My understanding was that thanks to the work
> of Stas Kinsbursky, the latter has mostly worked since the 3.9 kernel
> for v3 and below. I assume the current issue is that there's a problem
> with v4?
>
Yes -- v3 mostly works because the equivalent state-tracking (rpc.statd)
is run as a long-running daemon.
nfsdcltrack uses call_usermodehelper, so for that you need to be able to
determine what mount namespace to run the thing in. All we really know
in knfsd when we want to do an upcall is the net namespace. We could
really use a way to associate the two and spawn the thing in the correct
container (or pass it enough info for it to setns() into the right
ones).
In principle, we could just ensure that we do all of this sort of thing
with long-running daemons that are started whenever the container
starts. But...having to run daemons full-time for infrequently-used
services sort of sucks and requires it to be set up. UMH helpers just get
run as long as the binary is in the right place.
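For comparison, a UMH-style upcall today looks roughly like this (a
sketch, not the actual nfsdcltrack call site; path and arguments are
illustrative). Everything it spawns runs in the init namespaces, which
is exactly the problem being discussed:

    #include <linux/kmod.h>

    /* Sketch: invoke a tracking helper via the usermode-helper API. */
    static int cltrack_upcall_sketch(char *event, char *client_id)
    {
            char *argv[] = { "/sbin/nfsdcltrack", event, client_id, NULL };
            char *envp[] = { "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

            return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
    }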
I've also been reading over Eric's suggestion, and that seems like it
might work as well though.
> > The main reason is that most of knfsd is namespace-ized in the
> > network namespace, but there is no clear way to associate that with a
> > mount namespace, which is what we need to do this properly inside a
> > container. I think David's patches would get us there.
> >
> > > > (4) General request-key upcalls. Not particularly namespace
> > > > dependent, apart from keyrings being somewhat governed by the
> > > > user namespace and the upcall being configured by the mount
> > > > namespace.
> > >
> > > All mount namespaces have an owning user namespace, so the data
> > > relations are already there in the kernel, is the problem simply
> > > finding them?
> > >
> > > > These patches are built on top of the mount context patchset so
> > > > that namespaces can be properly propagated over
> > > > submounts/automounts.
> > >
> > > I'll stop here ... you get the idea that I think this is imposing a
> > > set of restrictions that will come back to bite us later. If this
> > > is just for the sake of figuring out how to get keyring upcalls to
> > > work, then I'm sure we can come up with something.
> > >
>
>
--
Jeff Layton <[email protected]>
On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
> David Howells <[email protected]> writes:
>
> > Here are a set of patches to define a container object for the kernel and
> > to provide some methods to create and manipulate them.
> >
> > The reason I think this is necessary is that the kernel has no idea how to
> > direct upcalls to what userspace considers to be a container - current
> > Linux practice appears to make a "container" just an arbitrarily chosen
> > junction of namespaces, control groups and files, which may be changed
> > individually within the "container".
> >
>
> I think this might possibly be a useful abstraction for solving the
> keyring upcalls if it was something created implicitly.
>
> fork_into_container for use by keyring upcalls is currently a security
> vulnerability as it allows escaping all of a container's cgroups. But
> you have that on your list of things to fix. However you don't have
> seccomp and a few other things.
>
> Before we had kthreadd in the kernel upcalls always had issues because
> the code to reset all of the userspace bits and make the forked
> task suitable for running upcalls was always missing some detail. It is
> a very bug-prone kind of idiom that you are talking about. It is doubly
> bug-prone because the wrongness is visible to userspace and as such
> might become a frozen KABI guarantee.
>
> Let me suggest a concrete alternative:
>
> - At the time of mount, observe the mounter's user namespace.
> - Find the mounter's pid namespace.
> - If the mounter's pid namespace is owned by the mounter's user namespace,
>   walk up the pid namespace tree to the first pid namespace owned by
>   that user namespace.
> - If the mounter's pid namespace is not owned by the mounter's user
>   namespace, fail the mount; it is going to need to make upcalls and that
>   will not be possible.
> - Hold a reference to the pid namespace that was found.
>
> Then when an upcall needs to be made fork a child of the init process
> of the specified pid namespace. Or fail if the init process of the
> pid namespace has died.
>
> That should always work and it does not require keeping expensive state
> where we did not have it previously. Further because the semantics are
> fork a child of a particular pid namespace's init as features get added
> to the kernel this code remains well defined.
>
> For ordinary request-key upcalls we should be able to use the same rules
> and just not save/restore things in the kernel.
>
OK, that does seem like a reasonable idea. Note that it's not just
request-key upcalls here that we're interested in, but anything that
we'd typically spawn from kthreadd otherwise.
That said, I worry a little about this. If the init process does a setns
at the wrong time, suddenly you're doing the upcall in different
namespaces than you intended.
Might it be better to use the init process of the container as the
template like you suggest, but snapshot its "context" at a particular
point in time instead?
knfsd could do this when it's started, for instance...
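A rough sketch of that snapshot idea (purely illustrative, not from the
patches): pin the starter's namespaces and creds when the service is
configured, and use them for later upcalls.

    #include <linux/cred.h>
    #include <linux/nsproxy.h>
    #include <linux/sched.h>
    #include <linux/slab.h>

    struct upcall_ctx {
            struct nsproxy *ns;
            const struct cred *cred;
    };

    /* Snapshot the caller's context, e.g. when knfsd is started. */
    static struct upcall_ctx *snapshot_upcall_ctx(void)
    {
            struct upcall_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

            if (!ctx)
                    return NULL;
            get_nsproxy(current->nsproxy);  /* take a reference */
            ctx->ns = current->nsproxy;
            ctx->cred = get_current_cred();
            return ctx;
    }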
> A huge advantage of my alternative (other than not being a bit-rot
> magnet) is that it should drop into existing container infrastructure
> without problems. The rule for container implementors is simple: to use
> security key infrastructure you need to have created a pid namespace in
> your user namespace.
>
> Eric
--
Jeff Layton <[email protected]>
On Mon, 2017-05-22 at 09:53 -0700, James Bottomley wrote:
> [Added missing cc to containers list]
> On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
> > Here are a set of patches to define a container object for the kernel
> > and to provide some methods to create and manipulate them.
> >
> > The reason I think this is necessary is that the kernel has no idea
> > how to direct upcalls to what userspace considers to be a container -
> > current Linux practice appears to make a "container" just an
> > arbitrarily chosen junction of namespaces, control groups and files,
> > which may be changed individually within the "container".
>
> This sounds like a step in the wrong direction: the strength of the
> current container interfaces in Linux is that people who set up
> containers don't have to agree what they look like. So I can set up a
> user namespace without a mount namespace or an architecture emulation
> container with only a mount namespace.
>
> But ignoring my fun foibles with containers and to give a concrete
> example in terms of a popular orchestration system: in kubernetes,
> where certain namespaces are shared across pods, do you imagine the
> kernel's view of the "container" to be the pod or what kubernetes
> thinks of as the container? This is important, because half the
> examples you give below are network related and usually pods share a
> network namespace.
>
> > The kernel upcall mechanism then needs to decide which set of
> > namespaces, etc., it must exec the appropriate upcall program.
> > Examples of this include:
> >
> > (1) The DNS resolver. The DNS cache in the kernel should probably
> > be per-network namespace, but in userspace the program, its
> > libraries and its config data are associated with a mount tree and a
> > user namespace and it gets run in a particular pid namespace.
>
> All persistent (written to fs data) has to be mount ns associated;
> there are no ifs, ands and buts to that. I agree this implies that if
> you want to run a separate network namespace, you either take DNS from
> the parent (a lot of containers do) or you set up a daemon to run
> within the mount namespace. I agree the latter is a slightly fiddly
> operation you have to get right, but that's why we have orchestration
> systems.
>
> What is it we could do with the above that we cannot do today?
>
> > (2) NFS ID mapper. The NFS ID mapping cache should also probably be
> >     per-network namespace.
>
> I think this is a view but not the only one: Right at the moment, NFS
> ID mapping is used as the one of the ways we can get the user namespace
> ID mapping writes to file problems fixed ... that makes it a property
> of the mount namespace for a lot of containers. There are many other
> instances where they do exactly as you say, but what I'm saying is that
> we don't want to lose the flexibility we currently have.
>
> > (3) nfsdcltrack. A way for NFSD to access stable storage for
> > tracking of persistent state. Again, network-namespace dependent,
> > but also perhaps mount-namespace dependent.
>
> So again, given we can set this up to work today, this sounds like more
> a restriction that will bite us than an enhancement that gives us extra
> features.
>
> > (4) General request-key upcalls. Not particularly namespace
> > dependent, apart from keyrings being somewhat governed by the user
> > namespace and the upcall being configured by the mount namespace.
>
> All mount namespaces have an owning user namespace, so the data
> relations are already there in the kernel, is the problem simply
> finding them?
>
> > These patches are built on top of the mount context patchset so that
> > namespaces can be properly propagated over submounts/automounts.
>
> I'll stop here ... you get the idea that I think this is imposing a set
> of restrictions that will come back to bite us later. If this is just
> for the sake of figuring out how to get keyring upcalls to work, then
> I'm sure we can come up with something.
You talk about a number of things I'm simply not aware of so I can't answer your
questions. But your points do sound like issues that need to be covered.
I think you mentioned the userspace NFS ID mapper works fine.
I wonder, could you give more detail on that please?
Perhaps nsenter(1) is being used. I tried that as a possible usermode helper
solution and it probably did "work" in the sense of in-container execution,
but no-one liked it; it seems kernel folk expect to do things, well, in
kernel.
Not only that, there were other problems: probably the request-key subsystem
not being namespace aware, or id caching within nfs or somewhere else, and
there was a question of not being able to cater for user namespace usage.
Anyway I do have a different view from my own experiences.
First, there are a number of subsystems involved in creating a process from
within a container that has the container environment and, AFAICS (from the
usermode helper experience), it needs to be done from outside the container.
For example, subsystems that need to be handled properly are the namespaces
(the pid namespace in particular is tricky), credentials and cgroups, to name
those that come immediately to mind. I just couldn't get all that right after
a number of tries.
On Mon, 2017-05-22 at 17:22 +0100, David Howells wrote:
> Here are a set of patches to define a container object for the kernel and
> to provide some methods to create and manipulate them.
>
> The reason I think this is necessary is that the kernel has no idea how to
> direct upcalls to what userspace considers to be a container - current
> Linux practice appears to make a "container" just an arbitrarily chosen
> junction of namespaces, control groups and files, which may be changed
> individually within the "container".
>
> The kernel upcall mechanism then needs to decide which set of namespaces,
> etc., it must exec the appropriate upcall program. Examples of this
> include:
>
> (1) The DNS resolver. The DNS cache in the kernel should probably be
>     per-network namespace, but in userspace the program, its libraries and
>     its config data are associated with a mount tree and a user namespace
>     and it gets run in a particular pid namespace.
>
> (2) NFS ID mapper. The NFS ID mapping cache should also probably be
>     per-network namespace.
>
> (3) nfsdcltrack. A way for NFSD to access stable storage for tracking
>     of persistent state. Again, network-namespace dependent, but also
>     perhaps mount-namespace dependent.
>
> (4) General request-key upcalls. Not particularly namespace dependent,
>     apart from keyrings being somewhat governed by the user namespace and
>     the upcall being configured by the mount namespace.
>
> These patches are built on top of the mount context patchset so that
> namespaces can be properly propagated over submounts/automounts.
>
> These patches implement a container object that holds the following things:
>
> (1) Namespaces.
>
> (2) A root directory.
>
> (3) A set of processes, including a designated 'init' process.
>
> (4) The creator's credentials, including ownership.
>
> (5) A place to hang security for the container, allowing policies to be
>     set per-container.
>
> I also want to add:
>
> (6) Control groups.
>
> (7) A per-container keyring that can be added to from outside of the
>     container, even once the container is live, for the provision of
>     filesystem authentication/encryption keys in advance of the container
>     being started.
It's hard to decide which of these has higher priority; I think both are
essential to a container implementation.
Ian
On Mon, 2017-05-22 at 12:21 -0700, James Bottomley wrote:
>
> > > > (3) nfsdcltrack. A way for NFSD to access stable storage for
> > > > tracking of persistent state. Again, network-namespace
> > > > dependent, but also perhaps mount-namespace dependent.
> >
> > Definitely mount-namespace dependent.
> >
> > >
> > > So again, given we can set this up to work today, this sounds like
> > > more a restriction that will bite us than an enhancement that gives
> > > us extra features.
> > >
> >
> > How do you set this up to work today?
>
> Well, as above, it spawns into the root, you jump it to where it should
> be and re-execute or simply handle in the host.
>
> > AFAIK, if you want to run knfsd in a container today, you're out of
> > luck for any non-trivial configuration.
>
> Well "running knfsd in a container" is actually different from having a
> containerised nfs export. My understanding was that thanks to the work
> of Stas Kinsbursky, the latter has mostly worked since the 3.9 kernel
> for v3 and below. I assume the current issue is that there's a problem
> with v4?
Oh, ok, I thought that, say, a docker (NFS) volumes-from from one container
to another didn't work for any version of NFS.
It certainly didn't work the last time I tried, though that was a while ago.
Ian
Jeff Layton <[email protected]> writes:
> On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
>> David Howells <[email protected]> writes:
>>
>> > Here are a set of patches to define a container object for the kernel and
>> > to provide some methods to create and manipulate them.
>> >
>> > The reason I think this is necessary is that the kernel has no idea how to
>> > direct upcalls to what userspace considers to be a container - current
>> > Linux practice appears to make a "container" just an arbitrarily chosen
>> > junction of namespaces, control groups and files, which may be changed
>> > individually within the "container".
>> >
>>
>> I think this might possibly be a useful abstraction for solving the
>> keyring upcalls if it was something created implicitly.
>>
>> fork_into_container for use by keyring upcalls is currently a security
>> vulnerability as it allows escaping all of a container's cgroups. But
>> you have that on your list of things to fix. However you don't have
>> seccomp and a few other things.
>>
>> Before we had kthreadd in the kernel upcalls always had issues because
>> the code to reset all of the userspace bits and make the forked
>> task suitable for running upcalls was always missing some detail. It is
>> a very bug-prone kind of idiom that you are talking about. It is doubly
>> bug-prone because the wrongness is visible to userspace and as such
>> might become a frozen KABI guarantee.
>>
>> Let me suggest a concrete alternative:
>>
>> - At the time of mount, observe the mounter's user namespace.
>> - Find the mounter's pid namespace.
>> - If the mounter's pid namespace is owned by the mounter's user namespace,
>>   walk up the pid namespace tree to the first pid namespace owned by
>>   that user namespace.
>> - If the mounter's pid namespace is not owned by the mounter's user
>>   namespace, fail the mount; it is going to need to make upcalls and that
>>   will not be possible.
>> - Hold a reference to the pid namespace that was found.
>>
>> Then when an upcall needs to be made fork a child of the init process
>> of the specified pid namespace. Or fail if the init process of the
>> pid namespace has died.
>>
>> That should always work and it does not require keeping expensive state
>> where we did not have it previously. Further because the semantics are
>> fork a child of a particular pid namespace's init as features get added
>> to the kernel this code remains well defined.
>>
>> For ordinary request-key upcalls we should be able to use the same rules
>> and just not save/restore things in the kernel.
>>
>
> OK, that does seem like a reasonable idea. Note that it's not just
> request-key upcalls here that we're interested in, but anything that
> we'd typically spawn from kthreadd otherwise.
General user mode helper *Nod*.
> That said, I worry a little about this. If the init process does a setns
> at the wrong time, suddenly you're doing the upcall in different
> namespaces than you intended.
>
> Might it be better to use the init process of the container as the
> template like you suggest, but snapshot its "context" at a particular
> point in time instead?
>
> knfsd could do this when it's started, for instance...
The danger of a snapshot in time is that something important (like cgroup
membership) might change.
It might be necessary to have this be an opt-in. Perhaps even to the
point of starting a dedicated kthreadd.
Right now I think we need to figure out what it will take to solve this
in the kernel because I strongly suspect that solving this in userspace
is a cop out and we really aren't providing enough information to
userspace to run the helper in the proper context. And I strongly
suspect that providing enough information from the kernel will be
roughly equivalent to solving this in the kernel.
The only big issue I have had with the suggestion of a dedicated thread
in the past is the overhead something like that will bring with it.
Eric
James Bottomley <[email protected]> wrote:
> This sounds like a step in the wrong direction: the strength of the
> current container interfaces in Linux is that people who set up
> containers don't have to agree what they look like.
It may be a strength, but it is also a problem.
> So I can set up a user namespace without a mount namespace or an
> architecture emulation container with only a mount namespace.
(I presume you mean with only the mount namespace separate)
Yep. You can do that with this too.
> But ignoring my fun foibles with containers and to give a concrete
> example in terms of a popular orchestration system: in kubernetes,
> where certain namespaces are shared across pods, do you imagine the
> kernel's view of the "container" to be the pod or what kubernetes
> thinks of as the container?
Why not both? If the net_ns is created in the pod container, then probably
network-related upcalls should be directed there. Unless instructed
otherwise, upon creation a container object will inherit the caller's
namespaces.
> This is important, because half the examples you give below are network
> related and usually pods share a network namespace.
Yeah - I'm more familiar with upcalls made by NFS, AFS and keyrings.
> > (1) The DNS resolver. ...
>
> All persistent (written to fs data) has to be mount ns associated;
> there are no ifs, ands and buts to that. I agree this implies that if
> you want to run a separate network namespace, you either take DNS from
> the parent (a lot of containers do)
My intention is to make the DNS cache per-network namespace within the kernel.
Currently, there's only one and it's shared between all namespaces, but that
won't work if you end up with two net namespaces that direct the same
addresses to different things.
> or you set up a daemon to run within the mount namespace.
That's not currently an option: the kernel's DNS resolution works by
upcall only, and /sbin/request-key is invoked in the init_ns, which
performs the network accesses in the wrong network namespace.
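(For context, the upcall in question is the one configured in
/etc/request-key.conf; the stock keyutils line looks something like the
following, and the helper it names is always run in the init namespaces:)

    create  dns_resolver  *  *  /sbin/key.dns_resolver %k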
> I agree the latter is a slightly fiddly operation you have to get right, but
> that's why we have orchestration systems.
An orchestration system can use this. This is not a replacement for
Kubernetes or Docker or whatever.
> What is it we could do with the above that we cannot do today?
Upcall into an appropriate set of namespaces and keep the results separate by
network namespace.
> > (2) NFS ID mapper. The NFS ID mapping cache should also probably be
> > per-network namespace.
>
> I think this is a view but not the only one: Right at the moment, NFS
> ID mapping is used as the one of the ways we can get the user namespace
> ID mapping writes to file problems fixed ... that makes it a property
> of the mount namespace for a lot of containers.
In some ways it's really a property of the server, and two different servers
may appear in two separate network namespaces with the same apparent name and
address.
It's not a property of the mount namespace because mount namespaces share
superblocks, and this is done at the superblock level.
Possibly it should be done on the vfsmount, as a filter on the interaction
between userspace and kernel.
> There are many other instances where they do exactly as you say, but what
> I'm saying is that we don't want to lose the flexibility we currently have.
You don't really lose any flexibility; if anything, you gain it.
(Note that if your objection is that I haven't yet implemented the
ability to set namespaces arbitrarily in a container, that's on the list of
things to do that I included, as is adjusting the control groups.)
> All mount namespaces have an owning user namespace, so the data
> relations are already there in the kernel, is the problem simply
> finding them?
The superblocks used by the vfsmounts in a mount namespace aren't all
necessarily in the same user_ns, so none of:
sb->s_user_ns == current_user_ns()
sb->s_user_ns == current->ns->mnt_ns->user_ns
current->ns->mnt_ns->user_ns == current_user_ns()
need hold true that I can see.
> > These patches are built on top of the mount context patchset so that
> > namespaces can be properly propagated over submounts/automounts.
>
> I'll stop here ... you get the idea that I think this is imposing a set
> of restrictions that will come back to bite us later.
What restrictions am I imposing?
> If this is just for the sake of figuring out how to get keyring upcalls to
> work, then I'm sure we can come up with something.
No, it's not just for that, though, admittedly, all of the upcall mechanisms I
outlined use request_key() at the core.
Really, a container is an anchor for the resources you need to make an upcall,
but it can also be used to anchor other things.
One thing I've been asked for by a number of people is a per-container keyring
for the provision of authentication keys, fs decryption keys and other things
- but there's no actual container off which this can be hung.
Another thing that could be useful is a list of what device files a container
may access, so that we can allow limited mounting by the container root user
within the container.
Now these could be made into their own namespaces or added to one that already
exists - perhaps the mount namespace being the most logical.
David
On Tue, May 23, 2017 at 12:22 AM, Jeff Layton <[email protected]> wrote:
> On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
>> David Howells <[email protected]> writes:
>>
>> > Here are a set of patches to define a container object for the kernel and
>> > to provide some methods to create and manipulate them.
>> >
>> > The reason I think this is necessary is that the kernel has no idea how to
>> > direct upcalls to what userspace considers to be a container - current
>> > Linux practice appears to make a "container" just an arbitrarily chosen
>> > junction of namespaces, control groups and files, which may be changed
>> > individually within the "container".
>> >
>>
>> I think this might possibly be a useful abstraction for solving the
>> keyring upcalls if it was something created implicitly.
>>
>> fork_into_container for use by keyring upcalls is currently a security
>> vulnerability as it allows escaping all of a container's cgroups.  But
>> you have that on your list of things to fix. However you don't have
>> seccomp and a few other things.
>>
>> Before we had kthreadd in the kernel upcalls always had issues because
>> the code to reset all of the userspace bits and make the forked
>> task suitable for running upcalls was always missing some detail. It is
>> a very bug-prone kind of idiom that you are talking about. It is doubly
>> bug-prone because the wrongness is visible to userspace and as such
>> might become a frozen KABI guarantee.
>>
>> Let me suggest a concrete alternative:
>>
>> - At the time of mount, observe the mounter's user namespace.
>> - Find the mounter's pid namespace.
>> - If the mounter's pid namespace is owned by the mounter's user namespace
>>   walk up the pid namespace tree to the first pid namespace owned by
>>   that user namespace.
>> - If the mounter's pid namespace is not owned by the mounter's user
>>   namespace, fail the mount; it is going to need to make upcalls and
>>   that will not be possible.
>> - Hold a reference to the pid namespace that was found.
>>
>> Then when an upcall needs to be made fork a child of the init process
>> of the specified pid namespace. Or fail if the init process of the
>> pid namespace has died.
>>
>> That should always work and it does not require keeping expensive state
>> where we did not have it previously. Further because the semantics are
>> fork a child of a particular pid namespace's init as features get added
>> to the kernel this code remains well defined.
>>
>> For ordinary request-key upcalls we should be able to use the same rules
>> and just not save/restore things in the kernel.
>>
>
> OK, that does seem like a reasonable idea. Note that it's not just
> request-key upcalls here that we're interested in, but anything that
> we'd typically spawn from kthreadd otherwise.
Generalizing it will expose the kernel to exploits: today containers set
up mount namespaces for images from the net or for outdated filesystems,
and users just do it; it is easy. Having a kthread running inside such
contexts is not a good idea. Those are today's use cases.
> That said, I worry a little about this. If the init process does a setns
> at the wrong time, suddenly you're doing the upcall in different
> namespaces than you intended.
That init process, or whatever process is inside, owns that context and
its files.
Maybe for some cases it is better to use a userspace service that you can
talk to through a standard kernel bus endpoint to request a resource, as
is done within modern apps. The application at the other end then acts,
using kthread helpers in the appropriate context.
--
tixxdz
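For concreteness, the mount-time walk Eric suggests above might look
something like the following in the kernel. This is only a sketch of his
steps, and get_upcall_pidns() is an invented name, not an existing kernel
function:

	#include <linux/cred.h>
	#include <linux/err.h>
	#include <linux/pid_namespace.h>
	#include <linux/sched.h>

	static struct pid_namespace *get_upcall_pidns(void)
	{
		struct user_namespace *user_ns = current_user_ns();
		struct pid_namespace *pid_ns = task_active_pid_ns(current);

		/* If the mounter's pid namespace isn't owned by the
		 * mounter's user namespace, upcalls won't be possible. */
		if (pid_ns->user_ns != user_ns)
			return ERR_PTR(-EPERM);
		/* Walk up to the first pid namespace owned by that
		 * user namespace. */
		while (pid_ns->parent && pid_ns->parent->user_ns == user_ns)
			pid_ns = pid_ns->parent;
		return get_pid_ns(pid_ns);
	}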
On Tue, 2017-05-23 at 07:54 -0500, Eric W. Biederman wrote:
> Jeff Layton <[email protected]> writes:
>
> > On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
> > > David Howells <[email protected]> writes:
> > >
> > > > Here are a set of patches to define a container object for the kernel and
> > > > to provide some methods to create and manipulate them.
> > > >
> > > > The reason I think this is necessary is that the kernel has no idea how to
> > > > direct upcalls to what userspace considers to be a container - current
> > > > Linux practice appears to make a "container" just an arbitrarily chosen
> > > > junction of namespaces, control groups and files, which may be changed
> > > > individually within the "container".
> > > >
> > >
> > > I think this might possibly be a useful abstraction for solving the
> > > keyring upcalls if it was something created implicitly.
> > >
> > > fork_into_container for use by keyring upcalls is currently a security
> > > vulnerability as it allows escaping all of a container's cgroups.  But
> > > you have that on your list of things to fix. However you don't have
> > > seccomp and a few other things.
> > >
> > > Before we had kthreadd in the kernel upcalls always had issues because
> > > the code to reset all of the userspace bits and make the forked
> > > task suitable for running upcalls was always missing some detail. It is
> > > a very bug-prone kind of idiom that you are talking about. It is doubly
> > > bug-prone because the wrongness is visible to userspace and as such
> > > might become a frozen KABI guarantee.
> > >
> > > Let me suggest a concrete alternative:
> > >
> > > - At the time of mount, observe the mounter's user namespace.
> > > - Find the mounter's pid namespace.
> > > - If the mounter's pid namespace is owned by the mounter's user namespace
> > >   walk up the pid namespace tree to the first pid namespace owned by
> > >   that user namespace.
> > > - If the mounter's pid namespace is not owned by the mounter's user
> > >   namespace, fail the mount; it is going to need to make upcalls and
> > >   that will not be possible.
> > > - Hold a reference to the pid namespace that was found.
> > >
> > > Then when an upcall needs to be made fork a child of the init process
> > > of the specified pid namespace. Or fail if the init process of the
> > > pid namespace has died.
> > >
> > > That should always work and it does not require keeping expensive state
> > > where we did not have it previously. Further because the semantics are
> > > fork a child of a particular pid namespace's init as features get added
> > > to the kernel this code remains well defined.
> > >
> > > For ordinary request-key upcalls we should be able to use the same rules
> > > and just not save/restore things in the kernel.
> > >
> >
> > OK, that does seem like a reasonable idea. Note that it's not just
> > request-key upcalls here that we're interested in, but anything that
> > we'd typically spawn from kthreadd otherwise.
>
> General user mode helper *Nod*.
>
> > That said, I worry a little about this. If the init process does a setns
> > at the wrong time, suddenly you're doing the upcall in different
> > namespaces than you intended.
> >
> > Might it be better to use the init process of the container as the
> > template like you suggest, but snapshot its "context" at a particular
> > point in time instead?
> >
> > knfsd could do this when it's started, for instance...
>
> The danger of a snapshot in time is something important (like cgroup
> membership) might change.
>
This is also a problem with relying on the userland program to do a
setns() and whatnot to set itself up for running in the container. If
something is added that it doesn't know about you'll just end up
inheriting whatever kthreadd had. If we don't get that right, we can end
up giving userland a security hole.
> It might be necessary to have this be an opt-in. Perhaps even to the
> point of starting a dedicated kthreadd.
>
I think we could live with that in knfsd-land. We could spawn a kthreadd
thread whenever a new nfsd_net is created. Then we'd just need something
like call_usermodehelper that puts the task create request on the right
kthreadd list. Running one more thread in your containerized NFS server
shouldn't be too onerous, I wouldn't think.
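As a rough sketch of that idea (invented names, not actual nfsd code): a
forker task could be kept per nfsd_net, set up in the container's context,
with helper requests queued to it so that the children it forks inherit
that context:

	#include <linux/kthread.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/wait.h>

	struct nfsd_umh {
		struct task_struct	*task;
		wait_queue_head_t	wq;
		struct list_head	requests;
		spinlock_t		lock;
	};

	static int nfsd_umh_fn(void *data)
	{
		struct nfsd_umh *umh = data;

		while (!kthread_should_stop()) {
			wait_event_interruptible(umh->wq,
				!list_empty(&umh->requests) ||
				kthread_should_stop());
			/* dequeue requests here and fork the helper
			 * binaries from this task, so they pick up its
			 * namespaces and cgroups */
		}
		return 0;
	}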
Once we start getting into uses with keyrings and the like though, I'm
not sure how workable that would be.
> Right now I think we need to figure out what it will take to solve this
> in the kernel because I strongly suspect that solving this in userspace
> is a cop out and we really aren't providing enough information to
> userspace to run the helper in the proper context. And I strongly
> suspect that providing enough information from the kernel will be
> roughly equivalent to solving this in the kernel.
>
> The only big issue I have had with the suggestion of a dedicated thread
> in the past is the overhead something like that will bring with it.
>
Yes, I don't see how you can do these sorts of upcalls properly without
either more help from the kernel, or providing the kernel with enough
info to do it properly.
I don't quite get the arguments that have been made about loss of
flexibility either. The basic idea here is to communicate to the kernel
how a container is structured so that it can spawn processes inside of
it as necessary.
--
Jeff Layton <[email protected]>
On Tue, May 23, 2017 at 2:54 PM, Eric W. Biederman
<[email protected]> wrote:
> Jeff Layton <[email protected]> writes:
>
>> On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
>>> David Howells <[email protected]> writes:
>>>
>>> > Here are a set of patches to define a container object for the kernel and
>>> > to provide some methods to create and manipulate them.
>>> >
>>> > The reason I think this is necessary is that the kernel has no idea how to
>>> > direct upcalls to what userspace considers to be a container - current
>>> > Linux practice appears to make a "container" just an arbitrarily chosen
>>> > junction of namespaces, control groups and files, which may be changed
>>> > individually within the "container".
>>> >
>>>
>>> I think this might possibly be a useful abstraction for solving the
>>> keyring upcalls if it was something created implicitly.
>>>
>>> fork_into_container for use by keyring upcalls is currently a security
>>> vulnerability as it allows escaping all of a container's cgroups.  But
>>> you have that on your list of things to fix. However you don't have
>>> seccomp and a few other things.
>>>
>>> Before we had kthreadd in the kernel upcalls always had issues because
>>> the code to reset all of the userspace bits and make the forked
>>> task suitable for running upcalls was always missing some detail. It is
>>> a very bug-prone kind of idiom that you are talking about. It is doubly
>>> bug-prone because the wrongness is visible to userspace and as such
>>> might become a frozen KABI guarantee.
>>>
>>> Let me suggest a concrete alternative:
>>>
>>> - At the time of mount, observe the mounter's user namespace.
>>> - Find the mounter's pid namespace.
>>> - If the mounter's pid namespace is owned by the mounter's user namespace
>>>   walk up the pid namespace tree to the first pid namespace owned by
>>>   that user namespace.
>>> - If the mounter's pid namespace is not owned by the mounter's user
>>>   namespace, fail the mount; it is going to need to make upcalls and
>>>   that will not be possible.
>>> - Hold a reference to the pid namespace that was found.
>>>
>>> Then when an upcall needs to be made fork a child of the init process
>>> of the specified pid namespace. Or fail if the init process of the
>>> pid namespace has died.
>>>
>>> That should always work and it does not require keeping expensive state
>>> where we did not have it previously. Further because the semantics are
>>> fork a child of a particular pid namespace's init as features get added
>>> to the kernel this code remains well defined.
>>>
>>> For ordinary request-key upcalls we should be able to use the same rules
>>> and just not save/restore things in the kernel.
>>>
>>
>> OK, that does seem like a reasonable idea. Note that it's not just
>> request-key upcalls here that we're interested in, but anything that
>> we'd typically spawn from kthreadd otherwise.
>
> General user mode helper *Nod*.
>
>> That said, I worry a little about this. If the init process does a setns
>> at the wrong time, suddenly you're doing the upcall in different
>> namespaces than you intended.
>>
>> Might it be better to use the init process of the container as the
>> template like you suggest, but snapshot its "context" at a particular
>> point in time instead?
>>
>> knfsd could do this when it's started, for instance...
>
> The danger of a snapshot in time is something important (like cgroup
> membership) might change.
>
> It might be necessary to have this be an opt-in. Perhaps even to the
> point of starting a dedicated kthreadd.
>
> Right now I think we need to figure out what it will take to solve this
> in the kernel because I strongly suspect that solving this in userspace
> is a cop out and we really aren't providing enough information to
> userspace to run the helper in the proper context. And I strongly
> suspect that providing enough information from the kernel will be
> roughly equivalent to solving this in the kernel.
Maybe it depends on the case; a general approach can be too difficult
to handle, especially from the security point of view. Maybe it is better to
identify what operations need what context, and a userspace
service/proxy can act using kthreadd with the right context... at
least the shift to this model was made years ago in the mobile industry.
--
tixxdz
Aleksa Sarai <[email protected]> wrote:
> >> The reason I think this is necessary is that the kernel has no idea
> >> how to direct upcalls to what userspace considers to be a container -
> >> current Linux practice appears to make a "container" just an
> >> arbitrarily chosen junction of namespaces, control groups and files,
> >> which may be changed individually within the "container".
>
> Just want to point out that if the kernel APIs for containers massively
> change, then the OCI will have to completely rework how we describe containers
> (and so will all existing runtimes).
>
> Not to mention that while I don't like how hard it is (from a runtime
> perspective) to actually set up a container securely, there are undoubtedly
> benefits to having namespaces split out. The network namespace being separate
> means that in certain contexts you actually don't want to create a new network
> namespace when creating a container.
Yep, I quite agree.
However, certain things need to be made per-net namespace that *aren't*. DNS
results, for instance.
As an example, I could set up a client machine with two ethernet ports, set up
two DNS+NFS servers, each of which thinks it's called "foo.bar" and attach
each server to a different port on the client machine. Then I could create a
pair of containers on the client machine and route the network in each
container to a different port. Now there's a problem because the names of the
cached DNS records for each port overlap.
Further, the NFS idmapper needs to be able to direct its calls to the
appropriate network.
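To make the clash concrete: lookups from both containers funnel through
the same request_key() machinery under the same description, and the
kernel keeps its resolver results in a single, non-namespaced cache keyed
by name. A hedged userspace illustration (build with -lkeyutils):

	#include <keyutils.h>
	#include <stdio.h>

	int main(void)
	{
		/* Both containers ask for "foo.bar"; the cache keys
		 * entries by this description alone, so the answers for
		 * the two networks collide. */
		key_serial_t key = request_key("dns_resolver", "foo.bar",
					       NULL, KEY_SPEC_SESSION_KEYRING);
		if (key < 0)
			perror("request_key");
		else
			printf("resolved via key %d\n", key);
		return 0;
	}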
> I had some ideas about how you could implement bridging in userspace (as an
> unprivileged user, for rootless containers) but if you can't join namespaces
> individually then such a setup is not practically possible.
I'm not proposing to take away the ability to arbitrarily set the namespaces
in a container. I haven't implemented it yet, but it was on the to-do list:
(7) Directly set a container's namespaces to allow cross-container
sharing.
David
On Tue, May 23, 2017, at 10:30 AM, Djalal Harouni wrote:
>
> Maybe it depends on the case; a general approach can be too difficult
> to handle, especially from the security point of view. Maybe it is better to
> identify what operations need what context, and a userspace
> service/proxy can act using kthreadd with the right context... at
> least the shift to this model was made years ago in the mobile industry.
Why not drop the upcall model in favor of having userspace
monitor events via a (more efficient) protocol and react to them on its own?
It's just generally more flexible and avoids all of those issues like
replicating the seccomp configuration, etc.
Something like inotify/signalfd could be a precedent around having a read()/poll()able
fd. /proc/keys-requests ?
Then if you create a new user namespace, and open /proc/keys-requests, the
kernel will always write to that instead of calling /sbin/request-key.
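A sketch of what such a daemon might look like; /proc/keys-requests and
its record format are purely hypothetical here:

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[256];
		int fd = open("/proc/keys-requests", O_RDONLY); /* hypothetical */
		struct pollfd pfd = { .fd = fd, .events = POLLIN };

		if (fd < 0)
			return 1;
		while (poll(&pfd, 1, -1) > 0) {
			ssize_t n = read(fd, buf, sizeof(buf) - 1);
			if (n <= 0)
				break;
			buf[n] = '\0';
			/* parse the request and instantiate the key in
			 * whatever namespaces the daemon sees fit */
			printf("key request: %s\n", buf);
		}
		return 0;
	}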
On Tue, 2017-05-23 at 14:52 +0100, David Howells wrote:
> James Bottomley <[email protected]> wrote:
>
> > This sounds like a step in the wrong direction: the strength of the
> > current container interfaces in Linux is that people who set up
> > containers don't have to agree what they look like.
>
> It may be a strength, but it is also a problem.
>
> > So I can set up a user namespace without a mount namespace or an
> > architecture emulation container with only a mount namespace.
>
> (I presume you mean with only the mount namespace separate)
>
> Yep. You can do that with this too.
>
> > But ignoring my fun foibles with containers and to give a concrete
> > example in terms of a popular orchestration system: in kubernetes,
> > where certain namespaces are shared across pods, do you imagine the
> > kernel's view of the "container" to be the pod or what kubernetes
> > thinks of as the container?
>
> Why not both?  If the net_ns is created in the pod container, then
> probably network-related upcalls should be directed there.  Unless
> instructed otherwise, upon creation a container object will inherit
> the caller's namespaces.
The pod isn't a container, it's a collection of containers. Let's say
each container has a separate mount namespace but shares a network
namespace (this is a gross simplification, there are many other ways
you can set up a pod, but this one illustrates the point). For your
upcall you'd have to pick a kubernetes container and you don't have the
information to do that, even with your current patches, because of what
kubernetes has done. This is where your view of "container" doesn't
match the kubernetes view.
> > This is important, because half the examples you give below are
> > network related and usually pods share a network namespace.
>
> Yeah - I'm more familiar with upcalls made by NFS, AFS and keyrings.
OK, so rather than getting into the technical back and forth below can
we agree that the kernel can't have a unitary view of "container"
because the current use cases (the orchestration systems) don't have
one? Then the next step becomes how can we add an abstraction that
gives you what you want (as far as I can tell basically identifying a
set of namespaces for an upcall) in a way that doesn't bind the kernel
to have a unitary view of a container? And then we can tack the ideas
on to the Jeff/Eric subthread.
James
David Howells <[email protected]> writes:
> Aleksa Sarai <[email protected]> wrote:
>
>> >> The reason I think this is necessary is that the kernel has no idea
>> >> how to direct upcalls to what userspace considers to be a container -
>> >> current Linux practice appears to make a "container" just an
>> >> arbitrarily chosen junction of namespaces, control groups and files,
>> >> which may be changed individually within the "container".
>>
>> Just want to point out that if the kernel APIs for containers massively
>> change, then the OCI will have to completely rework how we describe containers
>> (and so will all existing runtimes).
>>
>> Not to mention that while I don't like how hard it is (from a runtime
>> perspective) to actually set up a container securely, there are undoubtedly
>> benefits to having namespaces split out. The network namespace being separate
>> means that in certain contexts you actually don't want to create a new network
>> namespace when creating a container.
>
> Yep, I quite agree.
>
> However, certain things need to be made per-net namespace that *aren't*. DNS
> results, for instance.
>
> As an example, I could set up a client machine with two ethernet ports, set up
>> two DNS+NFS servers, each of which thinks it's called "foo.bar" and attach
> each server to a different port on the client machine. Then I could create a
> pair of containers on the client machine and route the network in each
> container to a different port. Now there's a problem because the names of the
> cached DNS records for each port overlap.
Please look at ip netns add. It does solve this in userspace rather
simply.
> Further, the NFS idmapper needs to be able to direct its calls to the
> appropriate network.
Eric
Jessica Frazelle <[email protected]> wrote:
> Adding a container object seems a bit odd to me because there are so
> many different ways to make containers, aka not all namespaces are
> always used
This is already dealt with to some extent. It can create/inherit namespaces
like fork - except that you get an extra option (literally, create with no
mount namespace and create that when you provide a root mount).
Modifying the namespace subscriptions is on the to-do list.
> as well as not all cgroups,
cgroups are on the to-do list.
> various LSM objects sometimes apply,
I added a hook for the LSM to use.
> mounts blah blah blah.
You can mount into the container and you can create sockets in the container
from outside the container.
> The OCI spec
This?
https://github.com/opencontainers/runtime-spec/blob/master/README.md
> was made to cover all these things so why a kernel object?
Because there are some things the kernel doesn't do that it should (upcalling
into the correct namespace junction for example), and some things I've been
asked to add for which there's no clear place to do so.
> I don't exactly see a future where the container runtimes convert to this
> unless it covers all the same mods as the mods in the OCI spec, not saying
> it needs to abide by the spec, just saying it should allow all the same
> things.
I haven't looked at the OCI spec as yet.
Note that this is *not* a replacement for a container application. I'm not
trying to deprecate Docker or whatever. It's something for those container
applications to use.
> Which really just seems, imo, like a pain for the kernel to have to
> maintain.
Namespaces are a pain, particularly as lots of things exist in more than
one of them.
David
Eric W. Biederman <[email protected]> wrote:
> > As an example, I could set up a client machine with two ethernet ports,
> > set up two DNS+NFS servers, each of which thinks it's called "foo.bar"
> > and attach each server to a different port on the client machine. Then I
> > could create a pair of containers on the client machine and route the
> > network in each container to a different port. Now there's a problem
> > because the names of the cached DNS records for each port overlap.
>
> Please look at ip netns add.
warthog>man ip | grep setns
warthog1>
> It does solve this in userspace rather simply.
Ummm... How? The kernel DNS resolver is not namespace aware.
David
David Howells <[email protected]> writes:
> Eric W. Biederman <[email protected]> wrote:
>
>> > As an example, I could set up a client machine with two ethernet ports,
>> > set up two DNS+NFS servers, each of which think they're called "foo.bar"
>> > and attach each server to a different port on the client machine. Then I
>> > could create a pair of containers on the client machine and route the
>> > network in each container to a different port. Now there's a problem
>> > because the names of the cached DNS records for each port overlap.
>>
>> Please look at ip netns add.
>
> warthog>man ip | grep setns
> warthog1>
Not setns, netns.
>> It does solve this in userspace rather simply.
>
> Ummm... How? The kernel DNS resolver is not namespace aware.
But it works fine if called in the proper context and we have a de facto
standard for where to put all of the files (the tricky part) if you are
dealing with multiple network namespaces simultaneously.
Eric
David Howells <[email protected]> writes:
> Another thing that could be useful is a list of what device files a container
> may access, so that we can allow limited mounting by the container root user
> within the container.
That is totally not why that isn't allowed, and won't be allowed any
time soon.
The issue is that the filesystem implementations in the kernel are not
prepared to handle hostile filesystem data structures, so that is
the definition of a kernel exploit. The attack surface of the kernel
gets quite a bit larger in that case.
Perhaps if all of the filesystems' data structures had an hmac on them we
could allow something like this.
Once we can make it safe it is easy to add an appropriate interface. We
most definitely don't need a ``container'' data structure in the kernel
to do that.
A completely unprivileged fuse is much more likely to work for this use
case.
And we do already have the device cgroup, which sort of does
this.
Eric
Colin Walters <[email protected]> wrote:
> Why not drop the upcall model in favor of having userspace monitor events
> via a (more efficient) protocol and react to them on its own?
(1) That's not necessarily more efficient. You now have the overhead of a
permanently running userspace daemon in every relevant namespace
combination.
(2) You then have to work out how to route to the appropriate daemon.
> It's just generally more flexible
Actually, it's less flexible. You can't easily get at the caller's
namespaces.
> and avoids all of those issues like replicating the seccomp configuration,
> etc.
So does my container implementation.
> Something like inotify/signalfd could be a precedent around having a read()/poll()able
> fd. /proc/keys-requests ?
>
> Then if you create a new user namespace, and open /proc/keys-requests, the
> kernel will always write to that instead of calling /sbin/request-key.
That's not good enough. You're basically making it one daemon per user
namespace and ignoring all the other namespaces.
[Also note that the kernel would have to paste a temporary authorisation key
into the daemon's session keyring for each key that requires instantiation].
David
On Tue, 2017-05-23 at 10:54 -0400, Colin Walters wrote:
> On Tue, May 23, 2017, at 10:30 AM, Djalal Harouni wrote:
> >
> > Maybe it depends on the case; a general approach can be too difficult
> > to handle, especially from the security point of view. Maybe it is better to
> > identify what operations need what context, and a userspace
> > service/proxy can act using kthreadd with the right context... at
> > least the shift to this model was made years ago in the mobile industry.
>
> Why not drop the upcall model in favor of having userspace
> monitor events via a (more efficient) protocol and react to them on its own?
> It's just generally more flexible and avoids all of those issues like
> replicating the seccomp configuration, etc.
>
> Something like inotify/signalfd could be a precedent around having a read()/poll()able
> fd. /proc/keys-requests ?
>
> Then if you create a new user namespace, and open /proc/keys-requests, the
> kernel will always write to that instead of calling /sbin/request-key.
Case in point:
nfsdcltrack was originally nfsdcld, a long running daemon that used
rpc_pipefs to talk to the kernel. That meant that you had to make sure
it gets enabled by systemd (or sysvinit, etc). If it dies, then you also
want to ensure that it gets restarted lest the kernel server hang,
etc...
It was pretty universally hated, as it was just one more daemon that you
needed to run to have a working nfs server. So, I was encouraged to
switch it to a call_usermodehelper upcall and since then it has just
worked, as long as the binary is installed.
It's quite easy to say:
"You're doing it wrong. You just need to run all of these services as
long-running daemons."
But, that ignores the fact that handling long-running daemons for
infrequently used upcalls is actually quite painful to manage in
practice.
--
Jeff Layton <[email protected]>
On Tue, May 23, 2017, at 11:31 AM, Jeff Layton wrote:
>
> nfsdcltrack was originally nfsdcld, a long running daemon that used
> rpc_pipefs to talk to the kernel. That meant that you had to make sure
> it gets enabled by systemd (or sysvinit, etc). If it dies, then you also
> want to ensure that it gets restarted lest the kernel server hang,
> etc...
>
> It was pretty universally hated, as it was just one more daemon that you
> needed to run to have a working nfs server. So, I was encouraged to
> switch it to a call_usermodehelper upcall and since then it has just
> worked, as long as the binary is installed.
Note that with the "read()/write() fd" model you don't need
a whole process just to do that...the functionality could be rolled into systemd
or equivalent easily enough.
> "You're doing it wrong. You just need to run all of these services as
> long-running daemons."
Also, I imagine we could figure out a clean model to do *activation*
from kernel -> userspace too. systemd's socket activation model
where pid 1 activates units on demand is quite nice and obviates
the need to configure things in advance.
David Howells <[email protected]> writes:
> Here are a set of patches to define a container object for the kernel and
> to provide some methods to create and manipulate them.
Just so this discussion has some clarity.
Nacked-by: "Eric W. Biederman" <[email protected]>
As a user-visible entity, I see nothing this container data structure
helps solve; it only muddies the waters and makes things more brittle.
Embracing the complexity of namespaces head on tends to mean all of the
goofy scary semantic corner cases are visible from the first version of
the design, and so developers can't take short cuts that result in
buggy kernel code that persists for decades. I am rather tired of
finding and fixing those.
Eric
On Tue, 2017-05-23 at 10:17 -0500, Eric W. Biederman wrote:
> David Howells <[email protected]> writes:
> > Eric W. Biederman <[email protected]> wrote:
> > > It does solve this in userspace rather simply.
> >
> > Ummm... How? The kernel DNS resolver is not namespace aware.
>
> But it works fine if called in the proper context and we have a
> de facto standard for where to put all of the files (the tricky part)
> if you are dealing with multiple network namespaces simultaneously.
I think you're missing each other's points slightly.
What David is pointing out is that the kernel has a DNS cache
(net/dns_resolver/) that can do name-to-IP translations but isn't
namespaced. Once it has an entry, all containers would see it if they
cause a lookup to go through the kernel cache, so going through the
cache you can't have a name resolving to different IP addresses on a
per container basis.
I think Eric's point is that if you need the same DNS names resolving
to different IP addresses on a per container basis, you can do this in
userspace today but you have to disable the in-kernel DNS cache.
James
Eric W. Biederman <[email protected]> wrote:
> Let me suggest a concrete alternative:
>
> - At the time of mount, observe the mounter's user namespace.
Looking at sget(), I don't think a mounter can see a superblock outside of
their namespace. There is something icky in there whereby all automounts are
currently transferred into the init_user_ns though (something to fix in my
mount-context series) :-/
> - Find the mounter's pid namespace.
> - If the mounter's pid namespace is owned by the mounter's user namespace
>   walk up the pid namespace tree to the first pid namespace owned by
>   that user namespace.
> - If the mounter's pid namespace is not owned by the mounter's user
>   namespace, fail the mount; it is going to need to make upcalls and
>   that will not be possible.
Take the following scenario:
(1) Create a process with a new network namespace. Set up the network to
route out of ethernet port 1.
(2) Create a child process with new network and user namespaces. Set up the
network to route out of ethernet port 2.
(3) Mount an NFS volume in the process created in (2).
The mount in (3) will fail unconditionally.
> - Hold a reference to the pid namespace that was found.
Take the following scenario:
(1) Create a process with new network and pid namespaces. Set up the network
to route out of ethernet port 1.
(2) Create a child process with new network and pid namespaces. Set up the
network to route out of ethernet port 2.
(3) Mount an NFS volume in the process created in (2).
(4) Create another child process with new network and pid namespaces. Set up
the network to route out of ethernet port 3.
(5) In the process created in (4), access the NFS volume created in (3).
The user namespace is the same all the way through.
Now you're holding a ref to the pid namespace created in (1) - but that is of
no use to you. The upcall must take place in the network namespace that
routes out through port 2.
David
James Bottomley <[email protected]> wrote:
> What David is pointing out is that the kernel has a DNS cache
> (net/dns_resolver/) that can do name-to-IP translations but isn't
> namespaced. Once it has an entry, all containers would see it if they
> cause a lookup to go through the kernel cache, so going through the
> cache you can't have a name resolving to different IP addresses on a
> per container basis.
Yes - and the transport to userspace, the request_key() upcall, isn't
namespaced either. Namespacing it isn't entirely simple since we have to set
the right mount namespace (for execve, config, etc.), plus any other relevant
namespaces (such as network) - which is dependent on key type.
I can't record the mount namespace in the network namespace because that would
create a dependency loop:
mnt_ns -> mnt -> sb -> net_ns -> mnt_ns
> I think Eric's point is that if you need the same DNS names resolving
> to different IP addresses on a per container basis, you can do this in
> userspace today but you have to disable the in-kernel DNS cache.
You could disable the in-kernel dns resolver in your config, but then you
don't get referrals in NFS. Also, CIFS, AFS and other filesystems would be
affected. If you're fine with the restrictions, then there is no problem.
David
David Howells <[email protected]> writes:
> James Bottomley <[email protected]> wrote:
>
>> What David is pointing out is that the kernel has a DNS cache
>> (net/dns_resolver/) that can do name-to-IP translations but isn't
>> namespaced. Once it has an entry, all containers would see it if they
>> cause a lookup to go through the kernel cache, so going through the
>> cache you can't have a name resolving to different IP addresses on a
>> per container basis.
>
> Yes - and the transport to userspace, the request_key() upcall, isn't
> namespaced either. Namespacing it isn't entirely simple since we have to set
> the right mount namespace (for execve, config, etc.), plus any other relevant
> namespaces (such as network) - which is dependent on key type.
>
> I can't record the mount namespace in the network namespace because that would
> create a dependency loop:
>
> mnt_ns -> mnt -> sb -> net_ns -> mnt_ns
I have already given a concrete suggest on how this might be untangled.
So I won't repeat it here.
>> I think Eric's point is that if you need the same DNS names resolving
>> to different IP addresses on a per container basis, you can do this in
>> userspace today but you have to disable the in-kernel DNS cache.
>
> You could disable the in-kernel dns resolver in your config, but then you
> don't get referrals in NFS. Also, CIFS, AFS and other filesystems would be
> affected. If you're fine with the restrictions, then there is no
> problem.
I haven't been arguing that at all. I was only pointing out that this
issue is not an issue with DNS. Userspace handles this all fine.
The issue is exclusively with this request_key api and generally user
mode upcalls.
I have no problem seeing that there is an issue with the kernel code.
I am well aware of the problem. Unfortunately the people who cared
enough to start addressing this have not been able to write kernel
code that fixes this.
My personal experience when I tried to use the request_key api at
the beginning of this was that it was too hard to test. There was no room
for goofing up, as at that time it was impossible to invalidate a cached
reply from userspace if you happened to know it was wrong. That meant
that if something incorrect was cached, it required rebooting the kernel.
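(As an aside, invalidating a bad cached reply from userspace has since
become possible: KEYCTL_INVALIDATE went into Linux 3.5 and is exposed by
libkeyutils, as sketched below.)

	#include <keyutils.h>

	/* Drop a known-bad cached entry, forcing a fresh upcall on the
	 * next lookup. */
	long drop_bad_key(key_serial_t key)
	{
		return keyctl_invalidate(key);
	}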
I have a lot of sympathy with the view that the best way to do
some of this is with socket activation or perhaps something with rpc
portmapper, where something like inetd is used to start the user space
component on-demand. I won't call that a solution to this case but I do
think it makes a good example to compare with.
When you need to run something in a clean context, having that something
only need to worry about the contents of the data it is receiving and
not about its environment, as suid applications must, is a nice
simplification.
The entire user mode helper paradigm removes from user space the freedom
to specify what context its code should run in.  In a world where
everything is global that is fine. But in a world with containers where
not everything is global it becomes a royal pain.
And I am very very sympathetic to solving this. The only solution that
I know would work is to capture the context at some point in a process
and then to use that process to fork user mode helpers.
So far no one has even bothered to seriously try the one solution that
is guaranteed to work because it takes a lot of changes to kernel code.
I believe the last effort snagged on what a pain it is to refactor the
user mode helper infrastructure.
I don't see in your code any of that work.
I am glad to see that you also see the problem. At least when it comes
to the request_key api.
What I am hoping to see is someone who has the will to dig in and
understand all of the interactions and refactor the kernel to solve
the problem.
This is not a case where our user space interfaces are preventing a
solution to this problem (as your patchset implies). This is a case
where things need to be refactored kernel side to solve this.
So far this attempt is just another in the bazillion or so bad
half-assed attempts to solve this problem I have seen over the years.
Eric
On Wed, 2017-05-24 at 03:26 -0500, Eric W. Biederman wrote:
>
> So far no one has even bothered to seriously try the one solution that
> is guaranteed to work because it takes a lot of changes to kernel code.
> I believe the last effort snagged on what a pain it is to refactor the
> user mode helper infrastructure.
Yes, that's mostly true in my case; although I wouldn't say I haven't looked at
it seriously, equally I haven't got anything towards it yet either, sorry.
I'm likely going to revisit this based on a couple of approaches.
One is just what you describe and I had already been looking at this some time
ago. It seems to me that adding a work queue type that starts and retains a
process until the work queue is destroyed (similar to the way the work queue
subsystem starts a failover thread for use under resource exhaustion) would be a
sensible way to do it.
This doesn't mean I think it's a good idea for reasons I've outlined in the past
but the approach does warrant the effort to work out if it can be used without
problems.
And there's also the request key infrastructure which, as it is now, gets in the
road of verifying results, *sigh*.
Ian
On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
> David Howells <[email protected]> writes:
> 
> > Here are a set of patches to define a container object for the kernel and
> > to provide some methods to create and manipulate them.
> > 
> > The reason I think this is necessary is that the kernel has no idea how to
> > direct upcalls to what userspace considers to be a container - current
> > Linux practice appears to make a "container" just an arbitrarily chosen
> > junction of namespaces, control groups and files, which may be changed
> > individually within the "container".
> > 
> 
> I think this might possibly be a useful abstraction for solving the
> keyring upcalls if it was something created implicitly.
> 
> fork_into_container for use by keyring upcalls is currently a security
> vulnerability as it allows escaping all of a container's cgroups.  But
> you have that on your list of things to fix.  However you don't have
> seccomp and a few other things.
> 
> Before we had kthreadd in the kernel upcalls always had issues because
> the code to reset all of the userspace bits and make the forked
> task suitable for running upcalls was always missing some detail.  It is
> a very bug-prone kind of idiom that you are talking about.  It is doubly
> bug-prone because the wrongness is visible to userspace and as such
> might become a frozen KABI guarantee.
> 
> Let me suggest a concrete alternative:
> 
> - At the time of mount, observe the mounter's user namespace.
> - Find the mounter's pid namespace.
> - If the mounter's pid namespace is owned by the mounter's user namespace
>   walk up the pid namespace tree to the first pid namespace owned by
>   that user namespace.
> - If the mounter's pid namespace is not owned by the mounter's user
>   namespace, fail the mount; it is going to need to make upcalls and
>   that will not be possible.
> - Hold a reference to the pid namespace that was found.
> 
> Then when an upcall needs to be made fork a child of the init process
> of the specified pid namespace.  Or fail if the init process of the
> pid namespace has died.
> 
> That should always work and it does not require keeping expensive state
> where we did not have it previously.  Further because the semantics are
> fork a child of a particular pid namespace's init as features get added
> to the kernel this code remains well defined.
> 
> For ordinary request-key upcalls we should be able to use the same rules
> and just not save/restore things in the kernel.
> 
> A huge advantage of my alternative (other than not being a bit-rot
> magnet) is that it should drop into existing container infrastructure
> without problems.  The rule for container implementors is simple: to use
> security key infrastructure you need to have created a pid namespace in
> your user namespace.
> 
While this may be part of a solution, I don't see how it can deal with
issues such as the need to set up an RPCSEC_GSS session on behalf of
the user. The issue there is that while the mount may have been created
in a parent namespace, the actual call to kinit to set up a kerberos
context is likely to have been made inside the container. It may even
have been done using a completely separate net namespace. So in order
to set up my RPCSEC_GSS session, I may need to do so from inside the
user container.
In that kind of environment, might it perhaps make sense to just allow
an upcall executable running in the root init namespace to tunnel
through (using setns()) so it can actually execute in the context of
the child container? That would keep security policy with the init
namespace, but would also ensure that the container environment rules
may be applied if and when appropriate.
In addition to today's upcall mechanism, we would need the ability to
pass in the nsproxy (and root directory) for the confined process that
triggered the upcall and/or the namespace for the mountpoint. I'm
assuming that could be done by passing in a file descriptor to the
appropriate /proc entries?
The downside of an approach like this is that it requires container
awareness in the upcall executables themselves. If the executables
don't know what they are doing, they could end up leaking information
from the init namespace to the process running in the container via the
keyring.
Cheers
  Trond
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
On Sat, 2017-05-27 at 17:45 +0000, Trond Myklebust wrote:
> On Mon, 2017-05-22 at 14:04 -0500, Eric W. Biederman wrote:
> > David Howells <[email protected]> writes:
> >
> > > Here are a set of patches to define a container object for the
> > > kernel and
> > > to provide some methods to create and manipulate them.
> > >
> > > The reason I think this is necessary is that the kernel has no
> > > idea
> > > how to
> > > direct upcalls to what userspace considers to be a container -
> > > current
> > > Linux practice appears to make a "container" just an arbitrarily
> > > chosen
> > > junction of namespaces, control groups and files, which may be
> > > changed
> > > individually within the "container".
> > >
> >
> > I think this might possibly be a useful abstraction for solving the
> > keyring upcalls if it was something created implicitly.
> >
> > fork_into_container for use by keyring upcalls is currently a
> > security
> > vulnerability as it allows escaping all of a container's cgroups.
> > But
> > you have that on your list of things to fix. However you don't
> > have
> > seccomp and a few other things.
> >
> > Before we had kthreadd in the kernel upcalls always had issues
> > because
> > the code to reset all of the userspace bits and make the forked
> > task suitable for running upcalls was always missing some detail.
> > It
> > is
> > a very bug-prone kind of idiom that you are talking about. It is
> > doubly
> > bug-prone because the wrongness is visible to userspace and as such
> > might become a frozen KABI guarantee.
> >
> > Let me suggest a concrete alternative:
> >
> > - At the time of mount, observe the mounter's user namespace.
> > - Find the mounter's pid namespace.
> > - If the mounter's pid namespace is owned by the mounter's user
> >   namespace, walk up the pid namespace tree to the first pid
> >   namespace owned by that user namespace.
> > - If the mounter's pid namespace is not owned by the mounter's user
> >   namespace, fail the mount; it is going to need to make upcalls and
> >   that will not be possible.
> > - Hold a reference to the pid namespace that was found.
> >
> > Then when an upcall needs to be made fork a child of the init
> > process
> > of the specified pid namespace. Or fail if the init process of the
> > pid namespace has died.
> >
> > That should always work and it does not require keeping expensive
> > state
> > where we did not have it previously. Further because the semantics
> > are
> > fork a child of a particular pid namespace's init as features get
> > added
> > to the kernel this code remains well defined.
> >
> > For ordinary request-key upcalls we should be able to use the same
> > rules
> > and just not save/restore things in the kernel.
> >
> > A huge advantage of my alternative (other than not being a bit-rot
> > magnet) is that it should drop into existing container
> > infrastructure
> > without problems. The rule for container implementors is simple: to
> > use security key infrastructure you need to have created a pid
> > namespace in your user namespace.
> >
>
> While this may be part of a solution, I don't see how it can deal
> with issues such as the need to set up an RPCSEC_GSS session on
> behalf of the user. The issue there is that while the mount may have
> been created in a parent namespace, the actual call to kinit to set
> up a kerberos context is likely to have been made inside the
> container. It may even have been done using a completely separate net
> namespace. So in order to set up my RPCSEC_GSS session, I may need to
> do so from inside the user container.
So perhaps the way to deal with this is to have a dynamic upcall
interface where you're expected to write the path to the upcall binary
(the initial upcall would be grandfathered to the root namespaces).
For a container, we could make this capture the nsproxy at time of
write, meaning that as long as the orchestration system sets up
everything it wants and then writes the upcall binary, we always know
the namespace environment to execute it in (we'll have to hunt for a
parallel method for doing this for cgroups). The in-kernel subsystem
executing the upcall would have to be aware there were multiple
possible ones and know how to look for the one it needs based on
triggering parameters (likely net ns). We'd probably have to tie the
lifetime of the nsproxy to the mount ns, so it would be destroyed and
removed from the upcall list as soon as the mount ns goes away.
The great thing about this is that the kernel makes no assumptions at
all about what the environment is: the orchestration system tells it
when it's ready, i.e. when it has built all the necessary OS
virtualizations.
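A hedged sketch of that capture-at-write step; struct upcall_reg and
register_upcall() are invented names, not existing kernel interfaces:

	#include <linux/nsproxy.h>
	#include <linux/sched.h>
	#include <linux/string.h>

	struct upcall_reg {
		char		path[256];
		struct nsproxy	*ns;	/* pinned at write time */
	};

	static int register_upcall(struct upcall_reg *reg, const char *path)
	{
		if (strscpy(reg->path, path, sizeof(reg->path)) < 0)
			return -E2BIG;
		get_nsproxy(current->nsproxy);	/* take a reference */
		reg->ns = current->nsproxy;
		return 0;
	}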
> In that kind of environment, might it perhaps make sense to just
> allow an upcall executable running in the root init namespace to
> tunnel through (using setns()) so it can actually execute in the
> context of the child container? That would keep security policy with
> the init namespace, but would also ensure that the container
> environment rules may be applied if and when appropriate.
So I think having the container tell you when it has constructed the
upcall container, by writing the upcall path, does all this for you.
> In addition to today's upcall mechanism, we would need the ability to
> pass in the nsproxy (and root directory) for the confined process
> that triggered the upcall and/or the namespace for the mountpoint.
> I'm assuming that could be done by passing in a file descriptor to
> the appropriate /proc entries?
OK, so the proposed approach does this too by capturing the nsproxy at
the moment you declare the upcall path for the container.
> The downside of an approach like this is that it requires container
> awareness in the upcall executables themselves. If the executables
> don't know what they are doing, they could end up leaking information
> from the init namespace to the process running in the container via
> the keyring.
This would depend on security policy. Right at the moment, with the
proposed nsproxy capture I think if we don't find a registered upcall,
we do have to execute the root one (because that's what we do today)
meaning the upcall binary in the host has to be container aware. I
don't think any of the container upcalls have to be.
The only remaining problem is how does the container orchestration
system know which upcalls it is supposed to be containerising ... this
sounds like a full list we need out of the kernel and some sort of
metadata on the container creator.
James
On 2017-05-22 17:22, David Howells wrote:
> A container is then a kernel object that contains the following things:
>
> (1) Namespaces.
>
> (2) A root directory.
>
> (3) A set of processes, including one designated as the 'init' process.
>
> A container is created and attached to a file descriptor by:
>
> int cfd = container_create(const char *name, unsigned int flags);
>
> this inherits all the namespaces of the parent container unless otherwise
> the mask calls for new namespaces.
>
> CONTAINER_NEW_FS_NS
> CONTAINER_NEW_EMPTY_FS_NS
> CONTAINER_NEW_CGROUP_NS [root only]
> CONTAINER_NEW_UTS_NS
> CONTAINER_NEW_IPC_NS
> CONTAINER_NEW_USER_NS
> CONTAINER_NEW_PID_NS
> CONTAINER_NEW_NET_NS
>
> Other flags include:
>
> CONTAINER_KILL_ON_CLOSE
> CONTAINER_CLOSE_ON_EXEC
Hi David,
I wanted to respond to this thread to attempt some constructive feedback,
better late than never. I had a look at your fsopen/fsmount() patchset(s) to
support this patchset which was interesting, but doesn't directly affect my
work. The primary patch of interest to the audit kernel folks (Paul Moore and
me) is this patch, while the rest of the patchset is interesting but not likely
to directly affect us. This patch has most of what we need to solve our
problem.
Paul and I agree that audit is going to have a difficult time identifying
containers or even namespaces without some change to the kernel. The audit
subsystem in the kernel needs at least a basic clue about which container
caused an event to be able to report this at the appropriate level and ignore
it at other levels to avoid a DoS.
We also agree that there will need to be some sort of trigger from userspace to
indicate the creation of a container and its allocated resources and we're not
really picky how that is done, such as a clone flag, a syscall or a sysfs write
(or even a read, I suppose), but there will need to be some permission
restrictions, obviously. (I'd like to see capabilities used for this by adding
a specific container bit to the capabilities bitmask.)
I doubt we will be able to accommodate all definitions or concepts of a
container in a timely fashion. We'll need to start somewhere with a minimum
definition so that we can get traction and actually move forward before another
compelling shared kernel microservice method leaves our entire community
behind. I'd like to declare that a container is a full set of cloned
namespaces, but this is inefficient, overly constricting and unnecessary for
our needs. If we could agree on a minimum definition of a container (which may
have only one specific cloned namespace) then we have something on which to
build. I could even see a container being defined by a trigger sent from
userspace about a process (task) from which all its children are considered to
be within that container, subject to further nesting.
In the simplest usable model for audit, if a container (as the definition implies)
starts a PID namespace, then the container ID could simply be the container's
"init" process PID in the initial PID namespace. This assumes that as soon as
that process vanishes, that entire container and all its children are killed
off (which you've done). There may be some container orchestration systems
that don't use a unique PID namespace per container and that imposing this will
cause them challenges.
If containers have at minimum a unique mount namespace then the root path
dentry inode device and inode number could be used, but there are likely better
identifiers. Again, there may be container orchestrators that don't use a
unique mount namespace per container and that imposing this will cause
challenges.
I expect there are similar examples for each of the other namespaces.
If we could pick one namespace type for consensus for which each container has
a unique instance of that namespace, we could use the dev/ino tuple from that
namespace as had originally been suggested by Aristeu Rozanski more than 4
years ago as part of the set of namespace IDs. I had also attempted to
solve this problem by using the namespace' proc inode, then switched over to
generate a unique kernel serial number for each namespace and then went back to
namespace proc dev/ino once Al Viro implemented nsfs:
v1 https://lkml.org/lkml/2014/4/22/662
v2 https://lkml.org/lkml/2014/5/9/637
v3 https://lkml.org/lkml/2014/5/20/287
v4 https://lkml.org/lkml/2014/8/20/844
v5 https://lkml.org/lkml/2014/10/6/25
v6 https://lkml.org/lkml/2015/4/17/48
v7 https://lkml.org/lkml/2015/5/12/773
These patches don't use a container ID, but track all namespaces in use for an
event. This has the benefit of punting this tracking to userspace for some
other tool to analyse and determine to which container an event belongs.
This will use a lot of bandwidth in audit log files, whereas a single
container ID that doesn't require nesting information to be complete
would be a much more efficient use of audit log bandwidth.
If we rely only on the setting of arbitrary container names from userspace,
then we must provide a map or tree back to the initial audit domain for
that running kernel to be able to differentiate between potentially
identical container names assigned in a nested container system. If we
instead assign a container serial number sequentially (atomic64_inc) from
the kernel on request from userspace, like the sessionID, and log the
creation with all nsIDs and the parent container serial number and/or
container name, the nesting is clear because potentially duplicate names
are no longer ambiguous. If a container serial number is used, the tree of
inheritance of nested containers can be rebuilt from the audit records
showing which containers were spawned from which parent.
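A hedged sketch of the kernel side of that suggestion (the serial field and
helper name are invented for illustration; nothing like this exists in
David's patch):

#include <linux/atomic.h>
#include <linux/container.h>

/* Illustrative only: a never-reused, monotonically increasing container
 * serial number, allocated on request from userspace. */
static atomic64_t container_serial_seq;

static void container_assign_serial(struct container *c)
{
	/* ->serial is a hypothetical field, not in the posted struct. */
	c->serial = atomic64_inc_return(&container_serial_seq);
	/* The creation audit record would log c->serial, the parent's
	 * serial and all namespace IDs, so the nesting tree can be
	 * rebuilt from the log. */
}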
As was suggested in one of the previous threads, if there are any events
not associated with a task (e.g. incoming network packets), we log the
namespace ID and then only concern ourselves with its container serial
number or container name once it becomes associated with a task, at which
point that tracking will be more important anyway.
I'm not convinced that a userspace- or kernel-generated UUID is that
useful, since UUIDs are large, not human-readable and may not be globally
unique given the "pets vs cattle" direction we are going, with potentially
identical conditions in hosts or containers spawning containers, but I see
no need to restrict them.
How do we deal with setns()? Once it is determined that the action is
permitted, given the new combination of namespaces and potential membership
in a different container, record the transition from one container to
another, including all namespaces if the latter are a different subset than
the target container's initial set.
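A rough sketch of where such a transition record might be emitted (purely
illustrative; the hook, the message and the serial field are all invented
for this example):

#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/container.h>

/* Hypothetical hook: called after a successful setns() if the task's
 * namespace set no longer matches its current container's initial set. */
static void audit_log_container_transition(struct task_struct *tsk,
					   struct container *from,
					   struct container *to)
{
	pr_info("audit: task %d moved from container %llu to %llu\n",
		task_pid_nr(tsk),
		(unsigned long long)from->serial,
		(unsigned long long)to->serial);
	/* A real record would also list the namespace IDs from
	 * tsk->nsproxy so the subset comparison is reproducible. */
}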
David, this patch of yours provides most of what we need, but there is a
danger that some compromises (complete freedom over which namespaces to
clone) will make it unusable for our needs unless other mechanisms are
added (an internal container serial number).
To answer Andy's inevitable question: We want to be able to attribute audit
events, whether they are generated by userspace or by a kernel event, to a
specific container. Since the kernel has no concept of a container, it needs
at least a rudimentary one to be able to track activity of kernel objects,
similar to what is already done with the loginuid (auid) and sessionid,
neither of which is a kernel concept, but the kernel keeps track of them as
a service
to userspace. We are able to track activity by task, but we don't know when
that task or its namespaces (both resources) were allocated to a nebulous
"container". This resource tracking is required for security
certifications.
Thanks.
> Note that I've added a pointer to the current container to task_struct.
> This doesn't make the nsproxy pointer redundant as you can still make new
> namespaces with clone().
>
> I've also added a list_head to task_struct to form a list in the container
> of its member processes. This is convenient, but redundant since the code
> could iterate over all the tasks looking for ones that have a matching
> task->container.
>
>
> ==================
> FUTURE DEVELOPMENT
> ==================
>
> (1) Setting up the container.
>
> It should then be possible for the supervising process to modify the
> new container by:
>
> container_mount(int cfd,
> const char *source,
> const char *target, /* NULL -> root */
> const char *filesystemtype,
> unsigned long mountflags,
> const void *data);
> container_chroot(int cfd, const char *path);
> container_bind_mount_across(int cfd,
> const char *source,
> const char *target); /* NULL -> root */
> mkdirat(int cfd, const char *path, mode_t mode);
> mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> int fd = openat(int cfd, const char *path,
> unsigned int flags, mode_t mode);
> int fd = container_socket(int cfd, int domain, int type,
> int protocol);
>
> Opening a netlink socket inside the container should allow management
> of the container's network namespace.
>
> (2) Starting the container.
>
> Once all modifications are complete, the container's 'init' process
> can be started by:
>
> fork_into_container(int cfd);
>
> This precludes further external modification of the mount tree within
> the container. Before this point, the container is simply destroyed
> if the container fd is closed.
>
> (3) Waiting for the container to complete.
>
> The container fd can then be polled to wait for init process therein
> to complete and the exit code collected by:
>
> container_wait(int container_fd, int *_wstatus, unsigned int wait,
> struct rusage *rusage);
>
> The container and everything in it can be terminated or killed off:
>
> container_kill(int container_fd, int initonly, int signal);
>
> If 'init' dies, all other processes in the container are preemptively
> SIGKILL'd by the kernel.
>
> By default, if the container is active and its fd is closed, the
> container is left running and will be cleaned up when its 'init' exits.
> The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
>
> (4) Supervising the container.
>
> Given that we have an fd attached to the container, we could make it
> such that the supervising process could monitor and override EPERM
> returns for mount and other privileged operations within the
> container.
>
> (5) Device restriction.
>
> Containers could come with a list of device IDs that the container is
> allowed to open. Perhaps a list of major numbers, each with a bitmap of
> permitted minor numbers.
>
> (6) Per-container keyring.
>
> Each container could be given a per-container keyring for the holding
> of integrity keys and filesystem keys. This list would be only
> modifiable by the container's 'root' user and the supervisor process:
>
> container_add_key(const char *type, const char *description,
> const void *payload, size_t plen,
> int container_fd);
>
> The keys on the keyring would, however, be accessible/usable by all
> processes within the container.
>
>
> ===============
> EXAMPLE PROGRAM
> ===============
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/wait.h>
>
> #define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
> #define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
> #define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace [priv] */
> #define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
> #define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
> #define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
> #define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
> #define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
> #define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
> #define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
> #define CONTAINER__FLAG_MASK 0x000003ff
>
> static inline int container_create(const char *name, unsigned int mask)
> {
> return syscall(333, name, mask, 0, 0, 0);
> }
>
> static inline int fork_into_container(int containerfd)
> {
> return syscall(334, containerfd);
> }
>
> int main()
> {
> pid_t pid;
> int fd, ws;
>
> fd = container_create("foo-test",
> CONTAINER__FLAG_MASK & ~(
> CONTAINER_NEW_EMPTY_FS_NS |
> CONTAINER_NEW_CGROUP_NS));
> if (fd == -1) {
> perror("container_create");
> exit(1);
> }
>
> system("cat /proc/containers");
>
> switch ((pid = fork_into_container(fd))) {
> case -1:
> perror("fork_into_container");
> exit(1);
> case 0:
> close(fd);
> setenv("PS1", "container>", 1);
> execl("/bin/bash", "bash", NULL);
> perror("execl");
> exit(1);
> default:
> if (waitpid(pid, &ws, 0) < 0) {
> perror("waitpid");
> exit(1);
> }
> }
> close(fd);
> exit(0);
> }
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> fs/namespace.c | 5
> include/linux/container.h | 85 ++++++
> include/linux/init_task.h | 4
> include/linux/lsm_hooks.h | 21 +
> include/linux/sched.h | 3
> include/linux/security.h | 15 +
> include/linux/syscalls.h | 3
> include/uapi/linux/container.h | 28 ++
> include/uapi/linux/magic.h | 1
> init/Kconfig | 7
> kernel/Makefile | 2
> kernel/container.c | 462 ++++++++++++++++++++++++++++++++
> kernel/exit.c | 1
> kernel/fork.c | 7
> kernel/namespaces.h | 15 +
> kernel/nsproxy.c | 23 +-
> kernel/sys_ni.c | 4
> security/security.c | 13 +
> 20 files changed, 688 insertions(+), 13 deletions(-)
> create mode 100644 include/linux/container.h
> create mode 100644 include/uapi/linux/container.h
> create mode 100644 kernel/container.c
> create mode 100644 kernel/namespaces.h
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index abe6ea95e0e6..9ccd0f52f874 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -393,3 +393,4 @@
> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
> 385 i386 fsopen sys_fsopen
> 386 i386 fsmount sys_fsmount
> +387 i386 container_create sys_container_create
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 0977c5079831..dab92591511e 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -341,6 +341,7 @@
> 332 common statx sys_statx
> 333 common fsopen sys_fsopen
> 334 common fsmount sys_fsmount
> +335 common container_create sys_container_create
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4e9ad16db79c..7e2d5fe5728b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -28,6 +28,7 @@
> #include <linux/file.h>
> #include <linux/sched/task.h>
> #include <linux/sb_config.h>
> +#include <linux/container.h>
>
> #include "pnode.h"
> #include "internal.h"
> @@ -3510,6 +3511,10 @@ static void __init init_mount_tree(void)
>
> set_fs_pwd(current->fs, &root);
> set_fs_root(current->fs, &root);
> +#ifdef CONFIG_CONTAINERS
> + path_get(&root);
> + init_container.root = root;
> +#endif
> }
>
> void __init mnt_init(void)
> diff --git a/include/linux/container.h b/include/linux/container.h
> new file mode 100644
> index 000000000000..084ea9982fe6
> --- /dev/null
> +++ b/include/linux/container.h
> @@ -0,0 +1,85 @@
> +/* Container objects
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_CONTAINER_H
> +#define _LINUX_CONTAINER_H
> +
> +#include <uapi/linux/container.h>
> +#include <linux/refcount.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +#include <linux/path.h>
> +#include <linux/seqlock.h>
> +
> +struct fs_struct;
> +struct nsproxy;
> +struct task_struct;
> +
> +/*
> + * The container object.
> + */
> +struct container {
> + char name[24];
> + refcount_t usage;
> + int exit_code; /* The exit code of 'init' */
> + const struct cred *cred; /* Creds for this container, including userns */
> + struct nsproxy *ns; /* This container's namespaces */
> + struct path root; /* The root of the container's fs namespace */
> + struct task_struct *init; /* The 'init' task for this container */
> + struct container *parent; /* Parent of this container. */
> + void *security; /* LSM data */
> + struct list_head members; /* Member processes, guarded with ->lock */
> + struct list_head child_link; /* Link in parent->children */
> + struct list_head children; /* Child containers */
> + wait_queue_head_t waitq; /* Someone waiting for init to exit waits here */
> + unsigned long flags;
> +#define CONTAINER_FLAG_INIT_STARTED 0 /* Init is started - certain ops now prohibited */
> +#define CONTAINER_FLAG_DEAD 1 /* Init has died */
> +#define CONTAINER_FLAG_KILL_ON_CLOSE 2 /* Kill init if container handle closed */
> + spinlock_t lock;
> + seqcount_t seq; /* Track changes in ->root */
> +};
> +
> +extern struct container init_container;
> +
> +#ifdef CONFIG_CONTAINERS
> +extern const struct file_operations containerfs_fops;
> +
> +extern int copy_container(unsigned long flags, struct task_struct *tsk,
> + struct container *container);
> +extern void exit_container(struct task_struct *tsk);
> +extern void put_container(struct container *c);
> +
> +static inline struct container *get_container(struct container *c)
> +{
> + refcount_inc(&c->usage);
> + return c;
> +}
> +
> +static inline bool is_container_file(struct file *file)
> +{
> + return file->f_op == &containerfs_fops;
> +}
> +
> +#else
> +
> +static inline int copy_container(unsigned long flags, struct task_struct *tsk,
> + struct container *container)
> +{ return 0; }
> +static inline void exit_container(struct task_struct *tsk) { }
> +static inline void put_container(struct container *c) {}
> +static inline struct container *get_container(struct container *c) { return NULL; }
> +static inline bool is_container_file(struct file *file) { return false; }
> +
> +#endif /* CONFIG_CONTAINERS */
> +
> +#endif /* _LINUX_CONTAINER_H */
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index e049526bc188..488385ad79db 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -9,6 +9,7 @@
> #include <linux/ipc.h>
> #include <linux/pid_namespace.h>
> #include <linux/user_namespace.h>
> +#include <linux/container.h>
> #include <linux/securebits.h>
> #include <linux/seqlock.h>
> #include <linux/rbtree.h>
> @@ -273,6 +274,9 @@ extern struct cred init_cred;
> .signal = &init_signals, \
> .sighand = &init_sighand, \
> .nsproxy = &init_nsproxy, \
> + .container = &init_container, \
> + .container_link.next = &init_container.members, \
> + .container_link.prev = &init_container.members, \
> .pending = { \
> .list = LIST_HEAD_INIT(tsk.pending.list), \
> .signal = {{0}}}, \
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 7064c0c15386..7b0d484a6a25 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1368,6 +1368,17 @@
> * @inode we wish to get the security context of.
> * @ctx is a pointer in which to place the allocated security context.
> * @ctxlen points to the place to put the length of @ctx.
> + *
> + * Security hooks for containers:
> + *
> + * @container_alloc:
> + * Permit creation of a new container and assign security data.
> + * @container: The new container.
> + *
> + * @container_free:
> + * Free security data attached to a container.
> + * @container: The container.
> + *
> * This is the main security structure.
> */
>
> @@ -1699,6 +1710,12 @@ union security_list_options {
> struct audit_context *actx);
> void (*audit_rule_free)(void *lsmrule);
> #endif /* CONFIG_AUDIT */
> +
> + /* Container management security hooks */
> +#ifdef CONFIG_CONTAINERS
> + int (*container_alloc)(struct container *container, unsigned int flags);
> + void (*container_free)(struct container *container);
> +#endif
> };
>
> struct security_hook_heads {
> @@ -1919,6 +1936,10 @@ struct security_hook_heads {
> struct list_head audit_rule_match;
> struct list_head audit_rule_free;
> #endif /* CONFIG_AUDIT */
> +#ifdef CONFIG_CONTAINERS
> + struct list_head container_alloc;
> + struct list_head container_free;
> +#endif /* CONFIG_CONTAINERS */
> };
>
> /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eba196521562..d9b92a98f99f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -33,6 +33,7 @@ struct backing_dev_info;
> struct bio_list;
> struct blk_plug;
> struct cfs_rq;
> +struct container;
> struct fs_struct;
> struct futex_pi_state;
> struct io_context;
> @@ -741,6 +742,8 @@ struct task_struct {
>
> /* Namespaces: */
> struct nsproxy *nsproxy;
> + struct container *container;
> + struct list_head container_link;
>
> /* Signal handlers: */
> struct signal_struct *signal;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 8c06e158c195..01bdf7637ec6 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -68,6 +68,7 @@ struct ctl_table;
> struct audit_krule;
> struct user_namespace;
> struct timezone;
> +struct container;
>
> /* These functions are in security/commoncap.c */
> extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
> @@ -1672,6 +1673,20 @@ static inline void security_audit_rule_free(void *lsmrule)
> #endif /* CONFIG_SECURITY */
> #endif /* CONFIG_AUDIT */
>
> +#ifdef CONFIG_CONTAINERS
> +#ifdef CONFIG_SECURITY
> +int security_container_alloc(struct container *container, unsigned int flags);
> +void security_container_free(struct container *container);
> +#else
> +static inline int security_container_alloc(struct container *container,
> + unsigned int flags)
> +{
> + return 0;
> +}
> +static inline void security_container_free(struct container *container) {}
> +#endif
> +#endif /* CONFIG_CONTAINERS */
> +
> #ifdef CONFIG_SECURITYFS
>
> extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 07e4f775f24d..5a0324dd024c 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -908,5 +908,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags);
> asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
> unsigned int flags);
> +asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
> + unsigned long spare3, unsigned long spare4,
> + unsigned long spare5);
>
> #endif
> diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
> new file mode 100644
> index 000000000000..43748099b28d
> --- /dev/null
> +++ b/include/uapi/linux/container.h
> @@ -0,0 +1,28 @@
> +/* Container UAPI
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _UAPI_LINUX_CONTAINER_H
> +#define _UAPI_LINUX_CONTAINER_H
> +
> +
> +#define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
> +#define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
> +#define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace */
> +#define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
> +#define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
> +#define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
> +#define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
> +#define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
> +#define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
> +#define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
> +#define CONTAINER__FLAG_MASK 0x000003ff
> +
> +#endif /* _UAPI_LINUX_CONTAINER_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 88ae83492f7c..758705412b44 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -85,5 +85,6 @@
> #define BALLOON_KVM_MAGIC 0x13661366
> #define ZSMALLOC_MAGIC 0x58295829
> #define FS_FS_MAGIC 0x66736673
> +#define CONTAINERFS_MAGIC 0x636f6e74
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/init/Kconfig b/init/Kconfig
> index 1d3475fc9496..3a0ee88df6c8 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1288,6 +1288,13 @@ config NET_NS
> Allow user space to create what appear to be multiple instances
> of the network stack.
>
> +config CONTAINERS
> + bool "Container support"
> + default y
> + help
> + Allow userspace to create and manipulate containers as objects that
> + have namespaces and hold a set of processes.
> +
> endif # NAMESPACES
>
> config SCHED_AUTOGROUP
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 72aa080f91f0..117479b05fb1 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
> sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
> signal.o sys.o kmod.o workqueue.o pid.o task_work.o \
> extable.o params.o \
> - kthread.o sys_ni.o nsproxy.o \
> + kthread.o sys_ni.o nsproxy.o container.o \
> notifier.o ksysfs.o cred.o reboot.o \
> async.o range.o smpboot.o ucount.o
>
> diff --git a/kernel/container.c b/kernel/container.c
> new file mode 100644
> index 000000000000..eef1566835eb
> --- /dev/null
> +++ b/kernel/container.c
> @@ -0,0 +1,462 @@
> +/* Implement container objects.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <linux/init_task.h>
> +#include <linux/fs.h>
> +#include <linux/fs_struct.h>
> +#include <linux/mount.h>
> +#include <linux/file.h>
> +#include <linux/container.h>
> +#include <linux/magic.h>
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/security.h>
> +#include "namespaces.h"
> +
> +struct container init_container = {
> + .name = ".init",
> + .usage = REFCOUNT_INIT(2),
> + .cred = &init_cred,
> + .ns = &init_nsproxy,
> + .init = &init_task,
> + .members.next = &init_task.container_link,
> + .members.prev = &init_task.container_link,
> + .children = LIST_HEAD_INIT(init_container.children),
> + .flags = (1 << CONTAINER_FLAG_INIT_STARTED),
> + .lock = __SPIN_LOCK_UNLOCKED(init_container.lock),
> + .seq = SEQCNT_ZERO(init_fs.seq),
> +};
> +
> +#ifdef CONFIG_CONTAINERS
> +
> +static struct vfsmount *containerfs_mnt __read_mostly;
> +
> +/*
> + * Drop a ref on a container and clear it if no longer in use.
> + */
> +void put_container(struct container *c)
> +{
> + struct container *parent;
> +
> + while (c && refcount_dec_and_test(&c->usage)) {
> + BUG_ON(!list_empty(&c->members));
> + if (c->ns)
> + put_nsproxy(c->ns);
> + path_put(&c->root);
> +
> + parent = c->parent;
> + if (parent) {
> + spin_lock(&parent->lock);
> + list_del(&c->child_link);
> + spin_unlock(&parent->lock);
> + }
> +
> + if (c->cred)
> + put_cred(c->cred);
> + security_container_free(c);
> + kfree(c);
> + c = parent;
> + }
> +}
> +
> +/*
> + * Allow the user to poll for the container dying.
> + */
> +static unsigned int containerfs_poll(struct file *file, poll_table *wait)
> +{
> + struct container *container = file->private_data;
> + unsigned int mask = 0;
> +
> + poll_wait(file, &container->waitq, wait);
> +
> + if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
> + mask |= POLLHUP;
> +
> + return mask;
> +}
> +
> +static int containerfs_release(struct inode *inode, struct file *file)
> +{
> + struct container *container = file->private_data;
> +
> + put_container(container);
> + return 0;
> +}
> +
> +const struct file_operations containerfs_fops = {
> + .poll = containerfs_poll,
> + .release = containerfs_release,
> +};
> +
> +/*
> + * Indicate the name we want to display the container file as.
> + */
> +static char *containerfs_dname(struct dentry *dentry, char *buffer, int buflen)
> +{
> + return dynamic_dname(dentry, buffer, buflen, "container:[%lu]",
> + d_inode(dentry)->i_ino);
> +}
> +
> +static const struct dentry_operations containerfs_dentry_operations = {
> + .d_dname = containerfs_dname,
> +};
> +
> +/*
> + * Allocate a container.
> + */
> +static struct container *alloc_container(const char __user *name)
> +{
> + struct container *c;
> + long len;
> + int ret;
> +
> + c = kzalloc(sizeof(struct container), GFP_KERNEL);
> + if (!c)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&c->members);
> + INIT_LIST_HEAD(&c->children);
> + init_waitqueue_head(&c->waitq);
> + spin_lock_init(&c->lock);
> + refcount_set(&c->usage, 1);
> +
> + ret = -EFAULT;
> + len = strncpy_from_user(c->name, name, sizeof(c->name));
> + if (len < 0)
> + goto err;
> + ret = -ENAMETOOLONG;
> + if (len >= sizeof(c->name))
> + goto err;
> + ret = -EINVAL;
> + if (strchr(c->name, '/'))
> + goto err;
> +
> + c->name[len] = 0;
> + return c;
> +
> +err:
> + kfree(c);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a supervisory file for a new container
> + */
> +static struct file *create_container_file(struct container *c)
> +{
> + struct inode *inode;
> + struct file *f;
> + struct path path;
> + int ret;
> +
> + inode = alloc_anon_inode(containerfs_mnt->mnt_sb);
> + if (!inode)
> + return ERR_PTR(-ENFILE);
> + inode->i_fop = &containerfs_fops;
> +
> + ret = -ENOMEM;
> + path.dentry = d_alloc_pseudo(containerfs_mnt->mnt_sb, &empty_name);
> + if (!path.dentry)
> + goto err_inode;
> + path.mnt = mntget(containerfs_mnt);
> +
> + d_instantiate(path.dentry, inode);
> +
> + f = alloc_file(&path, 0, &containerfs_fops);
> + if (IS_ERR(f)) {
> + ret = PTR_ERR(f);
> + goto err_file;
> + }
> +
> + f->private_data = c;
> + return f;
> +
> +err_file:
> + path_put(&path);
> + return ERR_PTR(ret);
> +
> +err_inode:
> + iput(inode);
> + return ERR_PTR(ret);
> +}
> +
> +static const struct super_operations containerfs_ops = {
> + .drop_inode = generic_delete_inode,
> + .destroy_inode = free_inode_nonrcu,
> + .statfs = simple_statfs,
> +};
> +
> +/*
> + * containerfs should _never_ be mounted by userland - too much of security
> + * hassle, no real gain from having the whole whorehouse mounted. So we don't
> + * need any operations on the root directory. However, we need a non-trivial
> + * d_name - container: will go nicely and kill the special-casing in procfs.
> + */
> +static struct dentry *containerfs_mount(struct file_system_type *fs_type,
> + int flags, const char *dev_name,
> + void *data)
> +{
> + return mount_pseudo(fs_type, "container:", &containerfs_ops,
> + &containerfs_dentry_operations, CONTAINERFS_MAGIC);
> +}
> +
> +static struct file_system_type container_fs_type = {
> + .name = "containerfs",
> + .mount = containerfs_mount,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static int __init init_container_fs(void)
> +{
> + int ret;
> +
> + ret = register_filesystem(&container_fs_type);
> + if (ret < 0)
> + panic("Cannot register containerfs\n");
> +
> + containerfs_mnt = kern_mount(&container_fs_type);
> + if (IS_ERR(containerfs_mnt))
> + panic("Cannot mount containerfs: %ld\n",
> + PTR_ERR(containerfs_mnt));
> +
> + return 0;
> +}
> +
> +fs_initcall(init_container_fs);
> +
> +/*
> + * Handle fork/clone.
> + *
> + * A process inherits its parent's container. The first process into the
> + * container is its 'init' process and the life of everything else in there is
> + * dependent upon that.
> + */
> +int copy_container(unsigned long flags, struct task_struct *tsk,
> + struct container *container)
> +{
> + struct container *c = container ?: tsk->container;
> + int ret = -ECANCELED;
> +
> + spin_lock(&c->lock);
> +
> + if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
> + list_add_tail(&tsk->container_link, &c->members);
> + get_container(c);
> + tsk->container = c;
> + if (!c->init) {
> + set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
> + c->init = tsk;
> + }
> + ret = 0;
> + }
> +
> + spin_unlock(&c->lock);
> + return ret;
> +}
> +
> +/*
> + * Remove a dead process from a container.
> + *
> + * If the 'init' process in a container dies, we kill off all the other
> + * processes in the container.
> + */
> +void exit_container(struct task_struct *tsk)
> +{
> + struct task_struct *p;
> + struct container *c = tsk->container;
> + struct siginfo si = {
> + .si_signo = SIGKILL,
> + .si_code = SI_KERNEL,
> + };
> +
> + spin_lock(&c->lock);
> +
> + list_del(&tsk->container_link);
> +
> + if (c->init == tsk) {
> + c->init = NULL;
> + c->exit_code = tsk->exit_code;
> + smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
> + set_bit(CONTAINER_FLAG_DEAD, &c->flags);
> + wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
> +
> + list_for_each_entry(p, &c->members, container_link) {
> + si.si_pid = task_tgid_vnr(p);
> + send_sig_info(SIGKILL, &si, p);
> + }
> + }
> +
> + spin_unlock(&c->lock);
> + put_container(c);
> +}
> +
> +/*
> + * Create some creds for the container. We don't want to pin things we don't
> + * have to, so drop all keyrings from the new cred. The LSM gets to audit the
> + * cred struct when security_container_alloc() is invoked.
> + */
> +static const struct cred *create_container_creds(unsigned int flags)
> +{
> + struct cred *new;
> + int ret;
> +
> + new = prepare_creds();
> + if (!new)
> + return ERR_PTR(-ENOMEM);
> +
> +#ifdef CONFIG_KEYS
> + key_put(new->thread_keyring);
> + new->thread_keyring = NULL;
> + key_put(new->process_keyring);
> + new->process_keyring = NULL;
> + key_put(new->session_keyring);
> + new->session_keyring = NULL;
> + key_put(new->request_key_auth);
> + new->request_key_auth = NULL;
> +#endif
> +
> + if (flags & CONTAINER_NEW_USER_NS) {
> + ret = create_user_ns(new);
> + if (ret < 0)
> + goto err;
> + new->euid = new->user_ns->owner;
> + new->egid = new->user_ns->group;
> + }
> +
> + new->fsuid = new->suid = new->uid = new->euid;
> + new->fsgid = new->sgid = new->gid = new->egid;
> + return new;
> +
> +err:
> + abort_creds(new);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container.
> + */
> +static struct container *create_container(const char *name, unsigned int flags)
> +{
> + struct container *parent, *c;
> + struct fs_struct *fs;
> + struct nsproxy *ns;
> + const struct cred *cred;
> + int ret;
> +
> + c = alloc_container(name);
> + if (IS_ERR(c))
> + return c;
> +
> + if (flags & CONTAINER_KILL_ON_CLOSE)
> + __set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
> +
> + cred = create_container_creds(flags);
> + if (IS_ERR(cred)) {
> + ret = PTR_ERR(cred);
> + goto err_cont;
> + }
> + c->cred = cred;
> +
> + ret = -ENOMEM;
> + fs = copy_fs_struct(current->fs);
> + if (!fs)
> + goto err_cont;
> +
> + ns = create_new_namespaces(
> + (flags & CONTAINER_NEW_FS_NS ? CLONE_NEWNS : 0) |
> + (flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
> + (flags & CONTAINER_NEW_UTS_NS ? CLONE_NEWUTS : 0) |
> + (flags & CONTAINER_NEW_IPC_NS ? CLONE_NEWIPC : 0) |
> + (flags & CONTAINER_NEW_PID_NS ? CLONE_NEWPID : 0) |
> + (flags & CONTAINER_NEW_NET_NS ? CLONE_NEWNET : 0),
> + current->nsproxy, cred->user_ns, fs);
> + if (IS_ERR(ns)) {
> + ret = PTR_ERR(ns);
> + goto err_fs;
> + }
> +
> + c->ns = ns;
> + c->root = fs->root;
> + c->seq = fs->seq;
> + fs->root.mnt = NULL;
> + fs->root.dentry = NULL;
> +
> + ret = security_container_alloc(c, flags);
> + if (ret < 0)
> + goto err_fs;
> +
> + parent = current->container;
> + get_container(parent);
> + c->parent = parent;
> + spin_lock(&parent->lock);
> + list_add_tail(&c->child_link, &parent->children);
> + spin_unlock(&parent->lock);
> + return c;
> +
> +err_fs:
> + free_fs_struct(fs);
> +err_cont:
> + put_container(c);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container object.
> + */
> +SYSCALL_DEFINE5(container_create,
> + const char __user *, name,
> + unsigned int, flags,
> + unsigned long, spare3,
> + unsigned long, spare4,
> + unsigned long, spare5)
> +{
> + struct container *c;
> + struct file *f;
> + int ret, fd;
> +
> + if (!name ||
> + flags & ~CONTAINER__FLAG_MASK ||
> + spare3 != 0 || spare4 != 0 || spare5 != 0)
> + return -EINVAL;
> + if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
> + (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
> + return -EINVAL;
> +
> + c = create_container(name, flags);
> + if (IS_ERR(c))
> + return PTR_ERR(c);
> +
> + f = create_container_file(c);
> + if (IS_ERR(f)) {
> + ret = PTR_ERR(f);
> + goto err_cont;
> + }
> +
> + ret = get_unused_fd_flags(flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0);
> + if (ret < 0)
> + goto err_file;
> +
> + fd = ret;
> + fd_install(fd, f);
> + return fd;
> +
> +err_file:
> + fput(f);
> + return ret;
> +err_cont:
> + put_container(c);
> + return ret;
> +}
> +
> +#endif /* CONFIG_CONTAINERS */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 31b8617aee04..1ff87f7e40a2 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -875,6 +875,7 @@ void __noreturn do_exit(long code)
> if (group_dead)
> disassociate_ctty(1);
> exit_task_namespaces(tsk);
> + exit_container(tsk);
> exit_task_work(tsk);
> exit_thread(tsk);
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index aec6672d3f0e..ff2779426fe9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1728,9 +1728,12 @@ static __latent_entropy struct task_struct *copy_process(
> retval = copy_namespaces(clone_flags, p);
> if (retval)
> goto bad_fork_cleanup_mm;
> - retval = copy_io(clone_flags, p);
> + retval = copy_container(clone_flags, p, NULL);
> if (retval)
> goto bad_fork_cleanup_namespaces;
> + retval = copy_io(clone_flags, p);
> + if (retval)
> + goto bad_fork_cleanup_container;
> retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
> if (retval)
> goto bad_fork_cleanup_io;
> @@ -1918,6 +1921,8 @@ static __latent_entropy struct task_struct *copy_process(
> bad_fork_cleanup_io:
> if (p->io_context)
> exit_io_context(p);
> +bad_fork_cleanup_container:
> + exit_container(p);
> bad_fork_cleanup_namespaces:
> exit_task_namespaces(p);
> bad_fork_cleanup_mm:
> diff --git a/kernel/namespaces.h b/kernel/namespaces.h
> new file mode 100644
> index 000000000000..c44e3cf0e254
> --- /dev/null
> +++ b/kernel/namespaces.h
> @@ -0,0 +1,15 @@
> +/* Local namespaces defs
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +extern struct nsproxy *create_new_namespaces(unsigned long flags,
> + struct nsproxy *nsproxy,
> + struct user_namespace *user_ns,
> + struct fs_struct *new_fs);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index f6c5d330059a..4bb5184b3a80 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -27,6 +27,7 @@
> #include <linux/syscalls.h>
> #include <linux/cgroup.h>
> #include <linux/perf_event.h>
> +#include "namespaces.h"
>
> static struct kmem_cache *nsproxy_cachep;
>
> @@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
> * Return the newly created nsproxy. Do not attach this to the task,
> * leave it to the caller to do proper locking and attach it to task.
> */
> -static struct nsproxy *create_new_namespaces(unsigned long flags,
> - struct task_struct *tsk, struct user_namespace *user_ns,
> +struct nsproxy *create_new_namespaces(unsigned long flags,
> + struct nsproxy *nsproxy, struct user_namespace *user_ns,
> struct fs_struct *new_fs)
> {
> struct nsproxy *new_nsp;
> @@ -72,39 +73,39 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
> if (!new_nsp)
> return ERR_PTR(-ENOMEM);
>
> - new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
> + new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
> if (IS_ERR(new_nsp->mnt_ns)) {
> err = PTR_ERR(new_nsp->mnt_ns);
> goto out_ns;
> }
>
> - new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
> + new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
> if (IS_ERR(new_nsp->uts_ns)) {
> err = PTR_ERR(new_nsp->uts_ns);
> goto out_uts;
> }
>
> - new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
> + new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
> if (IS_ERR(new_nsp->ipc_ns)) {
> err = PTR_ERR(new_nsp->ipc_ns);
> goto out_ipc;
> }
>
> new_nsp->pid_ns_for_children =
> - copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
> + copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
> if (IS_ERR(new_nsp->pid_ns_for_children)) {
> err = PTR_ERR(new_nsp->pid_ns_for_children);
> goto out_pid;
> }
>
> new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> - tsk->nsproxy->cgroup_ns);
> + nsproxy->cgroup_ns);
> if (IS_ERR(new_nsp->cgroup_ns)) {
> err = PTR_ERR(new_nsp->cgroup_ns);
> goto out_cgroup;
> }
>
> - new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
> + new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
> if (IS_ERR(new_nsp->net_ns)) {
> err = PTR_ERR(new_nsp->net_ns);
> goto out_net;
> @@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> (CLONE_NEWIPC | CLONE_SYSVSEM))
> return -EINVAL;
>
> - new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
> + new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
> if (IS_ERR(new_ns))
> return PTR_ERR(new_ns);
>
> @@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> - *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
> + *new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
> new_fs ? new_fs : current->fs);
> if (IS_ERR(*new_nsp)) {
> err = PTR_ERR(*new_nsp);
> @@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> if (nstype && (ns->ops->type != nstype))
> goto out;
>
> - new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> + new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
> if (IS_ERR(new_nsproxy)) {
> err = PTR_ERR(new_nsproxy);
> goto out;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a0fe764bd5dd..99b1e1f58d05 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -262,3 +262,7 @@ cond_syscall(sys_pkey_free);
> /* fd-based mount */
> cond_syscall(sys_fsopen);
> cond_syscall(sys_fsmount);
> +
> +/* Containers */
> +cond_syscall(sys_container_create);
> +
> diff --git a/security/security.c b/security/security.c
> index f4136ca5cb1b..b5c5b5ae1266 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1668,3 +1668,16 @@ int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule,
> actx);
> }
> #endif /* CONFIG_AUDIT */
> +
> +#ifdef CONFIG_CONTAINERS
> +
> +int security_container_alloc(struct container *container, unsigned int flags)
> +{
> + return call_int_hook(container_alloc, 0, container, flags);
> +}
> +
> +void security_container_free(struct container *container)
> +{
> + call_void_hook(container_free, container);
> +}
> +#endif /* CONFIG_CONTAINERS */
- RGB
--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <[email protected]> wrote:
> Hi David,
>
> I wanted to respond to this thread to attempt some constructive feedback,
> better late than never. I had a look at your fsopen/fsmount() patchset(s) to
> support this patchset which was interesting, but doesn't directly affect my
> work. The primary patch of interest to the audit kernel folks (Paul Moore and
> me) is this patch while the rest of the patchset is interesting, but not likely
> to directly affect us. This patch has most of what we need to solve our
> problem.
>
> Paul and I agree that audit is going to have a difficult time identifying
> containers or even namespaces without some change to the kernel. The audit
> subsystem in the kernel needs at least a basic clue about which container
> caused an event to be able to report this at the appropriate level and ignore
> it at other levels to avoid a DoS.
While there is some increased risk of "death by audit", this is really
only an issue once we start supporting multiple audit daemons; simply
associating auditable events with the container that triggered them
shouldn't add any additional overhead (I hope). For a number of use
cases, a single auditd running outside the containers, but recording
all their events with some type of container attribution will be
sufficient. This is step #1.
However, we will obviously want to go a bit further and support
multiple audit daemons on the system to allow containers to
record/process their own events (side note: the non-container auditd
instance will still see all the events). There are a number of ways
we could tackle this, both via in-kernel and in-userspace record
routing, each with their own pros/cons. However, how this works is
going to be dependent on how we identify containers and track their
audit events: the bits from step #1. For this reason I'm not really
interested in worrying about the multiple auditd problem just yet;
it's obviously important, and something to keep in mind while working
up a solution, but it isn't something we should focus on right now.
> We also agree that there will need to be some sort of trigger from userspace to
> indicate the creation of a container and its allocated resources and we're not
> really picky how that is done, such as a clone flag, a syscall or a sysfs write
> (or even a read, I suppose), but there will need to be some permission
> restrictions, obviously. (I'd like to see capabilities used for this by adding
> a specific container bit to the capabilities bitmask.)
To be clear, from an audit perspective I think the only thing we would
really care about controlling access to is the creation and assignment
of a new audit container ID/token, not necessarily the container
itself. It's a small point, but an important one I think.
> I doubt we will be able to accommodate all definitions or concepts of a
> container in a timely fashion. We'll need to start somewhere with a minimum
> definition so that we can get traction and actually move forward before another
> compelling shared kernel microservice method leaves our entire community
> behind. I'd like to declare that a container is a full set of cloned
> namespaces, but this is inefficient, overly constricting and unnecessary for
> our needs. If we could agree on a minimum definition of a container (which may
> have only one specific cloned namespace) then we have something on which to
> build. I could even see a container being defined by a trigger sent from
> userspace about a process (task) from which all its children are considered to
> be within that container, subject to further nesting.
I really would prefer if we could avoid defining the term "container".
Even if we manage to get it right at this particular moment, we will
surely be made fools a year or two from now when things change. At
the very least lets avoid a rigid definition of container, I'll
concede that we will probably need to have some definition simply so
we can implement something, I just don't want the design or
implementation to depend on a particular definition.
This comment is jumping ahead a bit, but from an audit perspective I
think we handle this by emitting an audit record whenever a container
ID is created which describes it as the kernel sees it; as of now that
probably means a list of namespace IDs. Richard mentions this in his
email, I just wanted to make it clear that I think we should see this
as a flexible mechanism. At the very least we will likely see a few
more namespaces before the world moves on from containers.
> In the simplest usable model for audit, if a container (definition implies and)
> starts a PID namespace, then the container ID could simply be the container's
> "init" process PID in the initial PID namespace. This assumes that as soon as
> that process vanishes, that entire container and all its children are killed
> off (which you've done). There may be some container orchestration systems
> that don't use a unique PID namespace per container and that imposing this will
> cause them challenges.
I don't follow how this would cause challenges if the containers do
not use a unique PID namespace; you are suggesting using the PID in
the context of the initial PID namespace, yes?
Regardless, I do worry that using a PID could potentially be a bit
racy once we start jumping between kernel and userspace (audit
configuration, logs, etc.).
> If containers have at minimum a unique mount namespace then the root path
> dentry inode device and inode number could be used, but there are likely better
> identifiers. Again, there may be container orchestrators that don't use a
> unique mount namespace per container and that imposing this will cause
> challenges.
>
> I expect there are similar examples for each of the other namespaces.
The PID case is a bit unique as each process is going to have a unique
PID regardless of namespaces, but even that has some drawbacks as
discussed above. As for the other namespaces, I agree that we can't
rely on them (see my earlier comments).
> If we could pick one namespace type for consensus for which each container has
> a unique instance of that namespace, we could use the dev/ino tuple from that
> namespace as had originally been suggested by Aristeu Rozanski more than 4
> years ago as part of the set of namespace IDs. I had also attempted to
> solve this problem by using the namespace's proc inode, then switched over to
> generate a unique kernel serial number for each namespace and then went back to
> namespace proc dev/ino once Al Viro implemented nsfs:
> v1 https://lkml.org/lkml/2014/4/22/662
> v2 https://lkml.org/lkml/2014/5/9/637
> v3 https://lkml.org/lkml/2014/5/20/287
> v4 https://lkml.org/lkml/2014/8/20/844
> v5 https://lkml.org/lkml/2014/10/6/25
> v6 https://lkml.org/lkml/2015/4/17/48
> v7 https://lkml.org/lkml/2015/5/12/773
>
> These patches don't use a container ID, but track all namespaces in use for an
> event. This has the benefit of punting this tracking to userspace for some
> other tool to analyse and determine to which container an event belongs.
> This will use a lot of bandwidth in audit log files when a single
> container ID that doesn't require nesting information to be complete
> would be a much more efficient use of audit log bandwidth.
Relying on a particular namespace to identify a containers is a
non-starter from my perspective for all the reasons previously
discussed.
> If we rely only on the setting of arbitrary container names from userspace,
> then we must provide a map or tree back to the initial audit domain for that
> running kernel to be able to differentiate between potentially identical
> container names assigned in a nested container system. If we assign a
> container serial number sequentially (atomic64_inc) from the kernel on request
> from userspace like the sessionID and log the creation with all nsIDs and the
> parent container serial number and/or container name, the nesting is clear due
> to lack of ambiguity in potential duplicate names in nesting. If a container
> serial number is used, the tree of inheritance of nested containers can be
> rebuilt from the audit records showing what containers were spawned from what
> parent.
I believe we are going to need a container ID to container definition
(namespace, etc.) mapping mechanism regardless of if the container ID
is provided by userspace or a kernel generated serial number. This
mapping should be recorded in the audit log when the container ID is
created/defined.
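As a hedged illustration of that mapping (all names below are invented for
this sketch), each entry would pin a container ID to the namespace set and
parent that defined it at creation time:

#include <linux/types.h>
#include <linux/list.h>

/* Illustrative only: what the audit subsystem would remember about a
 * container ID when it is created/defined. */
struct audit_contid_entry {
	u64			contid;		/* serial or userspace token */
	u64			parent_contid;	/* 0 if top level */
	unsigned long		ns_ino[6];	/* mnt, uts, ipc, pid, net, cgroup */
	struct list_head	list;
};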
> As was suggested in one of the previous threads, if there are any events not
> associated with a task (incoming network packets) we log the namespace ID and
> then only concern ourselves with its container serial number or container name
> once it becomes associated with a task at which point that tracking will be
> more important anyways.
Agreed. After all, a single namespace can be shared between multiple
containers. For those security officers who need to track individual
events like this they will have the container ID mapping information
in the logs as well so they should be able to trace the unassociated
event to a set of containers.
> I'm not convinced that a userspace or kernel generated UUID is that useful
> since they are large, not human readable and may not be globally unique given
> the "pets vs cattle" direction we are going with potentially identical
> conditions in hosts or containers spawning containers, but I see no need to
> restrict them.
On 2017-08-16 18:21, Paul Moore wrote:
> On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <[email protected]> wrote:
> > Hi David,
> >
> > I wanted to respond to this thread to attempt some constructive feedback,
> > better late than never. I had a look at your fsopen/fsmount() patchset(s) to
> > support this patchset which was interesting, but doesn't directly affect my
> > work. The primary patch of interest to the audit kernel folks (Paul Moore and
> > me) is this patch while the rest of the patchset is interesting, but not likely
> > to directly affect us. This patch has most of what we need to solve our
> > problem.
> >
> > Paul and I agree that audit is going to have a difficult time identifying
> > containers or even namespaces without some change to the kernel. The audit
> > subsystem in the kernel needs at least a basic clue about which container
> > caused an event to be able to report this at the appropriate level and ignore
> > it at other levels to avoid a DoS.
>
> While there is some increased risk of "death by audit", this is really
> only an issue once we start supporting multiple audit daemons; simply
> associating auditable events with the container that triggered them
> shouldn't add any additional overhead (I hope). For a number of use
> cases, a single auditd running outside the containers, but recording
> all their events with some type of container attribution will be
> sufficient. This is step #1.
>
> However, we will obviously want to go a bit further and support
> multiple audit daemons on the system to allow containers to
> record/process their own events (side note: the non-container auditd
> instance will still see all the events). There are a number of ways
> we could tackle this, both via in-kernel and in-userspace record
> routing, each with their own pros/cons. However, how this works is
> going to be dependent on how we identify containers and track their
> audit events: the bits from step #1. For this reason I'm not really
> interested in worrying about the multiple auditd problem just yet;
> it's obviously important, and something to keep in mind while working
> up a solution, but it isn't something we should focus on right now.
>
> > We also agree that there will need to be some sort of trigger from userspace to
> > indicate the creation of a container and its allocated resources and we're not
> > really picky how that is done, such as a clone flag, a syscall or a sysfs write
> > (or even a read, I suppose), but there will need to be some permission
> > restrictions, obviously. (I'd like to see capabilities used for this by adding
> > a specific container bit to the capabilities bitmask.)
>
> To be clear, from an audit perspective I think the only thing we would
> really care about controlling access to is the creation and assignment
> of a new audit container ID/token, not necessarily the container
> itself. It's a small point, but an important one I think.
>
> > I doubt we will be able to accommodate all definitions or concepts of a
> > container in a timely fashion. We'll need to start somewhere with a minimum
> > definition so that we can get traction and actually move forward before another
> > compelling shared kernel microservice method leaves our entire community
> > behind. I'd like to declare that a container is a full set of cloned
> > namespaces, but this is inefficient, overly constricting and unnecessary for
> > our needs. If we could agree on a minimum definition of a container (which may
> > have only one specific cloned namespace) then we have something on which to
> > build. I could even see a container being defined by a trigger sent from
> > userspace about a process (task) from which all its children are considered to
> > be within that container, subject to further nesting.
>
> I really would prefer if we could avoid defining the term "container".
> Even if we manage to get it right at this particular moment, we will
> surely be made fools a year or two from now when things change. At
> the very least lets avoid a rigid definition of container, I'll
> concede that we will probably need to have some definition simply so
> we can implement something, I just don't want the design or
> implementation to depend on a particular definition.
>
> This comment is jumping ahead a bit, but from an audit perspective I
> think we handle this by emitting an audit record whenever a container
> ID is created which describes it as the kernel sees it; as of now that
> probably means a list of namespace IDs. Richard mentions this in his
> email, I just wanted to make it clear that I think we should see this
> as a flexible mechanism. At the very least we will likely see a few
> more namespaces before the world moves on from containers.
>
> > In the simplest usable model for audit, if a container (definition implies and)
> > starts a PID namespace, then the container ID could simply be the container's
> > "init" process PID in the initial PID namespace. This assumes that as soon as
> > that process vanishes, that entire container and all its children are killed
> > off (which you've done). There may be some container orchestration systems
> > that don't use a unique PID namespace per container and that imposing this will
> > cause them challenges.
>
> I don't follow how this would cause challenges if the containers do
> not use a unique PID namespace; you are suggesting using the PID in
> the context of the initial PID namespace, yes?
The PID of the "init" process of a container (PID=1 inside container,
but PID=containerID from the initial PID namespace perspective).
> Regardless, I do worry that using a PID could potentially be a bit
> racy once we start jumping between kernel and userspace (audit
> configuration, logs, etc.).
How do you think this could be racy? An event happening before or as
the container has been defined?
> > If containers have at minimum a unique mount namespace then the root path
> > dentry inode device and inode number could be used, but there are likely better
> > identifiers. Again, there may be container orchestrators that don't use a
> > unique mount namespace per container and that imposing this will cause
> > challenges.
> >
> > I expect there are similar examples for each of the other namespaces.
>
> The PID case is a bit unique as each process is going to have a unique
> PID regardless of namespaces, but even that has some drawbacks as
> discussed above. As for the other namespaces, I agree that we can't
> rely on them (see my earlier comments).
(In general, can you specify which earlier comments you mean so we can be
sure what you are referring to?)
> > If we could pick one namespace type for consensus for which each container has
> > a unique instance of that namespace, we could use the dev/ino tuple from that
> > namespace as had originally been suggested by Aristeu Rozanski more than 4
> > years ago as part of the set of namespace IDs. I had also attempted to
> > solve this problem by using the namespace's proc inode, then switched over to
> > generate a unique kernel serial number for each namespace and then went back to
> > namespace proc dev/ino once Al Viro implemented nsfs:
> > v1 https://lkml.org/lkml/2014/4/22/662
> > v2 https://lkml.org/lkml/2014/5/9/637
> > v3 https://lkml.org/lkml/2014/5/20/287
> > v4 https://lkml.org/lkml/2014/8/20/844
> > v5 https://lkml.org/lkml/2014/10/6/25
> > v6 https://lkml.org/lkml/2015/4/17/48
> > v7 https://lkml.org/lkml/2015/5/12/773
> >
> > These patches don't use a container ID, but track all namespaces in use for an
> > event. This has the benefit of punting this tracking to userspace for some
> > other tool to analyse and determine to which container an event belongs.
> > This will use a lot of bandwidth in audit log files when a single
> > container ID that doesn't require nesting information to be complete
> > would be a much more efficient use of audit log bandwidth.
>
> Relying on a particular namespace to identify a containers is a
> non-starter from my perspective for all the reasons previously
> discussed.
I'd rather not either and suspect there isn't much danger of it, but if
it is determined that there is one namespace in particular that is a
minimum requirement, I'd prefer to use that nsID instead of creating an
additional ID.
> > If we rely only on the setting of arbitrary container names from userspace,
> > then we must provide a map or tree back to the initial audit domain for that
> > running kernel to be able to differentiate between potentially identical
> > container names assigned in a nested container system. If we assign a
> > container serial number sequentially (atomic64_inc) from the kernel on request
> > from userspace like the sessionID and log the creation with all nsIDs and the
> > parent container serial number and/or container name, the nesting is clear
> > because potentially duplicate names can no longer cause ambiguity. If a container
> > serial number is used, the tree of inheritance of nested containers can be
> > rebuilt from the audit records showing what containers were spawned from what
> > parent.
>
> I believe we are going to need a container ID to container definition
> (namespace, etc.) mapping mechanism regardless of whether the container ID
> is provided by userspace or a kernel-generated serial number. This
> mapping should be recorded in the audit log when the container ID is
> created/defined.
Agreed.
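To make that mapping record concrete, here is a minimal sketch of what
the kernel side could look like. The function names and the
AUDIT_CONTAINER_INFO record type are hypothetical; only
atomic64_inc_return() and audit_log() are existing kernel interfaces.

#include <linux/atomic.h>
#include <linux/audit.h>
#include <linux/gfp.h>

/* Minimal sketch, not code from this patchset: a serial number
 * allocated with atomic64_inc_return() is unique until reboot, and
 * the ID-to-definition mapping is logged at creation time.
 */
#define AUDIT_CONTAINER_INFO	1330	/* assumed, not an allocated type */

static atomic64_t container_id_last = ATOMIC64_INIT(0);

static u64 container_id_alloc(void)
{
	return atomic64_inc_return(&container_id_last);
}

static void audit_log_container_info(u64 id, u64 parent)
{
	/* A real record would also carry the nsIDs of all namespaces. */
	audit_log(NULL, GFP_KERNEL, AUDIT_CONTAINER_INFO,
		  "contid=%llu parent=%llu",
		  (unsigned long long)id, (unsigned long long)parent);
}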
> > As was suggested in one of the previous threads, if there are any events not
> > associated with a task (incoming network packets) we log the namespace ID and
> > then only concern ourselves with its container serial number or container name
> > once it becomes associated with a task, at which point that tracking will be
> > more important anyway.
>
> Agreed. After all, a single namespace can be shared between multiple
> containers. For those security officers who need to track individual
> events like this they will have the container ID mapping information
> in the logs as well so they should be able to trace the unassociated
> event to a set of containers.
>
> > I'm not convinced that a userspace or kernel generated UUID is that useful
> > since they are large, not human readable and may not be globally unique given
> > the "pets vs cattle" direction we are going with potentially identical
> > conditions in hosts or containers spawning containers, but I see no need to
> > restrict them.
>
> From a kernel perspective I think an int should suffice; after all,
> you can't have more containers than you have processes. If the
> container engine requires something more complex, it can use the int
> as input to its own mapping function.
PIDs roll over. That already causes some ambiguity in reporting. If a
system is constantly spawning and reaping containers, especially
single-process containers, I don't want to have to worry about that ID
rolling over in order to keep track of it, even though there should be
audit records of the spawn and death of each container. There isn't
significant cost added here compared with some of the other overhead
we're dealing with.
> > How do we deal with setns()? Once it is determined that action is permitted,
> > given the new combination of namespaces and potential membership in a different
> > container, record the transition from one container to another, including all
> > namespaces if the latter are a different subset from the target container's
> > initial set.
>
> That is a fun one, isn't it? I think this is where the container
> ID-to-definition mapping comes into play. If setns() changes the
> process such that the existing container ID is no longer valid then we
> need to do a new lookup in the table to see if another container ID is
> valid; if no established container ID mappings are valid, the
> container ID becomes "undefined".
Hopefully we can design this stuff so that container IDs are still valid
while that transition occurs.
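One way to picture that lookup, as a purely illustrative sketch: the
structures and names below are hypothetical, and only struct nsproxy and
its fields are existing kernel pieces.

#include <linux/list.h>
#include <linux/nsproxy.h>
#include <linux/types.h>

/* Purely illustrative: walk the recorded container mappings and return
 * the first whose namespace set still matches the task's after
 * setns(), else report the ID as undefined.
 */
#define CONTAINER_ID_UNDEFINED	((u64)-1)

struct container_map {
	struct list_head list;
	u64 id;
	struct nsproxy *ns;	/* namespace set recorded at creation */
};

static LIST_HEAD(container_maps);

static u64 container_id_revalidate(const struct nsproxy *ns)
{
	struct container_map *c;

	list_for_each_entry(c, &container_maps, list) {
		if (c->ns->mnt_ns == ns->mnt_ns &&
		    c->ns->net_ns == ns->net_ns &&
		    c->ns->pid_ns_for_children == ns->pid_ns_for_children)
			return c->id;
	}
	return CONTAINER_ID_UNDEFINED;
}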
> paul moore
- RGB
--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Quoting Richard Guy Briggs ([email protected]):
...
> > I believe we are going to need a container ID to container definition
> > (namespace, etc.) mapping mechanism regardless of whether the container ID
> > is provided by userspace or a kernel-generated serial number. This
> > mapping should be recorded in the audit log when the container ID is
> > created/defined.
>
> Agreed.
>
> > > As was suggested in one of the previous threads, if there are any events not
> > > associated with a task (incoming network packets) we log the namespace ID and
> > > then only concern ourselves with its container serial number or container name
> > > once it becomes associated with a task, at which point that tracking will be
> > > more important anyway.
> >
> > Agreed. After all, a single namespace can be shared between multiple
> > containers. For those security officers who need to track individual
> > events like this they will have the container ID mapping information
> > in the logs as well so they should be able to trace the unassociated
> > event to a set of containers.
> >
> > > I'm not convinced that a userspace or kernel generated UUID is that useful
> > > since they are large, not human readable and may not be globally unique given
> > > the "pets vs cattle" direction we are going with potentially identical
> > > conditions in hosts or containers spawning containers, but I see no need to
> > > restrict them.
> >
> > From a kernel perspective I think an int should suffice; after all,
> > you can't have more containers than you have processes. If the
> > container engine requires something more complex, it can use the int
> > as input to its own mapping function.
>
> PIDs roll over. That already causes some ambiguity in reporting. If a
> system is constantly spawning and reaping containers, especially
> single-process containers, I don't want to have to worry about that ID
> rolling over in order to keep track of it, even though there should be
> audit records of the spawn and death of each container. There isn't
> significant cost added here compared with some of the other overhead
> we're dealing with.
Strawman proposal:
1. Each clone/unshare/setns involving a namespace type generates an audit
message along the lines of:
PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET
new auditnsid: 00000002
associated namespaces: (list of all namespace filesystem inode numbers)
2. Userspace (i.e. the container logging daemon here) can watch the audit log
for all messages relating to auditnsid 00000002. Presumably there will be
messages along the lines of "PID 9513 in auditnsid 00000002 cloned...". The
container logging daemon can track those messages and add the new auditnsids
to the list it watches.
3. If a container is migrated (checkpointed and restored here or elsewhere),
userspace can just follow the appropriate logs for the new containers.
Userspace does not ever *request* an auditnsid. They are ephemeral, just a
tool to track the namespaces through the audit log. They are however guaranteed
to never be re-used until reboot.
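As a very rough sketch of what step 1 might look like in the kernel:
audit_log() and task_pid_nr() are existing interfaces, while the record
type number and function name below are assumptions only.

#include <linux/audit.h>
#include <linux/gfp.h>
#include <linux/sched.h>

/* Hypothetical sketch of step 1: emit one record per namespace-creating
 * clone()/unshare()/setns(), carrying the old and new auditnsids and
 * the flags involved.
 */
#define AUDIT_NS_INFO	1331	/* assumed, not an allocated record type */

static void audit_log_ns_change(struct task_struct *tsk, u64 old_nsid,
				u64 new_nsid, unsigned long flags)
{
	audit_log(NULL, GFP_KERNEL, AUDIT_NS_INFO,
		  "pid=%d auditnsid=%llu new-auditnsid=%llu flags=0x%lx",
		  task_pid_nr(tsk), (unsigned long long)old_nsid,
		  (unsigned long long)new_nsid, flags);
}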
(Feels like someone must have proposed this before)
-serge
On Fri, Aug 18, 2017 at 4:03 AM, Richard Guy Briggs <[email protected]> wrote:
> On 2017-08-16 18:21, Paul Moore wrote:
>> On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <[email protected]> wrote:
>> > Hi David,
>> >
>> > I wanted to respond to this thread to attempt some constructive feedback,
>> > better late than never. I had a look at your fsopen/fsmount() patchset(s) to
>> > support this patchset which was interesting, but doesn't directly affect my
>> > work. The primary patch of interest to the audit kernel folks (Paul Moore and
>> > me) is this patch while the rest of the patchset is interesting, but not likely
>> > to directly affect us. This patch has most of what we need to solve our
>> > problem.
>> >
>> > Paul and I agree that audit is going to have a difficult time identifying
>> > containers or even namespaces without some change to the kernel. The audit
>> > subsystem in the kernel needs at least a basic clue about which container
>> > caused an event to be able to report this at the appropriate level and ignore
>> > it at other levels to avoid a DoS.
>>
>> While there is some increased risk of "death by audit", this is really
>> only an issue once we start supporting multiple audit daemons; simply
>> associating auditable events with the container that triggered them
>> shouldn't add any additional overhead (I hope). For a number of use
>> cases, a single auditd running outside the containers, but recording
>> all their events with some type of container attribution will be
>> sufficient. This is step #1.
>>
>> However, we will obviously want to go a bit further and support
>> multiple audit daemons on the system to allow containers to
>> record/process their own events (side note: the non-container auditd
>> instance will still see all the events). There are a number of ways
>> we could tackle this, both via in-kernel and in-userspace record
>> routing, each with their own pros/cons. However, how this works is
>> going to be dependent on how we identify containers and track their
>> audit events: the bits from step #1. For this reason I'm not really
>> interested in worrying about the multiple auditd problem just yet;
>> it's obviously important, and something to keep in mind while working
>> up a solution, but it isn't something we should focus on right now.
>>
>> > We also agree that there will need to be some sort of trigger from userspace to
>> > indicate the creation of a container and its allocated resources and we're not
>> > really picky how that is done, such as a clone flag, a syscall or a sysfs write
>> > (or even a read, I suppose), but there will need to be some permission
>> > restrictions, obviously. (I'd like to see capabilities used for this by adding
>> > a specific container bit to the capabilities bitmask.)
>>
>> To be clear, from an audit perspective I think the only thing we would
>> really care about controlling access to is the creation and assignment
>> of a new audit container ID/token, not necessarily the container
>> itself. It's a small point, but an important one I think.
>>
>> > I doubt we will be able to accommodate all definitions or concepts of a
>> > container in a timely fashion. We'll need to start somewhere with a minimum
>> > definition so that we can get traction and actually move forward before another
>> > compelling shared kernel microservice method leaves our entire community
>> > behind. I'd like to declare that a container is a full set of cloned
>> > namespaces, but this is inefficient, overly constricting and unnecessary for
>> > our needs. If we could agree on a minimum definition of a container (which may
>> > have only one specific cloned namespace) then we have something on which to
>> > build. I could even see a container being defined by a trigger sent from
>> > userspace about a process (task) from which all its children are considered to
>> > be within that container, subject to further nesting.
>>
>> I really would prefer if we could avoid defining the term "container".
>> Even if we manage to get it right at this particular moment, we will
>> surely be made fools a year or two from now when things change. At
>> the very least lets avoid a rigid definition of container, I'll
>> concede that we will probably need to have some definition simply so
>> we can implement something, I just don't want the design or
>> implementation to depend on a particular definition.
>>
>> This comment is jumping ahead a bit, but from an audit perspective I
>> think we handle this by emitting an audit record whenever a container
>> ID is created which describes it as the kernel sees it; as of now that
>> probably means a list of namespace IDs. Richard mentions this in his
>> email, I just wanted to make it clear that I think we should see this
>> as a flexible mechanism. At the very least we will likely see a few
>> more namespaces before the world moves on from containers.
>>
>> > In the simplest usable model for audit, if a container (by definition)
>> > starts a PID namespace, then the container ID could simply be the container's
>> > "init" process PID in the initial PID namespace. This assumes that as soon as
>> > that process vanishes, that entire container and all its children are killed
>> > off (which you've done). There may be some container orchestration systems
>> > that don't use a unique PID namespace per container, and imposing this will
>> > cause them challenges.
>>
>> I don't follow how this would cause challenges if the containers do
>> not use a unique PID namespace; you are suggesting using the PID as
>> seen from the initial PID namespace, yes?
>
> The PID of the "init" process of a container (PID=1 inside container,
> but PID=containerID from the initial PID namespace perspective).
Yep. I still don't see how a container not creating a unique PID
namespace presents a challenge here as the unique information would be
taken from the initial PID namespace.
However, based on some off-list discussions I expect this is going to
be a non-issue in the next proposal.
>> Regardless, I do worry that using a PID could potentially be a bit
>> racy once we start jumping between kernel and userspace (audit
>> configuration, logs, etc.).
>
> How do you think this could be racy? An event happening before or as
> the container has been defined?
It's racy for the same reasons why we have the pid struct in the
kernel. If the orchestrator is referencing things via a PID there is
always some danger of a mixup.
>> > If containers have at minimum a unique mount namespace, then the device and
>> > inode number of the root path's dentry could be used, but there are likely better
>> > identifiers. Again, there may be container orchestrators that don't use a
>> > unique mount namespace per container, and imposing this will cause
>> > challenges.
>> >
>> > I expect there are similar examples for each of the other namespaces.
>>
>> The PID case is a bit unique as each process is going to have a unique
>> PID regardless of namespaces, but even that has some drawbacks as
>> discussed above. As for the other namespaces, I agree that we can't
>> rely on them (see my earlier comments).
>
> (In general, can you specify which earlier comments so we can be sure
> what you are referring to?)
Really? How about the race condition concerns? Come on, Richard ...
>> > If we could pick one namespace type for consensus for which each container has
>> > a unique instance of that namespace, we could use the dev/ino tuple from that
>> > namespace as had originally been suggested by Aristeu Rozanski more than 4
>> > years ago as part of the set of namespace IDs. I had also attempted to
>> > solve this problem by using the namespace's proc inode, then switched over to
>> > generate a unique kernel serial number for each namespace and then went back to
>> > namespace proc dev/ino once Al Viro implemented nsfs:
>> > v1 https://lkml.org/lkml/2014/4/22/662
>> > v2 https://lkml.org/lkml/2014/5/9/637
>> > v3 https://lkml.org/lkml/2014/5/20/287
>> > v4 https://lkml.org/lkml/2014/8/20/844
>> > v5 https://lkml.org/lkml/2014/10/6/25
>> > v6 https://lkml.org/lkml/2015/4/17/48
>> > v7 https://lkml.org/lkml/2015/5/12/773
>> >
>> > These patches don't use a container ID, but track all namespaces in use for an
>> > event. This has the benefit of punting this tracking to userspace for some
>> > other tool to analyse and determine to which container an event belongs.
>> > This will use a lot of bandwidth in audit log files, whereas a single
>> > container ID that doesn't require nesting information to be complete
>> > would be much more efficient.
>>
>> Relying on a particular namespace to identify a container is a
>> non-starter from my perspective for all the reasons previously
>> discussed.
>
> I'd rather not either and suspect there isn't much danger of it, but if
> it is determined that there is one namespace in particular that is a
> minimum requirement, I'd prefer to use that nsID instead of creating an
> additional ID.
>
>> > If we rely only on the setting of arbitrary container names from userspace,
>> > then we must provide a map or tree back to the initial audit domain for that
>> > running kernel to be able to differentiate between potentially identical
>> > container names assigned in a nested container system. If we assign a
>> > container serial number sequentially (atomic64_inc) from the kernel on request
>> > from userspace like the sessionID and log the creation with all nsIDs and the
>> > parent container serial number and/or container name, the nesting is clear
>> > because potentially duplicate names can no longer cause ambiguity. If a container
>> > serial number is used, the tree of inheritance of nested containers can be
>> > rebuilt from the audit records showing what containers were spawned from what
>> > parent.
>>
>> I believe we are going to need a container ID to container definition
>> (namespace, etc.) mapping mechanism regardless of whether the container ID
>> is provided by userspace or a kernel-generated serial number. This
>> mapping should be recorded in the audit log when the container ID is
>> created/defined.
>
> Agreed.
>
>> > As was suggested in one of the previous threads, if there are any events not
>> > associated with a task (incoming network packets) we log the namespace ID and
>> > then only concern ourselves with its container serial number or container name
>> > once it becomes associated with a task, at which point that tracking will be
>> > more important anyway.
>>
>> Agreed. After all, a single namespace can be shared between multiple
>> containers. For those security officers who need to track individual
>> events like this they will have the container ID mapping information
>> in the logs as well so they should be able to trace the unassociated
>> event to a set of containers.
>>
>> > I'm not convinced that a userspace or kernel generated UUID is that useful
>> > since they are large, not human readable and may not be globally unique given
>> > the "pets vs cattle" direction we are going with potentially identical
>> > conditions in hosts or containers spawning containers, but I see no need to
>> > restrict them.
>>
>> From a kernel perspective I think an int should suffice; after all,
>> you can't have more containers than you have processes. If the
>> container engine requires something more complex, it can use the int
>> as input to its own mapping function.
>
> PIDs roll over. That already causes some ambiguity in reporting. If a
> system is constantly spawning and reaping containers, especially
> single-process containers, I don't want to have to worry about that ID
> rolling over in order to keep track of it, even though there should be
> audit records of the spawn and death of each container. There isn't
> significant cost added here compared with some of the other overhead
> we're dealing with.
Fine, make it a u64. I believe that's what I've been proposing in the
off-list discussion if memory serves.
A UUID or a string is not acceptable from my perspective. Too big for
the audit records and not really necessary anyway; a u64 should be
just fine.
... and if anyone dares bring up that 640kb quote I swear I'll NACK
all their patches for the next year :)
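For the record format itself the cost really is tiny; as a minimal
sketch (audit_log_format() is the existing kernel interface, while the
function and the "contid" field name are assumptions):

#include <linux/audit.h>

/* Minimal sketch: append a fixed-width u64 container ID to a record
 * being built.
 */
static void audit_log_contid(struct audit_buffer *ab, u64 contid)
{
	audit_log_format(ab, " contid=%llu", (unsigned long long)contid);
}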
>> > How do we deal with setns()? Once it is determined that action is permitted,
>> > given the new combination of namespaces and potential membership in a different
>> > container, record the transition from one container to another, including all
>> > namespaces if the latter are a different subset from the target container's
>> > initial set.
>>
>> That is a fun one, isn't it? I think this is where the container
>> ID-to-definition mapping comes into play. If setns() changes the
>> process such that the existing container ID is no longer valid then we
>> need to do a new lookup in the table to see if another container ID is
>> valid; if no established container ID mappings are valid, the
>> container ID becomes "undefined".
>
> Hopefully we can design this stuff so that container IDs are still valid
> while that transition occurs.
>
>> paul moore
>
> - RGB
>
> --
> Richard Guy Briggs <[email protected]>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
--
paul moore
http://www.paul-moore.com
On 2017-09-06 09:03, Serge E. Hallyn wrote:
> Quoting Richard Guy Briggs ([email protected]):
> ...
> > > I believe we are going to need a container ID to container definition
> > > (namespace, etc.) mapping mechanism regardless of whether the container ID
> > > is provided by userspace or a kernel-generated serial number. This
> > > mapping should be recorded in the audit log when the container ID is
> > > created/defined.
> >
> > Agreed.
> >
> > > > As was suggested in one of the previous threads, if there are any events not
> > > > associated with a task (incoming network packets) we log the namespace ID and
> > > > then only concern ourselves with its container serial number or container name
> > > > once it becomes associated with a task, at which point that tracking will be
> > > > more important anyway.
> > >
> > > Agreed. After all, a single namespace can be shared between multiple
> > > containers. For those security officers who need to track individual
> > > events like this they will have the container ID mapping information
> > > in the logs as well so they should be able to trace the unassociated
> > > event to a set of containers.
> > >
> > > > I'm not convinced that a userspace or kernel generated UUID is that useful
> > > > since they are large, not human readable and may not be globally unique given
> > > > the "pets vs cattle" direction we are going with potentially identical
> > > > conditions in hosts or containers spawning containers, but I see no need to
> > > > restrict them.
> > >
> > > From a kernel perspective I think an int should suffice; after all,
> > > you can't have more containers than you have processes. If the
> > > container engine requires something more complex, it can use the int
> > > as input to its own mapping function.
> >
> > PIDs roll over. That already causes some ambiguity in reporting. If a
> > system is constantly spawning and reaping containers, especially
> > single-process containers, I don't want to have to worry about that ID
> > rolling over in order to keep track of it, even though there should be
> > audit records of the spawn and death of each container. There isn't
> > significant cost added here compared with some of the other overhead
> > we're dealing with.
>
> Strawman proposal:
>
> 1. Each clone/unshare/setns involving a namespace type generates an audit
> message along the lines of:
>
> PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET
> new auditnsid: 00000002
> associated namespaces: (list of all namespace filesystem inode numbers)
As you will have seen, this is pretty much what my most recent proposal suggests.
> 2. Userspace (i.e. the container logging daemon here) can watch the audit log
> for all messages relating to auditnsid 00000002. Presumably there will be
> messages along the lines of "PID 9513 in auditnsid 00000002 cloned...". The
> container logging daemon can track those messages and add the new auditnsids
> to the list it watches.
Yes.
> 3. If a container is migrated (checkpointed and restored here or elsewhere),
> userspace can just follow the appropriate logs for the new containers.
Yes.
> Userspace does not ever *request* an auditnsid. They are ephemeral, just a
> tool to track the namespaces through the audit log. They are however guaranteed
> to never be re-used until reboot.
Well, this is where things get controversial... I had wanted this, a
kernel-generated serial number unique to a running kernel to track every
container initiation, but this does have some CRIU challenges pointed
out by Eric Biederman. Nested containers will not have a consistent
view on a new host, and there is no way to make that view consistent.
If we could
guarantee that containers would never be nested, this could be workable.
I think nesting is inevitable in the future given the variety and
creativity of the orchestration tools, so restricting this seems
short-sighted.
At the moment the approach is to let the orchestrator determine the ID of
a container. Some have argued for something as small as a u32 and others
for a full UUID. A u32 runs the risk of rolling over, so a u64 seems like
a reasonable step to solve that issue. Others would like to be able to
store a full UUID, which seemed like a good idea at the outset, but on
further thought this is something the orchestrator can manage itself,
minimising the number of bits of required information per audit record
while still guaranteeing we can identify the provenance of a particular
audit event. Let's see if we can make it work with a u64.
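For illustration only, the orchestrator could assign such an ID to a
child task before it execs the container's init; the proc file below is
purely an assumed interface for the sketch, not something these patches
define.

#include <stdio.h>
#include <inttypes.h>
#include <sys/types.h>

/* Hypothetical userspace sketch: write a u64 container ID for a child
 * task. The proc path is an assumption for illustration only.
 */
static int set_container_id(pid_t pid, uint64_t id)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/containerid", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%" PRIu64 "\n", id);
	return fclose(f);
}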
> (Feels like someone must have proposed this before)
These ideas have been thrown around a few times and I'm starting to
understand them better.
> -serge
- RGB
--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635