2012-11-19 15:08:27

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 0/16] user namespace and namespace infrastructure completion


The following series of changes completes the user namespace and adds
the much too long delay bits of namespace infrastructure.

This series of changes adds unprivilged creation of all namespaces
support for creating a user namespace with unshare, and support for
entering a user namespace with setns.

The proc namespace files are converted into magic symlinks to avoid
problems with dentry caching excessively keeping a namespace alive and
dentry caching allowing the ptrace_may_access checks to be bypassed.

The proc namespace now have inode numbers that are always the same for
the same user namespace allowing stat to test if two file descriptors
refer to the same namespace.

Eric W. Biederman (16):
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: Allow unprivileged users to create user namespaces.
userns: Allow chown and setgid preservation
userns: Allow setting a userns mapping to your current uid.
userns: Allow unprivileged users to create new namespaces
userns: Allow unprivileged use of setns.
userns: Make create_new_namespaces take a user_ns parameter
userns: Kill task_user_ns
userns: Implent proc namespace operations
userns: Implement unshare of the user namespace
procfs: Print task uids and gids in the userns that opened the proc file
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
userns: Allow unprivilged mounts of proc and sysfs
proc: Generalize proc inode allocation
proc: Fix the namespace inode permission checks.
proc: Usable inode numbers for the namespace file descriptors.

fs/attr.c | 11 ++-
fs/exec.c | 9 +--
fs/mount.h | 1 +
fs/namespace.c | 14 +++
fs/proc/array.c | 2 +-
fs/proc/generic.c | 26 +++---
fs/proc/inode.c | 6 +-
fs/proc/namespaces.c | 177 +++++++++++++++++++++++++++++++++++-----
fs/proc/root.c | 1 +
fs/sysfs/mount.c | 1 +
include/linux/cred.h | 2 -
include/linux/ipc_namespace.h | 9 ++-
include/linux/nsproxy.h | 2 +-
include/linux/pid_namespace.h | 1 +
include/linux/proc_fs.h | 18 ++++-
include/linux/user_namespace.h | 10 ++
include/linux/utsname.h | 7 +-
include/net/net_namespace.h | 2 +
init/version.c | 2 +
ipc/msgutil.c | 2 +
ipc/namespace.c | 32 ++++++--
kernel/fork.c | 33 +++++---
kernel/nsproxy.c | 34 ++++----
kernel/pid.c | 1 +
kernel/pid_namespace.c | 12 +++
kernel/ptrace.c | 10 ++-
kernel/sched/core.c | 10 ++-
kernel/user.c | 2 +
kernel/user_namespace.c | 147 +++++++++++++++++++++++++++++----
kernel/utsname.c | 33 ++++++--
net/core/net_namespace.c | 31 +++++++-
security/yama/yama_lsm.c | 12 ++-
32 files changed, 535 insertions(+), 125 deletions(-)


2012-11-19 15:12:57

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 01/16] userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped

From: "Eric W. Biederman" <[email protected]>

When performing an exec where the binary lives in one user namespace and
the execing process lives in another usre namespace there is the possibility
that the target uids can not be represented.

Instead of failing the exec simply ignore the suid/sgid bits and run
the binary with lower privileges. We already do this in the case
of MNT_NOSUID so this should be a well tested code path.

As the user and group are not changed this should not introduce any
security issues.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/exec.c | 9 +++------
1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0039055..aef0c2f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,14 +1266,13 @@ int prepare_binprm(struct linux_binprm *bprm)
bprm->cred->egid = current_egid();

if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
- !current->no_new_privs) {
+ !current->no_new_privs &&
+ kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
+ kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
/* Set-uid? */
if (mode & S_ISUID) {
- if (!kuid_has_mapping(bprm->cred->user_ns, inode->i_uid))
- return -EPERM;
bprm->per_clear |= PER_CLEAR_ON_SETID;
bprm->cred->euid = inode->i_uid;
-
}

/* Set-gid? */
@@ -1283,8 +1282,6 @@ int prepare_binprm(struct linux_binprm *bprm)
* executable.
*/
if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
- if (!kgid_has_mapping(bprm->cred->user_ns, inode->i_gid))
- return -EPERM;
bprm->per_clear |= PER_CLEAR_ON_SETID;
bprm->cred->egid = inode->i_gid;
}
--
1.7.5.4

2012-11-19 15:13:06

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 03/16] userns: Allow chown and setgid preservation

From: "Eric W. Biederman" <[email protected]>

- Allow chown if CAP_CHOWN is present in the current user namespace
and the uid of the inode maps into the current user namespace, and
the destination uid or gid maps into the current user namespace.

- Allow perserving setgid when changing an inode if CAP_FSETID is
present in the current user namespace and the owner of the file has
a mapping into the current user namespace.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/attr.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index cce7df5..1449adb 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -49,14 +49,15 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
/* Make sure a caller can chown. */
if ((ia_valid & ATTR_UID) &&
(!uid_eq(current_fsuid(), inode->i_uid) ||
- !uid_eq(attr->ia_uid, inode->i_uid)) && !capable(CAP_CHOWN))
+ !uid_eq(attr->ia_uid, inode->i_uid)) &&
+ !inode_capable(inode, CAP_CHOWN))
return -EPERM;

/* Make sure caller can chgrp. */
if ((ia_valid & ATTR_GID) &&
(!uid_eq(current_fsuid(), inode->i_uid) ||
(!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
- !capable(CAP_CHOWN))
+ !inode_capable(inode, CAP_CHOWN))
return -EPERM;

/* Make sure a caller can chmod. */
@@ -65,7 +66,8 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
return -EPERM;
/* Also check the setgid bit! */
if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
- inode->i_gid) && !capable(CAP_FSETID))
+ inode->i_gid) &&
+ !inode_capable(inode, CAP_FSETID))
attr->ia_mode &= ~S_ISGID;
}

@@ -157,7 +159,8 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;

- if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+ if (!in_group_p(inode->i_gid) &&
+ !inode_capable(inode, CAP_FSETID))
mode &= ~S_ISGID;
inode->i_mode = mode;
}
--
1.7.5.4

2012-11-19 15:13:12

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 04/16] userns: Allow setting a userns mapping to your current uid.

From: "Eric W. Biederman" <[email protected]>

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/user_namespace.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 456a6b9..49096d5 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -709,6 +709,21 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf, size_t
static bool new_idmap_permitted(struct user_namespace *ns, int cap_setid,
struct uid_gid_map *new_map)
{
+ /* Allow mapping to your own filesystem ids */
+ if ((new_map->nr_extents == 1) && (new_map->extent[0].count == 1)) {
+ u32 id = new_map->extent[0].lower_first;
+ if (cap_setid == CAP_SETUID) {
+ kuid_t uid = make_kuid(ns->parent, id);
+ if (uid_eq(uid, current_fsuid()))
+ return true;
+ }
+ else if (cap_setid == CAP_SETGID) {
+ kgid_t gid = make_kgid(ns->parent, id);
+ if (gid_eq(gid, current_fsgid()))
+ return true;
+ }
+ }
+
/* Allow anyone to set a mapping that doesn't require privilege */
if (!cap_valid(cap_setid))
return true;
--
1.7.5.4

2012-11-19 15:13:19

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 06/16] userns: Allow unprivileged use of setns.

From: "Eric W. Biederman" <[email protected]>

- Push the permission check from the core setns syscall into
the setns install methods where the user namespace of the
target namespace can be determined, and used in a ns_capable
call.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
ipc/namespace.c | 6 +++++-
kernel/nsproxy.c | 3 ---
kernel/utsname.c | 7 ++++++-
net/core/net_namespace.c | 7 ++++++-
4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index f362298c..6ed33c0 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -161,8 +161,12 @@ static void ipcns_put(void *ns)
return put_ipc_ns(ns);
}

-static int ipcns_install(struct nsproxy *nsproxy, void *ns)
+static int ipcns_install(struct nsproxy *nsproxy, void *new)
{
+ struct ipc_namespace *ns = new;
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
/* Ditch state from the old ipc namespace */
exit_sem(current);
put_ipc_ns(nsproxy->ipc_ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index a214e0e..4357a0a 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -242,9 +242,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
struct file *file;
int err;

- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
file = proc_ns_fget(fd);
if (IS_ERR(file))
return PTR_ERR(file);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 679d97a..4a9362f 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -102,8 +102,13 @@ static void utsns_put(void *ns)
put_uts_ns(ns);
}

-static int utsns_install(struct nsproxy *nsproxy, void *ns)
+static int utsns_install(struct nsproxy *nsproxy, void *new)
{
+ struct uts_namespace *ns = new;
+
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
get_uts_ns(ns);
put_uts_ns(nsproxy->uts_ns);
nsproxy->uts_ns = ns;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6456439..ec2870b 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -630,8 +630,13 @@ static void netns_put(void *ns)

static int netns_install(struct nsproxy *nsproxy, void *ns)
{
+ struct net *net = ns;
+
+ if (!ns_capable(net->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
put_net(nsproxy->net_ns);
- nsproxy->net_ns = get_net(ns);
+ nsproxy->net_ns = get_net(net);
return 0;
}

--
1.7.5.4

2012-11-19 15:13:26

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 08/16] userns: Kill task_user_ns

From: "Eric W. Biederman" <[email protected]>

The task_user_ns function hides the fact that it is getting the user
namespace from struct cred on the task. struct cred may go away as
soon as the rcu lock is released. This leads to a race where we
can dereference a stale user namespace pointer.

To make it obvious a struct cred is involved kill task_user_ns.

To kill the race modify the users of task_user_ns to only
reference the user namespace while the rcu lock is held.

Cc: Kees Cook <[email protected]>
Cc: James Morris <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/cred.h | 2 --
kernel/ptrace.c | 10 ++++++++--
kernel/sched/core.c | 10 ++++++++--
security/yama/yama_lsm.c | 12 +++++++++---
4 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index ebbed2c..856d262 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -357,10 +357,8 @@ static inline void put_cred(const struct cred *_cred)
extern struct user_namespace init_user_ns;
#ifdef CONFIG_USER_NS
#define current_user_ns() (current_cred_xxx(user_ns))
-#define task_user_ns(task) (task_cred_xxx((task), user_ns))
#else
#define current_user_ns() (&init_user_ns)
-#define task_user_ns(task) (&init_user_ns)
#endif


diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1f5e55d..7b09b88 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -215,8 +215,12 @@ ok:
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
- if (!dumpable && !ptrace_has_cap(task_user_ns(task), mode))
+ rcu_read_lock();
+ if (!dumpable && !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
+ rcu_read_unlock();
return -EPERM;
+ }
+ rcu_read_unlock();

return security_ptrace_access_check(task, mode);
}
@@ -280,8 +284,10 @@ static int ptrace_attach(struct task_struct *task, long request,

if (seize)
flags |= PT_SEIZED;
- if (ns_capable(task_user_ns(task), CAP_SYS_PTRACE))
+ rcu_read_lock();
+ if (ns_capable(__task_cred(task)->user_ns, CAP_SYS_PTRACE))
flags |= PT_PTRACE_CAP;
+ rcu_read_unlock();
task->ptrace = flags;

__ptrace_link(task, current);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..2f5eb18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4029,8 +4029,14 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
goto out_free_cpus_allowed;
}
retval = -EPERM;
- if (!check_same_owner(p) && !ns_capable(task_user_ns(p), CAP_SYS_NICE))
- goto out_unlock;
+ if (!check_same_owner(p)) {
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
+ rcu_read_unlock();
+ goto out_unlock;
+ }
+ rcu_read_unlock();
+ }

retval = security_task_setscheduler(p);
if (retval)
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index b4c2984..0e72239 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -262,14 +262,18 @@ int yama_ptrace_access_check(struct task_struct *child,
/* No additional restrictions. */
break;
case YAMA_SCOPE_RELATIONAL:
+ rcu_read_lock();
if (!task_is_descendant(current, child) &&
!ptracer_exception_found(current, child) &&
- !ns_capable(task_user_ns(child), CAP_SYS_PTRACE))
+ !ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
rc = -EPERM;
+ rcu_read_unlock();
break;
case YAMA_SCOPE_CAPABILITY:
- if (!ns_capable(task_user_ns(child), CAP_SYS_PTRACE))
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(child)->user_ns, CAP_SYS_PTRACE))
rc = -EPERM;
+ rcu_read_unlock();
break;
case YAMA_SCOPE_NO_ATTACH:
default:
@@ -307,8 +311,10 @@ int yama_ptrace_traceme(struct task_struct *parent)
/* Only disallow PTRACE_TRACEME on more aggressive settings. */
switch (ptrace_scope) {
case YAMA_SCOPE_CAPABILITY:
- if (!ns_capable(task_user_ns(parent), CAP_SYS_PTRACE))
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(parent)->user_ns, CAP_SYS_PTRACE))
rc = -EPERM;
+ rcu_read_unlock();
break;
case YAMA_SCOPE_NO_ATTACH:
rc = -EPERM;
--
1.7.5.4

2012-11-19 15:13:33

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 10/16] userns: Implement unshare of the user namespace

From: "Eric W. Biederman" <[email protected]>

- Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected
As changing user namespaces is only valid if all there is only
a single thread.
- Restore the code to add CLONE_VM if CLONE_THREAD is selected and
the code to addCLONE_SIGHAND if CLONE_VM is selected.
Making the constraints in the code clear.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/nsproxy.h | 2 +-
include/linux/user_namespace.h | 9 +++++++++
kernel/fork.c | 25 ++++++++++++++++++++++---
kernel/nsproxy.c | 8 ++++----
kernel/user_namespace.c | 15 +++++++++++++++
5 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cc37a55..10e5947 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -67,7 +67,7 @@ void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
void free_nsproxy(struct nsproxy *ns);
int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
- struct fs_struct *);
+ struct cred *, struct fs_struct *);
int __init nsproxy_cache_init(void);

static inline void put_nsproxy(struct nsproxy *ns)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 95142ca..17651f0 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,7 @@ static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
}

extern int create_user_ns(struct cred *new);
+extern int unshare_userns(unsigned long unshare_flags, struct cred **new_cred);
extern void free_user_ns(struct kref *kref);

static inline void put_user_ns(struct user_namespace *ns)
@@ -66,6 +67,14 @@ static inline int create_user_ns(struct cred *new)
return -EINVAL;
}

+static inline int unshare_userns(unsigned long unshare_flags,
+ struct cred **new_cred)
+{
+ if (unshare_flags & CLONE_NEWUSER)
+ return -EINVAL;
+ return 0;
+}
+
static inline void put_user_ns(struct user_namespace *ns)
{
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 8c29abb..38e53b8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1687,7 +1687,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
@@ -1754,11 +1754,17 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
struct fs_struct *fs, *new_fs = NULL;
struct files_struct *fd, *new_fd = NULL;
+ struct cred *new_cred = NULL;
struct nsproxy *new_nsproxy = NULL;
int do_sysvsem = 0;
int err;

/*
+ * If unsharing a user namespace must also unshare the thread.
+ */
+ if (unshare_flags & CLONE_NEWUSER)
+ unshare_flags |= CLONE_THREAD;
+ /*
* If unsharing a pid namespace must also unshare the thread.
*/
if (unshare_flags & CLONE_NEWPID)
@@ -1795,11 +1801,15 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
err = unshare_fd(unshare_flags, &new_fd);
if (err)
goto bad_unshare_cleanup_fs;
- err = unshare_nsproxy_namespaces(unshare_flags, &new_nsproxy, new_fs);
+ err = unshare_userns(unshare_flags, &new_cred);
if (err)
goto bad_unshare_cleanup_fd;
+ err = unshare_nsproxy_namespaces(unshare_flags, &new_nsproxy,
+ new_cred, new_fs);
+ if (err)
+ goto bad_unshare_cleanup_cred;

- if (new_fs || new_fd || do_sysvsem || new_nsproxy) {
+ if (new_fs || new_fd || do_sysvsem || new_cred || new_nsproxy) {
if (do_sysvsem) {
/*
* CLONE_SYSVSEM is equivalent to sys_exit().
@@ -1832,11 +1842,20 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
}

task_unlock(current);
+
+ if (new_cred) {
+ /* Install the new user namespace */
+ commit_creds(new_cred);
+ new_cred = NULL;
+ }
}

if (new_nsproxy)
put_nsproxy(new_nsproxy);

+bad_unshare_cleanup_cred:
+ if (new_cred)
+ put_cred(new_cred);
bad_unshare_cleanup_fd:
if (new_fd)
put_files_struct(new_fd);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 2ddd816..78e2ecb 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -186,7 +186,7 @@ void free_nsproxy(struct nsproxy *ns)
* On success, returns the new nsproxy.
*/
int unshare_nsproxy_namespaces(unsigned long unshare_flags,
- struct nsproxy **new_nsp, struct fs_struct *new_fs)
+ struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs)
{
struct user_namespace *user_ns;
int err = 0;
@@ -195,12 +195,12 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
CLONE_NEWNET | CLONE_NEWPID)))
return 0;

- if (!nsown_capable(CAP_SYS_ADMIN))
+ user_ns = new_cred ? new_cred->user_ns : current_user_ns();
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;

- user_ns = current_user_ns();
*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
- new_fs ? new_fs : current->fs);
+ new_fs ? new_fs : current->fs);
if (IS_ERR(*new_nsp)) {
err = PTR_ERR(*new_nsp);
goto out;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index a946077..ce92f7e 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -82,6 +82,21 @@ int create_user_ns(struct cred *new)
return 0;
}

+int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
+{
+ struct cred *cred;
+
+ if (!(unshare_flags & CLONE_NEWUSER))
+ return 0;
+
+ cred = prepare_creds();
+ if (!cred)
+ return -ENOMEM;
+
+ *new_cred = cred;
+ return create_user_ns(cred);
+}
+
void free_user_ns(struct kref *kref)
{
struct user_namespace *parent, *ns =
--
1.7.5.4

2012-11-19 15:13:49

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 09/16] userns: Implent proc namespace operations

From: "Eric W. Biederman" <[email protected]>

This allows entering a user namespace, and the ability
to store a reference to a user namespace with a bind
mount.

Addition of missing userns_ns_put in userns_install
from Gao feng <[email protected]>

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/namespaces.c | 4 ++
include/linux/proc_fs.h | 1 +
kernel/user_namespace.c | 90 ++++++++++++++++++++++++++++++++++++++---------
3 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 2a17fd9..030250c 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -11,6 +11,7 @@
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
#include "internal.h"


@@ -27,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
#ifdef CONFIG_PID_NS
&pidns_operations,
#endif
+#ifdef CONFIG_USER_NS
+ &userns_operations,
+#endif
&mntns_operations,
};

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 9014c04..3144781 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -258,6 +258,7 @@ extern const struct proc_ns_operations netns_operations;
extern const struct proc_ns_operations utsns_operations;
extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
+extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;

union proc_op {
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 49096d5..a946077 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
#include <linux/nsproxy.h>
#include <linux/slab.h>
#include <linux/user_namespace.h>
+#include <linux/proc_fs.h>
#include <linux/highuid.h>
#include <linux/cred.h>
#include <linux/securebits.h>
@@ -26,6 +27,24 @@ static struct kmem_cache *user_ns_cachep __read_mostly;
static bool new_idmap_permitted(struct user_namespace *ns, int cap_setid,
struct uid_gid_map *map);

+static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
+{
+ /* Start with the same capabilities as init but useless for doing
+ * anything as the capabilities are bound to the new user namespace.
+ */
+ cred->securebits = SECUREBITS_DEFAULT;
+ cred->cap_inheritable = CAP_EMPTY_SET;
+ cred->cap_permitted = CAP_FULL_SET;
+ cred->cap_effective = CAP_FULL_SET;
+ cred->cap_bset = CAP_FULL_SET;
+#ifdef CONFIG_KEYS
+ key_put(cred->request_key_auth);
+ cred->request_key_auth = NULL;
+#endif
+ /* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */
+ cred->user_ns = user_ns;
+}
+
/*
* Create a new user namespace, deriving the creator from the user in the
* passed credentials, and replacing that user with the new root user for the
@@ -53,27 +72,12 @@ int create_user_ns(struct cred *new)
return -ENOMEM;

kref_init(&ns->kref);
+ /* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
ns->owner = owner;
ns->group = group;

- /* Start with the same capabilities as init but useless for doing
- * anything as the capabilities are bound to the new user namespace.
- */
- new->securebits = SECUREBITS_DEFAULT;
- new->cap_inheritable = CAP_EMPTY_SET;
- new->cap_permitted = CAP_FULL_SET;
- new->cap_effective = CAP_FULL_SET;
- new->cap_bset = CAP_FULL_SET;
-#ifdef CONFIG_KEYS
- key_put(new->request_key_auth);
- new->request_key_auth = NULL;
-#endif
- /* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */
-
- /* Leave the new->user_ns reference with the new user namespace. */
- /* Leave the reference to our user_ns with the new cred. */
- new->user_ns = ns;
+ set_cred_user_ns(new, ns);

return 0;
}
@@ -737,6 +741,58 @@ static bool new_idmap_permitted(struct user_namespace *ns, int cap_setid,
return false;
}

+static void *userns_get(struct task_struct *task)
+{
+ struct user_namespace *user_ns;
+
+ rcu_read_lock();
+ user_ns = get_user_ns(__task_cred(task)->user_ns);
+ rcu_read_unlock();
+
+ return user_ns;
+}
+
+static void userns_put(void *ns)
+{
+ put_user_ns(ns);
+}
+
+static int userns_install(struct nsproxy *nsproxy, void *ns)
+{
+ struct user_namespace *user_ns = ns;
+ struct cred *cred;
+
+ /* Don't allow gaining capabilities by reentering
+ * the same user namespace.
+ */
+ if (user_ns == current_user_ns())
+ return -EINVAL;
+
+ /* Threaded many not enter a different user namespace */
+ if (atomic_read(&current->mm->mm_users) > 1)
+ return -EINVAL;
+
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ cred = prepare_creds();
+ if (!cred)
+ return -ENOMEM;
+
+ put_user_ns(cred->user_ns);
+ set_cred_user_ns(cred, get_user_ns(user_ns));
+
+ return commit_creds(cred);
+}
+
+const struct proc_ns_operations userns_operations = {
+ .name = "user",
+ .type = CLONE_NEWUSER,
+ .get = userns_get,
+ .put = userns_put,
+ .install = userns_install,
+};
+
static __init int user_namespaces_init(void)
{
user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
--
1.7.5.4

2012-11-19 15:14:20

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 07/16] userns: Make create_new_namespaces take a user_ns parameter

From: "Eric W. Biederman" <[email protected]>

Modify create_new_namespaces to explicitly take a user namespace
parameter, instead of implicitly through the task_struct.

This allows an implementation of unshare(CLONE_NEWUSER) where
the new user namespace is not stored onto the current task_struct
until after all of the namespaces are created.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ipc_namespace.h | 7 ++++---
include/linux/utsname.h | 6 +++---
ipc/namespace.c | 10 ++++------
kernel/nsproxy.c | 22 +++++++++++++---------
kernel/utsname.c | 9 ++++-----
5 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 5499c92..f03af70 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -133,7 +133,8 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; }

#if defined(CONFIG_IPC_NS)
extern struct ipc_namespace *copy_ipcs(unsigned long flags,
- struct task_struct *tsk);
+ struct user_namespace *user_ns, struct ipc_namespace *ns);
+
static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
{
if (ns)
@@ -144,12 +145,12 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
extern void put_ipc_ns(struct ipc_namespace *ns);
#else
static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
- struct task_struct *tsk)
+ struct user_namespace *user_ns, struct ipc_namespace *ns)
{
if (flags & CLONE_NEWIPC)
return ERR_PTR(-EINVAL);

- return tsk->nsproxy->ipc_ns;
+ return ns;
}

static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 2b34520..221f4a0 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -33,7 +33,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
}

extern struct uts_namespace *copy_utsname(unsigned long flags,
- struct task_struct *tsk);
+ struct user_namespace *user_ns, struct uts_namespace *old_ns);
extern void free_uts_ns(struct kref *kref);

static inline void put_uts_ns(struct uts_namespace *ns)
@@ -50,12 +50,12 @@ static inline void put_uts_ns(struct uts_namespace *ns)
}

static inline struct uts_namespace *copy_utsname(unsigned long flags,
- struct task_struct *tsk)
+ struct user_namespace *user_ns, struct uts_namespace *old_ns)
{
if (flags & CLONE_NEWUTS)
return ERR_PTR(-EINVAL);

- return tsk->nsproxy->uts_ns;
+ return old_ns;
}
#endif

diff --git a/ipc/namespace.c b/ipc/namespace.c
index 6ed33c0..72c8682 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -16,7 +16,7 @@

#include "util.h"

-static struct ipc_namespace *create_ipc_ns(struct task_struct *tsk,
+static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
struct ipc_namespace *old_ns)
{
struct ipc_namespace *ns;
@@ -46,19 +46,17 @@ static struct ipc_namespace *create_ipc_ns(struct task_struct *tsk,
ipcns_notify(IPCNS_CREATED);
register_ipcns_notifier(ns);

- ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns));
+ ns->user_ns = get_user_ns(user_ns);

return ns;
}

struct ipc_namespace *copy_ipcs(unsigned long flags,
- struct task_struct *tsk)
+ struct user_namespace *user_ns, struct ipc_namespace *ns)
{
- struct ipc_namespace *ns = tsk->nsproxy->ipc_ns;
-
if (!(flags & CLONE_NEWIPC))
return get_ipc_ns(ns);
- return create_ipc_ns(tsk, ns);
+ return create_ipc_ns(user_ns, ns);
}

/*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4357a0a..2ddd816 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -57,7 +57,8 @@ static inline struct nsproxy *create_nsproxy(void)
* leave it to the caller to do proper locking and attach it to task.
*/
static struct nsproxy *create_new_namespaces(unsigned long flags,
- struct task_struct *tsk, struct fs_struct *new_fs)
+ struct task_struct *tsk, struct user_namespace *user_ns,
+ struct fs_struct *new_fs)
{
struct nsproxy *new_nsp;
int err;
@@ -66,31 +67,31 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
if (!new_nsp)
return ERR_PTR(-ENOMEM);

- new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, task_cred_xxx(tsk, user_ns), new_fs);
+ new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
if (IS_ERR(new_nsp->mnt_ns)) {
err = PTR_ERR(new_nsp->mnt_ns);
goto out_ns;
}

- new_nsp->uts_ns = copy_utsname(flags, tsk);
+ new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
if (IS_ERR(new_nsp->uts_ns)) {
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}

- new_nsp->ipc_ns = copy_ipcs(flags, tsk);
+ new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
err = PTR_ERR(new_nsp->ipc_ns);
goto out_ipc;
}

- new_nsp->pid_ns = copy_pid_ns(flags, task_cred_xxx(tsk, user_ns), tsk->nsproxy->pid_ns);
+ new_nsp->pid_ns = copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns);
if (IS_ERR(new_nsp->pid_ns)) {
err = PTR_ERR(new_nsp->pid_ns);
goto out_pid;
}

- new_nsp->net_ns = copy_net_ns(flags, task_cred_xxx(tsk, user_ns), tsk->nsproxy->net_ns);
+ new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
goto out_net;
@@ -152,7 +153,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
goto out;
}

- new_ns = create_new_namespaces(flags, tsk, tsk->fs);
+ new_ns = create_new_namespaces(flags, tsk,
+ task_cred_xxx(tsk, user_ns), tsk->fs);
if (IS_ERR(new_ns)) {
err = PTR_ERR(new_ns);
goto out;
@@ -186,6 +188,7 @@ void free_nsproxy(struct nsproxy *ns)
int unshare_nsproxy_namespaces(unsigned long unshare_flags,
struct nsproxy **new_nsp, struct fs_struct *new_fs)
{
+ struct user_namespace *user_ns;
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
@@ -195,7 +198,8 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
if (!nsown_capable(CAP_SYS_ADMIN))
return -EPERM;

- *new_nsp = create_new_namespaces(unshare_flags, current,
+ user_ns = current_user_ns();
+ *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
new_fs ? new_fs : current->fs);
if (IS_ERR(*new_nsp)) {
err = PTR_ERR(*new_nsp);
@@ -252,7 +256,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
if (nstype && (ops->type != nstype))
goto out;

- new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+ new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
if (IS_ERR(new_nsproxy)) {
err = PTR_ERR(new_nsproxy);
goto out;
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 4a9362f..fdc619e 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -32,7 +32,7 @@ static struct uts_namespace *create_uts_ns(void)
* @old_ns: namespace to clone
* Return NULL on error (failure to kmalloc), new ns otherwise
*/
-static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,
+static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
struct uts_namespace *old_ns)
{
struct uts_namespace *ns;
@@ -43,7 +43,7 @@ static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,

down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
- ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns));
+ ns->user_ns = get_user_ns(user_ns);
up_read(&uts_sem);
return ns;
}
@@ -55,9 +55,8 @@ static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,
* versa.
*/
struct uts_namespace *copy_utsname(unsigned long flags,
- struct task_struct *tsk)
+ struct user_namespace *user_ns, struct uts_namespace *old_ns)
{
- struct uts_namespace *old_ns = tsk->nsproxy->uts_ns;
struct uts_namespace *new_ns;

BUG_ON(!old_ns);
@@ -66,7 +65,7 @@ struct uts_namespace *copy_utsname(unsigned long flags,
if (!(flags & CLONE_NEWUTS))
return old_ns;

- new_ns = clone_uts_ns(tsk, old_ns);
+ new_ns = clone_uts_ns(user_ns, old_ns);

put_uts_ns(old_ns);
return new_ns;
--
1.7.5.4

2012-11-19 15:14:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 05/16] userns: Allow unprivileged users to create new namespaces

From: "Eric W. Biederman" <[email protected]>

If an unprivileged user has the appropriate capabilities in their
current user namespace allow the creation of new namespaces.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/nsproxy.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 7f8b051..a214e0e 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -122,6 +122,7 @@ out_ns:
int copy_namespaces(unsigned long flags, struct task_struct *tsk)
{
struct nsproxy *old_ns = tsk->nsproxy;
+ struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
int err = 0;

@@ -134,7 +135,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
CLONE_NEWPID | CLONE_NEWNET)))
return 0;

- if (!capable(CAP_SYS_ADMIN)) {
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN)) {
err = -EPERM;
goto out;
}
@@ -191,7 +192,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
CLONE_NEWNET | CLONE_NEWPID)))
return 0;

- if (!capable(CAP_SYS_ADMIN))
+ if (!nsown_capable(CAP_SYS_ADMIN))
return -EPERM;

*new_nsp = create_new_namespaces(unshare_flags, current,
--
1.7.5.4

2012-11-19 15:13:03

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 02/16] userns: Allow unprivileged users to create user namespaces.

From: "Eric W. Biederman" <[email protected]>

Now that we have been through every permission check in the kernel
having uid == 0 and gid == 0 in your local user namespace no
longer adds any special privileges. Even having a full set
of caps in your local user namespace is safe because capabilies
are relative to your local user namespace, and do not confer
unexpected privileges.

Over the long term this should allow much more of the kernels
functionality to be safely used by non-root users. Functionality
like unsharing the mount namespace that is only unsafe because
it can fool applications whose privileges are raised when they
are executed. Since those applications have no privileges in
a user namespaces it becomes safe to spoof and confuse those
applications all you want.

Those capabilities will still need to be enabled carefully because
we may still need things like rlimits on the number of unprivileged
mounts but that is to avoid DOS attacks not to avoid fooling root
owned processes.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
kernel/fork.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 811ffba..8c29abb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1569,14 +1569,6 @@ long do_fork(unsigned long clone_flags,
if (clone_flags & (CLONE_THREAD|CLONE_PARENT))
return -EINVAL;
}
- if (clone_flags & CLONE_NEWUSER) {
- /* hopefully this check will go away when userns support is
- * complete
- */
- if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
- !capable(CAP_SETGID))
- return -EPERM;
- }

/*
* Determine whether and which event to report to ptracer. When
--
1.7.5.4

2012-11-19 15:17:09

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 12/16] userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file

From: "Eric W. Biederman" <[email protected]>

To keep things sane in the context of file descriptor passing derive the
user namespace that uids are mapped into from the opener of the file
instead of from current.

When writing to the maps file the lower user namespace must always
be the parent user namespace, or setting the mapping simply does
not make sense. Enforce that the opener of the file was in
the parent user namespace or the user namespace whose mapping
is being set.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/user_namespace.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ce92f7e..89f6eae 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -391,7 +391,7 @@ static int uid_m_show(struct seq_file *seq, void *v)
struct user_namespace *lower_ns;
uid_t lower;

- lower_ns = current_user_ns();
+ lower_ns = seq_user_ns(seq);
if ((lower_ns == ns) && lower_ns->parent)
lower_ns = lower_ns->parent;

@@ -412,7 +412,7 @@ static int gid_m_show(struct seq_file *seq, void *v)
struct user_namespace *lower_ns;
gid_t lower;

- lower_ns = current_user_ns();
+ lower_ns = seq_user_ns(seq);
if ((lower_ns == ns) && lower_ns->parent)
lower_ns = lower_ns->parent;

@@ -688,10 +688,14 @@ ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t siz
{
struct seq_file *seq = file->private_data;
struct user_namespace *ns = seq->private;
+ struct user_namespace *seq_ns = seq_user_ns(seq);

if (!ns->parent)
return -EPERM;

+ if ((seq_ns != ns) && (seq_ns != ns->parent))
+ return -EPERM;
+
return map_write(file, buf, size, ppos, CAP_SETUID,
&ns->uid_map, &ns->parent->uid_map);
}
@@ -700,10 +704,14 @@ ssize_t proc_gid_map_write(struct file *file, const char __user *buf, size_t siz
{
struct seq_file *seq = file->private_data;
struct user_namespace *ns = seq->private;
+ struct user_namespace *seq_ns = seq_user_ns(seq);

if (!ns->parent)
return -EPERM;

+ if ((seq_ns != ns) && (seq_ns != ns->parent))
+ return -EPERM;
+
return map_write(file, buf, size, ppos, CAP_SETGID,
&ns->gid_map, &ns->parent->gid_map);
}
--
1.7.5.4

2012-11-19 15:17:07

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 14/16] proc: Generalize proc inode allocation

From: "Eric W. Biederman" <[email protected]>

Generalize the proc inode allocation so that it can be
used without having to having to create a proc_dir_entry.

This will allow namespace file descriptors to remain light
weight entitities but still have the same inode number
when the backing namespace is the same.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/generic.c | 26 +++++++++++++-------------
include/linux/proc_fs.h | 10 ++++++++++
2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 0d80cef..7b3ae3c 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -350,14 +350,14 @@ static DEFINE_SPINLOCK(proc_inum_lock); /* protects the above */
* Return an inode number between PROC_DYNAMIC_FIRST and
* 0xffffffff, or zero on failure.
*/
-static unsigned int get_inode_number(void)
+int proc_alloc_inum(unsigned int *inum)
{
unsigned int i;
int error;

retry:
- if (ida_pre_get(&proc_inum_ida, GFP_KERNEL) == 0)
- return 0;
+ if (!ida_pre_get(&proc_inum_ida, GFP_KERNEL))
+ return -ENOMEM;

spin_lock(&proc_inum_lock);
error = ida_get_new(&proc_inum_ida, &i);
@@ -365,18 +365,19 @@ retry:
if (error == -EAGAIN)
goto retry;
else if (error)
- return 0;
+ return error;

if (i > UINT_MAX - PROC_DYNAMIC_FIRST) {
spin_lock(&proc_inum_lock);
ida_remove(&proc_inum_ida, i);
spin_unlock(&proc_inum_lock);
- return 0;
+ return -ENOSPC;
}
- return PROC_DYNAMIC_FIRST + i;
+ *inum = PROC_DYNAMIC_FIRST + i;
+ return 0;
}

-static void release_inode_number(unsigned int inum)
+void proc_free_inum(unsigned int inum)
{
spin_lock(&proc_inum_lock);
ida_remove(&proc_inum_ida, inum - PROC_DYNAMIC_FIRST);
@@ -554,13 +555,12 @@ static const struct inode_operations proc_dir_inode_operations = {

static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp)
{
- unsigned int i;
struct proc_dir_entry *tmp;
+ int ret;

- i = get_inode_number();
- if (i == 0)
- return -EAGAIN;
- dp->low_ino = i;
+ ret = proc_alloc_inum(&dp->low_ino);
+ if (ret)
+ return ret;

if (S_ISDIR(dp->mode)) {
if (dp->proc_iops == NULL) {
@@ -764,7 +764,7 @@ EXPORT_SYMBOL(proc_create_data);

static void free_proc_entry(struct proc_dir_entry *de)
{
- release_inode_number(de->low_ino);
+ proc_free_inum(de->low_ino);

if (S_ISLNK(de->mode))
kfree(de->data);
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 3144781..bf1d000 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -176,6 +176,8 @@ extern struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
extern struct file *proc_ns_fget(int fd);
extern bool proc_ns_inode(struct inode *inode);

+extern int proc_alloc_inum(unsigned int *pino);
+extern void proc_free_inum(unsigned int inum);
#else

#define proc_net_fops_create(net, name, mode, fops) ({ (void)(mode), NULL; })
@@ -235,6 +237,14 @@ static inline bool proc_ns_inode(struct inode *inode)
return false;
}

+static inline int proc_alloc_inum(unsigned int *inum)
+{
+ *inum = 1;
+ return 0;
+}
+static inline void proc_free_inum(unsigned int inum)
+{
+}
#endif /* CONFIG_PROC_FS */

#if !defined(CONFIG_PROC_KCORE)
--
1.7.5.4

2012-11-19 15:17:55

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 11/16] procfs: Print task uids and gids in the userns that opened the proc file

From: "Eric W. Biederman" <[email protected]>

Instead of using current_userns() use the userns of the opener
of the file so that if the file is passed between processes
the contents of the file do not change.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/array.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index c1c207c..5544342 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -162,7 +162,7 @@ static inline const char *get_task_state(struct task_struct *tsk)
static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *p)
{
- struct user_namespace *user_ns = current_user_ns();
+ struct user_namespace *user_ns = seq_user_ns(m);
struct group_info *group_info;
int g;
struct fdtable *fdt = NULL;
--
1.7.5.4

2012-11-19 15:17:57

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 13/16] userns: Allow unprivilged mounts of proc and sysfs

From: "Eric W. Biederman" <[email protected]>

- The context in which proc and sysfs are mounted have no
effect on the the uid/gid of their files so no conversion is
needed except allowing the mount.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/root.c | 1 +
fs/sysfs/mount.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index f2f2511..c6e9fac 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,6 +145,7 @@ static struct file_system_type proc_fs_type = {
.name = "proc",
.mount = proc_mount,
.kill_sb = proc_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

void __init proc_root_init(void)
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 71eb7e2..db940a9 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -149,6 +149,7 @@ static struct file_system_type sysfs_fs_type = {
.name = "sysfs",
.mount = sysfs_mount,
.kill_sb = sysfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

int __init sysfs_init(void)
--
1.7.5.4

2012-11-19 15:17:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 15/16] proc: Fix the namespace inode permission checks.

From: "Eric W. Biederman" <[email protected]>

Change the proc namespace files into symlinks so that
we won't cache the dentries for the namespace files
which can bypass the ptrace_may_access checks.

To support the symlinks create an additional namespace
inode with it's own set of operations distinct from the
proc pid inode and dentry methods as those no longer
make sense.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/inode.c | 6 +-
fs/proc/namespaces.c | 169 +++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 152 insertions(+), 23 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 3b22bbd..439ae688 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -31,6 +31,7 @@ static void proc_evict_inode(struct inode *inode)
struct proc_dir_entry *de;
struct ctl_table_header *head;
const struct proc_ns_operations *ns_ops;
+ void *ns;

truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
@@ -49,8 +50,9 @@ static void proc_evict_inode(struct inode *inode)
}
/* Release any associated namespace */
ns_ops = PROC_I(inode)->ns_ops;
- if (ns_ops && ns_ops->put)
- ns_ops->put(PROC_I(inode)->ns);
+ ns = PROC_I(inode)->ns;
+ if (ns_ops && ns)
+ ns_ops->put(ns);
}

static struct kmem_cache * proc_inode_cachep;
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 030250c..7a6d8d6 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -38,6 +38,151 @@ static const struct file_operations ns_file_operations = {
.llseek = no_llseek,
};

+static const struct inode_operations ns_inode_operations = {
+ .setattr = proc_setattr,
+};
+
+static int ns_delete_dentry(const struct dentry *dentry)
+{
+ /* Don't cache namespace inodes when not in use */
+ return 1;
+}
+
+static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+ struct inode *inode = dentry->d_inode;
+ const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
+
+ return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
+ ns_ops->name, inode->i_ino);
+}
+
+const struct dentry_operations ns_dentry_operations =
+{
+ .d_delete = ns_delete_dentry,
+ .d_dname = ns_dname,
+};
+
+static struct dentry *proc_ns_get_dentry(struct super_block *sb,
+ struct task_struct *task, const struct proc_ns_operations *ns_ops)
+{
+ struct dentry *dentry, *result;
+ struct inode *inode;
+ struct proc_inode *ei;
+ struct qstr qname = { .name = "", };
+ void *ns;
+
+ ns = ns_ops->get(task);
+ if (!ns)
+ return ERR_PTR(-ENOENT);
+
+ dentry = d_alloc_pseudo(sb, &qname);
+ if (!dentry) {
+ ns_ops->put(ns);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ inode = new_inode(sb);
+ if (!inode) {
+ dput(dentry);
+ ns_ops->put(ns);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ ei = PROC_I(inode);
+ inode->i_ino = get_next_ino();
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ inode->i_op = &ns_inode_operations;
+ inode->i_mode = S_IFREG | S_IRUGO;
+ inode->i_fop = &ns_file_operations;
+ ei->ns_ops = ns_ops;
+ ei->ns = ns;
+
+ d_set_d_op(dentry, &ns_dentry_operations);
+ result = d_instantiate_unique(dentry, inode);
+ if (result) {
+ dput(dentry);
+ dentry = result;
+ }
+
+ return dentry;
+}
+
+static void *proc_ns_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ struct inode *inode = dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
+ struct proc_inode *ei = PROC_I(inode);
+ struct task_struct *task;
+ struct dentry *ns_dentry;
+ void *error = ERR_PTR(-EACCES);
+
+ task = get_proc_task(inode);
+ if (!task)
+ goto out;
+
+ if (!ptrace_may_access(task, PTRACE_MODE_READ))
+ goto out_put_task;
+
+ ns_dentry = proc_ns_get_dentry(sb, task, ei->ns_ops);
+ if (IS_ERR(ns_dentry)) {
+ error = ERR_CAST(ns_dentry);
+ goto out_put_task;
+ }
+
+ dput(nd->path.dentry);
+ nd->path.dentry = ns_dentry;
+ error = NULL;
+
+out_put_task:
+ put_task_struct(task);
+out:
+ return error;
+}
+
+static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int buflen)
+{
+ struct inode *inode = dentry->d_inode;
+ struct proc_inode *ei = PROC_I(inode);
+ const struct proc_ns_operations *ns_ops = ei->ns_ops;
+ struct task_struct *task;
+ void *ns;
+ char name[50];
+ int len = -EACCES;
+
+ task = get_proc_task(inode);
+ if (!task)
+ goto out;
+
+ if (!ptrace_may_access(task, PTRACE_MODE_READ))
+ goto out_put_task;
+
+ len = -ENOENT;
+ ns = ns_ops->get(task);
+ if (!ns)
+ goto out_put_task;
+
+ snprintf(name, sizeof(name), "%s", ns_ops->name);
+ len = strlen(name);
+
+ if (len > buflen)
+ len = buflen;
+ if (copy_to_user(buffer, ns_ops->name, len))
+ len = -EFAULT;
+
+ ns_ops->put(ns);
+out_put_task:
+ put_task_struct(task);
+out:
+ return len;
+}
+
+static const struct inode_operations proc_ns_link_inode_operations = {
+ .readlink = proc_ns_readlink,
+ .follow_link = proc_ns_follow_link,
+ .setattr = proc_setattr,
+};
+
static struct dentry *proc_ns_instantiate(struct inode *dir,
struct dentry *dentry, struct task_struct *task, const void *ptr)
{
@@ -45,21 +190,15 @@ static struct dentry *proc_ns_instantiate(struct inode *dir,
struct inode *inode;
struct proc_inode *ei;
struct dentry *error = ERR_PTR(-ENOENT);
- void *ns;

inode = proc_pid_make_inode(dir->i_sb, task);
if (!inode)
goto out;

- ns = ns_ops->get(task);
- if (!ns)
- goto out_iput;
-
ei = PROC_I(inode);
- inode->i_mode = S_IFREG|S_IRUSR;
- inode->i_fop = &ns_file_operations;
- ei->ns_ops = ns_ops;
- ei->ns = ns;
+ inode->i_mode = S_IFLNK|S_IRWXUGO;
+ inode->i_op = &proc_ns_link_inode_operations;
+ ei->ns_ops = ns_ops;

d_set_d_op(dentry, &pid_dentry_operations);
d_add(dentry, inode);
@@ -68,9 +207,6 @@ static struct dentry *proc_ns_instantiate(struct inode *dir,
error = NULL;
out:
return error;
-out_iput:
- iput(inode);
- goto out;
}

static int proc_ns_fill_cache(struct file *filp, void *dirent,
@@ -97,10 +233,6 @@ static int proc_ns_dir_readdir(struct file *filp, void *dirent,
if (!task)
goto out_no_task;

- ret = -EPERM;
- if (!ptrace_may_access(task, PTRACE_MODE_READ))
- goto out;
-
ret = 0;
i = filp->f_pos;
switch (i) {
@@ -160,10 +292,6 @@ static struct dentry *proc_ns_dir_lookup(struct inode *dir,
if (!task)
goto out_no_task;

- error = ERR_PTR(-EPERM);
- if (!ptrace_may_access(task, PTRACE_MODE_READ))
- goto out;
-
last = &ns_entries[ARRAY_SIZE(ns_entries)];
for (entry = ns_entries; entry < last; entry++) {
if (strlen((*entry)->name) != len)
@@ -171,7 +299,6 @@ static struct dentry *proc_ns_dir_lookup(struct inode *dir,
if (!memcmp(dentry->d_name.name, (*entry)->name, len))
break;
}
- error = ERR_PTR(-ENOENT);
if (entry == last)
goto out;

--
1.7.5.4

2012-11-19 15:18:59

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH review 16/16] proc: Usable inode numbers for the namespace file descriptors.

From: "Eric W. Biederman" <[email protected]>

Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.

A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.

This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.

We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.

I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 1 +
fs/namespace.c | 14 ++++++++++++++
fs/proc/namespaces.c | 24 ++++++++++++++----------
include/linux/ipc_namespace.h | 2 ++
include/linux/pid_namespace.h | 1 +
include/linux/proc_fs.h | 7 ++++++-
include/linux/user_namespace.h | 1 +
include/linux/utsname.h | 1 +
include/net/net_namespace.h | 2 ++
init/version.c | 2 ++
ipc/msgutil.c | 2 ++
ipc/namespace.c | 16 ++++++++++++++++
kernel/pid.c | 1 +
kernel/pid_namespace.c | 12 ++++++++++++
kernel/user.c | 2 ++
kernel/user_namespace.c | 15 +++++++++++++++
kernel/utsname.c | 17 ++++++++++++++++-
net/core/net_namespace.c | 24 ++++++++++++++++++++++++
18 files changed, 132 insertions(+), 12 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 630fafc..cd50079 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -4,6 +4,7 @@

struct mnt_namespace {
atomic_t count;
+ unsigned int proc_inum;
struct mount * root;
struct list_head list;
struct user_namespace *user_ns;
diff --git a/fs/namespace.c b/fs/namespace.c
index cab78a7..c1bbe86 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2301,6 +2301,7 @@ dput_out:

static void free_mnt_ns(struct mnt_namespace *ns)
{
+ proc_free_inum(ns->proc_inum);
put_user_ns(ns->user_ns);
kfree(ns);
}
@@ -2317,10 +2318,16 @@ static atomic64_t mnt_ns_seq = ATOMIC64_INIT(1);
static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
{
struct mnt_namespace *new_ns;
+ int ret;

new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
if (!new_ns)
return ERR_PTR(-ENOMEM);
+ ret = proc_alloc_inum(&new_ns->proc_inum);
+ if (ret) {
+ kfree(new_ns);
+ return ERR_PTR(ret);
+ }
new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
atomic_set(&new_ns->count, 1);
new_ns->root = NULL;
@@ -2799,10 +2806,17 @@ static int mntns_install(struct nsproxy *nsproxy, void *ns)
return 0;
}

+static unsigned int mntns_inum(void *ns)
+{
+ struct mnt_namespace *mnt_ns = ns;
+ return mnt_ns->proc_inum;
+}
+
const struct proc_ns_operations mntns_operations = {
.name = "mnt",
.type = CLONE_NEWNS,
.get = mntns_get,
.put = mntns_put,
.install = mntns_install,
+ .inum = mntns_inum,
};
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 7a6d8d6..b7a4719 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -82,7 +82,7 @@ static struct dentry *proc_ns_get_dentry(struct super_block *sb,
return ERR_PTR(-ENOMEM);
}

- inode = new_inode(sb);
+ inode = iget_locked(sb, ns_ops->inum(ns));
if (!inode) {
dput(dentry);
ns_ops->put(ns);
@@ -90,13 +90,17 @@ static struct dentry *proc_ns_get_dentry(struct super_block *sb,
}

ei = PROC_I(inode);
- inode->i_ino = get_next_ino();
- inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
- inode->i_op = &ns_inode_operations;
- inode->i_mode = S_IFREG | S_IRUGO;
- inode->i_fop = &ns_file_operations;
- ei->ns_ops = ns_ops;
- ei->ns = ns;
+ if (inode->i_state & I_NEW) {
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ inode->i_op = &ns_inode_operations;
+ inode->i_mode = S_IFREG | S_IRUGO;
+ inode->i_fop = &ns_file_operations;
+ ei->ns_ops = ns_ops;
+ ei->ns = ns;
+ unlock_new_inode(inode);
+ } else {
+ ns_ops->put(ns);
+ }

d_set_d_op(dentry, &ns_dentry_operations);
result = d_instantiate_unique(dentry, inode);
@@ -162,12 +166,12 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
if (!ns)
goto out_put_task;

- snprintf(name, sizeof(name), "%s", ns_ops->name);
+ snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
len = strlen(name);

if (len > buflen)
len = buflen;
- if (copy_to_user(buffer, ns_ops->name, len))
+ if (copy_to_user(buffer, name, len))
len = -EFAULT;

ns_ops->put(ns);
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index f03af70..fe77197 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -67,6 +67,8 @@ struct ipc_namespace {

/* user_ns which owns the ipc ns */
struct user_namespace *user_ns;
+
+ unsigned int proc_inum;
};

extern struct ipc_namespace init_ipc_ns;
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 4c96acd..bf28599 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -37,6 +37,7 @@ struct pid_namespace {
kgid_t pid_gid;
int hide_pid;
int reboot; /* group exit code if this pidns was rebooted */
+ unsigned int proc_inum;
};

extern struct pid_namespace init_pid_ns;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index bf1d000..2e24018 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -28,7 +28,11 @@ struct mm_struct;
*/

enum {
- PROC_ROOT_INO = 1,
+ PROC_ROOT_INO = 1,
+ PROC_IPC_INIT_INO = 0xEFFFFFFFU,
+ PROC_UTS_INIT_INO = 0xEFFFFFFEU,
+ PROC_USER_INIT_INO = 0xEFFFFFFDU,
+ PROC_PID_INIT_INO = 0xEFFFFFFCU,
};

/*
@@ -263,6 +267,7 @@ struct proc_ns_operations {
void *(*get)(struct task_struct *task);
void (*put)(void *ns);
int (*install)(struct nsproxy *nsproxy, void *ns);
+ unsigned int (*inum)(void *ns);
};
extern const struct proc_ns_operations netns_operations;
extern const struct proc_ns_operations utsns_operations;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 17651f0..b9bd2e6 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -25,6 +25,7 @@ struct user_namespace {
struct user_namespace *parent;
kuid_t owner;
kgid_t group;
+ unsigned int proc_inum;
};

extern struct user_namespace init_user_ns;
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 221f4a0..239e277 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -23,6 +23,7 @@ struct uts_namespace {
struct kref kref;
struct new_utsname name;
struct user_namespace *user_ns;
+ unsigned int proc_inum;
};
extern struct uts_namespace init_uts_ns;

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index c5a43f5..de644bc 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -56,6 +56,8 @@ struct net {

struct user_namespace *user_ns; /* Owning user namespace */

+ unsigned int proc_inum;
+
struct proc_dir_entry *proc_net;
struct proc_dir_entry *proc_net_stat;

diff --git a/init/version.c b/init/version.c
index 86fe0cc..58170f1 100644
--- a/init/version.c
+++ b/init/version.c
@@ -12,6 +12,7 @@
#include <linux/utsname.h>
#include <generated/utsrelease.h>
#include <linux/version.h>
+#include <linux/proc_fs.h>

#ifndef CONFIG_KALLSYMS
#define version(a) Version_ ## a
@@ -34,6 +35,7 @@ struct uts_namespace init_uts_ns = {
.domainname = UTS_DOMAINNAME,
},
.user_ns = &init_user_ns,
+ .proc_inum = PROC_UTS_INIT_INO,
};
EXPORT_SYMBOL_GPL(init_uts_ns);

diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index 26143d3..6471f1b 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -16,6 +16,7 @@
#include <linux/msg.h>
#include <linux/ipc_namespace.h>
#include <linux/utsname.h>
+#include <linux/proc_fs.h>
#include <asm/uaccess.h>

#include "util.h"
@@ -30,6 +31,7 @@ DEFINE_SPINLOCK(mq_lock);
struct ipc_namespace init_ipc_ns = {
.count = ATOMIC_INIT(1),
.user_ns = &init_user_ns,
+ .proc_inum = PROC_IPC_INIT_INO,
};

atomic_t nr_ipc_ns = ATOMIC_INIT(1);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 72c8682..cf3386a 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -26,9 +26,16 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
if (ns == NULL)
return ERR_PTR(-ENOMEM);

+ err = proc_alloc_inum(&ns->proc_inum);
+ if (err) {
+ kfree(ns);
+ return ERR_PTR(err);
+ }
+
atomic_set(&ns->count, 1);
err = mq_init_ns(ns);
if (err) {
+ proc_free_inum(ns->proc_inum);
kfree(ns);
return ERR_PTR(err);
}
@@ -111,6 +118,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
*/
ipcns_notify(IPCNS_REMOVED);
put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
kfree(ns);
}

@@ -172,10 +180,18 @@ static int ipcns_install(struct nsproxy *nsproxy, void *new)
return 0;
}

+static unsigned int ipcns_inum(void *vp)
+{
+ struct ipc_namespace *ns = vp;
+
+ return ns->proc_inum;
+}
+
const struct proc_ns_operations ipcns_operations = {
.name = "ipc",
.type = CLONE_NEWIPC,
.get = ipcns_get,
.put = ipcns_put,
.install = ipcns_install,
+ .inum = ipcns_inum,
};
diff --git a/kernel/pid.c b/kernel/pid.c
index 6e8da29..3026dda 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -80,6 +80,7 @@ struct pid_namespace init_pid_ns = {
.level = 0,
.child_reaper = &init_task,
.user_ns = &init_user_ns,
+ .proc_inum = PROC_PID_INIT_INO,
};
EXPORT_SYMBOL_GPL(init_pid_ns);

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 68508d3..560da0d 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -107,6 +107,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
if (ns->pid_cachep == NULL)
goto out_free_map;

+ err = proc_alloc_inum(&ns->proc_inum);
+ if (err)
+ goto out_free_map;
+
kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
@@ -133,6 +137,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
{
int i;

+ proc_free_inum(ns->proc_inum);
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
put_user_ns(ns->user_ns);
@@ -345,12 +350,19 @@ static int pidns_install(struct nsproxy *nsproxy, void *ns)
return 0;
}

+static unsigned int pidns_inum(void *ns)
+{
+ struct pid_namespace *pid_ns = ns;
+ return pid_ns->proc_inum;
+}
+
const struct proc_ns_operations pidns_operations = {
.name = "pid",
.type = CLONE_NEWPID,
.get = pidns_get,
.put = pidns_put,
.install = pidns_install,
+ .inum = pidns_inum,
};

static __init int pid_namespaces_init(void)
diff --git a/kernel/user.c b/kernel/user.c
index 750acff..33acb5e 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -16,6 +16,7 @@
#include <linux/interrupt.h>
#include <linux/export.h>
#include <linux/user_namespace.h>
+#include <linux/proc_fs.h>

/*
* userns count is 1 for root user, 1 for init_uts_ns,
@@ -51,6 +52,7 @@ struct user_namespace init_user_ns = {
},
.owner = GLOBAL_ROOT_UID,
.group = GLOBAL_ROOT_GID,
+ .proc_inum = PROC_USER_INIT_INO,
};
EXPORT_SYMBOL_GPL(init_user_ns);

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 89f6eae..f5975cc 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -58,6 +58,7 @@ int create_user_ns(struct cred *new)
struct user_namespace *ns, *parent_ns = new->user_ns;
kuid_t owner = new->euid;
kgid_t group = new->egid;
+ int ret;

/* The creator needs a mapping in the parent user namespace
* or else we won't be able to reasonably tell userspace who
@@ -71,6 +72,12 @@ int create_user_ns(struct cred *new)
if (!ns)
return -ENOMEM;

+ ret = proc_alloc_inum(&ns->proc_inum);
+ if (ret) {
+ kmem_cache_free(user_ns_cachep, ns);
+ return ret;
+ }
+
kref_init(&ns->kref);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
@@ -103,6 +110,7 @@ void free_user_ns(struct kref *kref)
container_of(kref, struct user_namespace, kref);

parent = ns->parent;
+ proc_free_inum(ns->proc_inum);
kmem_cache_free(user_ns_cachep, ns);
put_user_ns(parent);
}
@@ -808,12 +816,19 @@ static int userns_install(struct nsproxy *nsproxy, void *ns)
return commit_creds(cred);
}

+static unsigned int userns_inum(void *ns)
+{
+ struct user_namespace *user_ns = ns;
+ return user_ns->proc_inum;
+}
+
const struct proc_ns_operations userns_operations = {
.name = "user",
.type = CLONE_NEWUSER,
.get = userns_get,
.put = userns_put,
.install = userns_install,
+ .inum = userns_inum,
};

static __init int user_namespaces_init(void)
diff --git a/kernel/utsname.c b/kernel/utsname.c
index fdc619e..f6336d5 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -36,11 +36,18 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
struct uts_namespace *old_ns)
{
struct uts_namespace *ns;
+ int err;

ns = create_uts_ns();
if (!ns)
return ERR_PTR(-ENOMEM);

+ err = proc_alloc_inum(&ns->proc_inum);
+ if (err) {
+ kfree(ns);
+ return ERR_PTR(err);
+ }
+
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
ns->user_ns = get_user_ns(user_ns);
@@ -77,6 +84,7 @@ void free_uts_ns(struct kref *kref)

ns = container_of(kref, struct uts_namespace, kref);
put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
kfree(ns);
}

@@ -114,11 +122,18 @@ static int utsns_install(struct nsproxy *nsproxy, void *new)
return 0;
}

+static unsigned int utsns_inum(void *vp)
+{
+ struct uts_namespace *ns = vp;
+
+ return ns->proc_inum;
+}
+
const struct proc_ns_operations utsns_operations = {
.name = "uts",
.type = CLONE_NEWUTS,
.get = utsns_get,
.put = utsns_put,
.install = utsns_install,
+ .inum = utsns_inum,
};
-
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index ec2870b..2e9a313 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -381,6 +381,21 @@ struct net *get_net_ns_by_pid(pid_t pid)
}
EXPORT_SYMBOL_GPL(get_net_ns_by_pid);

+static __net_init int net_ns_net_init(struct net *net)
+{
+ return proc_alloc_inum(&net->proc_inum);
+}
+
+static __net_exit void net_ns_net_exit(struct net *net)
+{
+ proc_free_inum(net->proc_inum);
+}
+
+static struct pernet_operations __net_initdata net_ns_ops = {
+ .init = net_ns_net_init,
+ .exit = net_ns_net_exit,
+};
+
static int __init net_ns_init(void)
{
struct net_generic *ng;
@@ -412,6 +427,8 @@ static int __init net_ns_init(void)

mutex_unlock(&net_mutex);

+ register_pernet_subsys(&net_ns_ops);
+
return 0;
}

@@ -640,11 +657,18 @@ static int netns_install(struct nsproxy *nsproxy, void *ns)
return 0;
}

+static unsigned int netns_inum(void *ns)
+{
+ struct net *net = ns;
+ return net->proc_inum;
+}
+
const struct proc_ns_operations netns_operations = {
.name = "net",
.type = CLONE_NEWNET,
.get = netns_get,
.put = netns_put,
.install = netns_install,
+ .inum = netns_inum,
};
#endif
--
1.7.5.4

2012-11-19 17:50:03

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH review 03/16] userns: Allow chown and setgid preservation

Quoting Eric W. Biederman ([email protected]):
> From: "Eric W. Biederman" <[email protected]>
>
> - Allow chown if CAP_CHOWN is present in the current user namespace
> and the uid of the inode maps into the current user namespace, and
> the destination uid or gid maps into the current user namespace.
>
> - Allow perserving setgid when changing an inode if CAP_FSETID is
> present in the current user namespace and the owner of the file has
> a mapping into the current user namespace.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Acked-by: Serge E. Hallyn <[email protected]>

> ---
> fs/attr.c | 11 +++++++----
> 1 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/fs/attr.c b/fs/attr.c
> index cce7df5..1449adb 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -49,14 +49,15 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
> /* Make sure a caller can chown. */
> if ((ia_valid & ATTR_UID) &&
> (!uid_eq(current_fsuid(), inode->i_uid) ||
> - !uid_eq(attr->ia_uid, inode->i_uid)) && !capable(CAP_CHOWN))
> + !uid_eq(attr->ia_uid, inode->i_uid)) &&
> + !inode_capable(inode, CAP_CHOWN))
> return -EPERM;
>
> /* Make sure caller can chgrp. */
> if ((ia_valid & ATTR_GID) &&
> (!uid_eq(current_fsuid(), inode->i_uid) ||
> (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> - !capable(CAP_CHOWN))
> + !inode_capable(inode, CAP_CHOWN))
> return -EPERM;
>
> /* Make sure a caller can chmod. */
> @@ -65,7 +66,8 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
> return -EPERM;
> /* Also check the setgid bit! */
> if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
> - inode->i_gid) && !capable(CAP_FSETID))
> + inode->i_gid) &&
> + !inode_capable(inode, CAP_FSETID))
> attr->ia_mode &= ~S_ISGID;
> }
>
> @@ -157,7 +159,8 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
> if (ia_valid & ATTR_MODE) {
> umode_t mode = attr->ia_mode;
>
> - if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
> + if (!in_group_p(inode->i_gid) &&
> + !inode_capable(inode, CAP_FSETID))
> mode &= ~S_ISGID;
> inode->i_mode = mode;
> }
> --
> 1.7.5.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2012-11-19 17:59:07

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH review 11/16] procfs: Print task uids and gids in the userns that opened the proc file

Quoting Eric W. Biederman ([email protected]):
> From: "Eric W. Biederman" <[email protected]>
>
> Instead of using current_userns() use the userns of the opener
> of the file so that if the file is passed between processes
> the contents of the file do not change.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Acked-by: Serge E. Hallyn <[email protected]>

> ---
> fs/proc/array.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index c1c207c..5544342 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -162,7 +162,7 @@ static inline const char *get_task_state(struct task_struct *tsk)
> static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> struct pid *pid, struct task_struct *p)
> {
> - struct user_namespace *user_ns = current_user_ns();
> + struct user_namespace *user_ns = seq_user_ns(m);
> struct group_info *group_info;
> int g;
> struct fdtable *fdt = NULL;
> --
> 1.7.5.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2012-11-19 18:03:35

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH review 12/16] userns: For /proc/self/{uid, gid}_map derive the lower userns from the struct file

Quoting Eric W. Biederman ([email protected]):
> From: "Eric W. Biederman" <[email protected]>
>
> To keep things sane in the context of file descriptor passing derive the
> user namespace that uids are mapped into from the opener of the file
> instead of from current.
>
> When writing to the maps file the lower user namespace must always
> be the parent user namespace, or setting the mapping simply does
> not make sense. Enforce that the opener of the file was in
> the parent user namespace or the user namespace whose mapping
> is being set.

Is there a reasonable use case for writing from the ns whose mapping
is being set? Are you expecting cases where the child opens the file
and passes it back to the parent to set the mappings?

> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/user_namespace.c | 12 ++++++++++--
> 1 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index ce92f7e..89f6eae 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -391,7 +391,7 @@ static int uid_m_show(struct seq_file *seq, void *v)
> struct user_namespace *lower_ns;
> uid_t lower;
>
> - lower_ns = current_user_ns();
> + lower_ns = seq_user_ns(seq);
> if ((lower_ns == ns) && lower_ns->parent)
> lower_ns = lower_ns->parent;
>
> @@ -412,7 +412,7 @@ static int gid_m_show(struct seq_file *seq, void *v)
> struct user_namespace *lower_ns;
> gid_t lower;
>
> - lower_ns = current_user_ns();
> + lower_ns = seq_user_ns(seq);
> if ((lower_ns == ns) && lower_ns->parent)
> lower_ns = lower_ns->parent;
>
> @@ -688,10 +688,14 @@ ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t siz
> {
> struct seq_file *seq = file->private_data;
> struct user_namespace *ns = seq->private;
> + struct user_namespace *seq_ns = seq_user_ns(seq);
>
> if (!ns->parent)
> return -EPERM;
>
> + if ((seq_ns != ns) && (seq_ns != ns->parent))
> + return -EPERM;
> +
> return map_write(file, buf, size, ppos, CAP_SETUID,
> &ns->uid_map, &ns->parent->uid_map);
> }
> @@ -700,10 +704,14 @@ ssize_t proc_gid_map_write(struct file *file, const char __user *buf, size_t siz
> {
> struct seq_file *seq = file->private_data;
> struct user_namespace *ns = seq->private;
> + struct user_namespace *seq_ns = seq_user_ns(seq);
>
> if (!ns->parent)
> return -EPERM;
>
> + if ((seq_ns != ns) && (seq_ns != ns->parent))
> + return -EPERM;
> +
> return map_write(file, buf, size, ppos, CAP_SETGID,
> &ns->gid_map, &ns->parent->gid_map);
> }
> --
> 1.7.5.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2012-11-19 18:05:11

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH review 14/16] proc: Generalize proc inode allocation

Quoting Eric W. Biederman ([email protected]):
> From: "Eric W. Biederman" <[email protected]>
>
> Generalize the proc inode allocation so that it can be
> used without having to having to create a proc_dir_entry.
>
> This will allow namespace file descriptors to remain light
> weight entitities but still have the same inode number
> when the backing namespace is the same.
>
> Signed-off-by: Eric W. Biederman <[email protected]>

Acked-by: Serge E. Hallyn <[email protected]>

> ---
> fs/proc/generic.c | 26 +++++++++++++-------------
> include/linux/proc_fs.h | 10 ++++++++++
> 2 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 0d80cef..7b3ae3c 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -350,14 +350,14 @@ static DEFINE_SPINLOCK(proc_inum_lock); /* protects the above */
> * Return an inode number between PROC_DYNAMIC_FIRST and
> * 0xffffffff, or zero on failure.
> */
> -static unsigned int get_inode_number(void)
> +int proc_alloc_inum(unsigned int *inum)
> {
> unsigned int i;
> int error;
>
> retry:
> - if (ida_pre_get(&proc_inum_ida, GFP_KERNEL) == 0)
> - return 0;
> + if (!ida_pre_get(&proc_inum_ida, GFP_KERNEL))
> + return -ENOMEM;
>
> spin_lock(&proc_inum_lock);
> error = ida_get_new(&proc_inum_ida, &i);
> @@ -365,18 +365,19 @@ retry:
> if (error == -EAGAIN)
> goto retry;
> else if (error)
> - return 0;
> + return error;
>
> if (i > UINT_MAX - PROC_DYNAMIC_FIRST) {
> spin_lock(&proc_inum_lock);
> ida_remove(&proc_inum_ida, i);
> spin_unlock(&proc_inum_lock);
> - return 0;
> + return -ENOSPC;
> }
> - return PROC_DYNAMIC_FIRST + i;
> + *inum = PROC_DYNAMIC_FIRST + i;
> + return 0;
> }
>
> -static void release_inode_number(unsigned int inum)
> +void proc_free_inum(unsigned int inum)
> {
> spin_lock(&proc_inum_lock);
> ida_remove(&proc_inum_ida, inum - PROC_DYNAMIC_FIRST);
> @@ -554,13 +555,12 @@ static const struct inode_operations proc_dir_inode_operations = {
>
> static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp)
> {
> - unsigned int i;
> struct proc_dir_entry *tmp;
> + int ret;
>
> - i = get_inode_number();
> - if (i == 0)
> - return -EAGAIN;
> - dp->low_ino = i;
> + ret = proc_alloc_inum(&dp->low_ino);
> + if (ret)
> + return ret;
>
> if (S_ISDIR(dp->mode)) {
> if (dp->proc_iops == NULL) {
> @@ -764,7 +764,7 @@ EXPORT_SYMBOL(proc_create_data);
>
> static void free_proc_entry(struct proc_dir_entry *de)
> {
> - release_inode_number(de->low_ino);
> + proc_free_inum(de->low_ino);
>
> if (S_ISLNK(de->mode))
> kfree(de->data);
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 3144781..bf1d000 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -176,6 +176,8 @@ extern struct proc_dir_entry *proc_net_mkdir(struct net *net, const char *name,
> extern struct file *proc_ns_fget(int fd);
> extern bool proc_ns_inode(struct inode *inode);
>
> +extern int proc_alloc_inum(unsigned int *pino);
> +extern void proc_free_inum(unsigned int inum);
> #else
>
> #define proc_net_fops_create(net, name, mode, fops) ({ (void)(mode), NULL; })
> @@ -235,6 +237,14 @@ static inline bool proc_ns_inode(struct inode *inode)
> return false;
> }
>
> +static inline int proc_alloc_inum(unsigned int *inum)
> +{
> + *inum = 1;
> + return 0;
> +}
> +static inline void proc_free_inum(unsigned int inum)
> +{
> +}
> #endif /* CONFIG_PROC_FS */
>
> #if !defined(CONFIG_PROC_KCORE)
> --
> 1.7.5.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2012-11-19 18:30:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH review 12/16] userns: For /proc/self/{uid, gid}_map derive the lower userns from the struct file

Serge Hallyn <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>> From: "Eric W. Biederman" <[email protected]>
>>
>> To keep things sane in the context of file descriptor passing derive the
>> user namespace that uids are mapped into from the opener of the file
>> instead of from current.
>>
>> When writing to the maps file the lower user namespace must always
>> be the parent user namespace, or setting the mapping simply does
>> not make sense. Enforce that the opener of the file was in
>> the parent user namespace or the user namespace whose mapping
>> is being set.
>
> Is there a reasonable use case for writing from the ns whose mapping
> is being set? Are you expecting cases where the child opens the file
> and passes it back to the parent to set the mappings?

Passing the open mappings file no. Although by using seq_user_ns I do
make certain the semantics are correct if the file descriptor is passed,
but I did that on general principles.

I expect a process in the user namespace to be able to meaningfully set
the mapping to some the current uid and the current gid.

I expect a process outside of the user namespace (the parent) to be able
to meaningfully set the mapping for a range of uids and gids.

The additional error cases are simply to deny cases that are not
currently handled in the code. Like what happens if the global root
tries to set the mapping for a process that is a user namespace 3 layers
deep and creating a 4th layer. For reads that combination can be
supported but for writes you have to write the uid in the new user
namespace and the uid in the user namespace just below or else it can
not be verified that there is a mapping in all parent user namespaces.

Eric

>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> kernel/user_namespace.c | 12 ++++++++++--
>> 1 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index ce92f7e..89f6eae 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -391,7 +391,7 @@ static int uid_m_show(struct seq_file *seq, void *v)
>> struct user_namespace *lower_ns;
>> uid_t lower;
>>
>> - lower_ns = current_user_ns();
>> + lower_ns = seq_user_ns(seq);
>> if ((lower_ns == ns) && lower_ns->parent)
>> lower_ns = lower_ns->parent;
>>
>> @@ -412,7 +412,7 @@ static int gid_m_show(struct seq_file *seq, void *v)
>> struct user_namespace *lower_ns;
>> gid_t lower;
>>
>> - lower_ns = current_user_ns();
>> + lower_ns = seq_user_ns(seq);
>> if ((lower_ns == ns) && lower_ns->parent)
>> lower_ns = lower_ns->parent;
>>
>> @@ -688,10 +688,14 @@ ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t siz
>> {
>> struct seq_file *seq = file->private_data;
>> struct user_namespace *ns = seq->private;
>> + struct user_namespace *seq_ns = seq_user_ns(seq);
>>
>> if (!ns->parent)
>> return -EPERM;
>>
>> + if ((seq_ns != ns) && (seq_ns != ns->parent))
>> + return -EPERM;
>> +
>> return map_write(file, buf, size, ppos, CAP_SETUID,
>> &ns->uid_map, &ns->parent->uid_map);
>> }
>> @@ -700,10 +704,14 @@ ssize_t proc_gid_map_write(struct file *file, const char __user *buf, size_t siz
>> {
>> struct seq_file *seq = file->private_data;
>> struct user_namespace *ns = seq->private;
>> + struct user_namespace *seq_ns = seq_user_ns(seq);
>>
>> if (!ns->parent)
>> return -EPERM;
>>
>> + if ((seq_ns != ns) && (seq_ns != ns->parent))
>> + return -EPERM;
>> +
>> return map_write(file, buf, size, ppos, CAP_SETGID,
>> &ns->gid_map, &ns->parent->gid_map);
>> }
>> --
>> 1.7.5.4
>>
>> _______________________________________________
>> Containers mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers

2012-11-19 21:02:03

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH review 12/16] userns: For /proc/self/{uid, gid}_map derive the lower userns from the struct file

Quoting Eric W. Biederman ([email protected]):
> Serge Hallyn <[email protected]> writes:
>
> > Quoting Eric W. Biederman ([email protected]):
> >> From: "Eric W. Biederman" <[email protected]>
> >>
> >> To keep things sane in the context of file descriptor passing derive the
> >> user namespace that uids are mapped into from the opener of the file
> >> instead of from current.
> >>
> >> When writing to the maps file the lower user namespace must always
> >> be the parent user namespace, or setting the mapping simply does
> >> not make sense. Enforce that the opener of the file was in
> >> the parent user namespace or the user namespace whose mapping
> >> is being set.
> >
> > Is there a reasonable use case for writing from the ns whose mapping
> > is being set? Are you expecting cases where the child opens the file
> > and passes it back to the parent to set the mappings?
>
> Passing the open mappings file no. Although by using seq_user_ns I do
> make certain the semantics are correct if the file descriptor is passed,
> but I did that on general principles.
>
> I expect a process in the user namespace to be able to meaningfully set
> the mapping to some the current uid and the current gid.

Sorry, I think a word is missing there. To be precise (bc I haven't
thought about this much before as it's not my target goal :) you're
saying if I'm uid 1000 gid 1000, I can create a new user namespace
and, from inside that new userns (where I'm first uid/gid -1) I can
map any uid+gid in the container to 1000 in the parent ns? Or is there
something more?

It still seems to me no less flexible to require being in the parent
ns, so

> >> + if ((seq_ns != ns) && (seq_ns != ns->parent))
> >> + return -EPERM;

would become

> >> + if (seq_ns != ns->parent)
> >> + return -EPERM;

I also wonder if -EINVAL would be a more appropriate choice here.
We're trying to keep things sane, rather than saying "not allowed"
for its own sake.

-serge

2012-11-19 21:10:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH review 12/16] userns: For /proc/self/{uid, gid}_map derive the lower userns from the struct file

Serge Hallyn <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>> Serge Hallyn <[email protected]> writes:
>>
>> > Quoting Eric W. Biederman ([email protected]):
>> >> From: "Eric W. Biederman" <[email protected]>
>> >>
>> >> To keep things sane in the context of file descriptor passing derive the
>> >> user namespace that uids are mapped into from the opener of the file
>> >> instead of from current.
>> >>
>> >> When writing to the maps file the lower user namespace must always
>> >> be the parent user namespace, or setting the mapping simply does
>> >> not make sense. Enforce that the opener of the file was in
>> >> the parent user namespace or the user namespace whose mapping
>> >> is being set.
>> >
>> > Is there a reasonable use case for writing from the ns whose mapping
>> > is being set? Are you expecting cases where the child opens the file
>> > and passes it back to the parent to set the mappings?
>>
>> Passing the open mappings file no. Although by using seq_user_ns I do
>> make certain the semantics are correct if the file descriptor is passed,
>> but I did that on general principles.
>>
>> I expect a process in the user namespace to be able to meaningfully set
>> the mapping to some the current uid and the current gid.
>
> Sorry, I think a word is missing there. To be precise (bc I haven't
> thought about this much before as it's not my target goal :) you're
> saying if I'm uid 1000 gid 1000, I can create a new user namespace
> and, from inside that new userns (where I'm first uid/gid -1) I can
> map any uid+gid in the container to 1000 in the parent ns? Or is there
> something more?

Only that for now. I had once imagined magic would happen in the
background to verify the parent.

> It still seems to me no less flexible to require being in the parent
> ns, so
>
>> >> + if ((seq_ns != ns) && (seq_ns != ns->parent))
>> >> + return -EPERM;
>
> would become
>
>> >> + if (seq_ns != ns->parent)
>> >> + return -EPERM;
>

In practice when playing around it is the difference between.
unshare -U /bin/bash
echo 0 1000 1 > /proc/self/uid_map

And the need to pre-plan something. You can set the uid_map from the
parent in a shell script but it is a real pain. So for just messing
around allowing seq_ns == ns is a real advantage.

> I also wonder if -EINVAL would be a more appropriate choice here.
> We're trying to keep things sane, rather than saying "not allowed"
> for its own sake.

A different error code might be better.

Eric

2012-11-19 21:19:19

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH review 12/16] userns: For /proc/self/{uid, gid}_map derive the lower userns from the struct file

Quoting Eric W. Biederman ([email protected]):
> Serge Hallyn <[email protected]> writes:
>
> > Quoting Eric W. Biederman ([email protected]):
> >> Serge Hallyn <[email protected]> writes:
> >>
> >> > Quoting Eric W. Biederman ([email protected]):
> >> >> From: "Eric W. Biederman" <[email protected]>
> >> >>
> >> >> To keep things sane in the context of file descriptor passing derive the
> >> >> user namespace that uids are mapped into from the opener of the file
> >> >> instead of from current.
> >> >>
> >> >> When writing to the maps file the lower user namespace must always
> >> >> be the parent user namespace, or setting the mapping simply does
> >> >> not make sense. Enforce that the opener of the file was in
> >> >> the parent user namespace or the user namespace whose mapping
> >> >> is being set.
> >> >
> >> > Is there a reasonable use case for writing from the ns whose mapping
> >> > is being set? Are you expecting cases where the child opens the file
> >> > and passes it back to the parent to set the mappings?
> >>
> >> Passing the open mappings file no. Although by using seq_user_ns I do
> >> make certain the semantics are correct if the file descriptor is passed,
> >> but I did that on general principles.
> >>
> >> I expect a process in the user namespace to be able to meaningfully set
> >> the mapping to some the current uid and the current gid.
> >
> > Sorry, I think a word is missing there. To be precise (bc I haven't
> > thought about this much before as it's not my target goal :) you're
> > saying if I'm uid 1000 gid 1000, I can create a new user namespace
> > and, from inside that new userns (where I'm first uid/gid -1) I can
> > map any uid+gid in the container to 1000 in the parent ns? Or is there
> > something more?
>
> Only that for now. I had once imagined magic would happen in the
> background to verify the parent.
>
> > It still seems to me no less flexible to require being in the parent
> > ns, so
> >
> >> >> + if ((seq_ns != ns) && (seq_ns != ns->parent))
> >> >> + return -EPERM;
> >
> > would become
> >
> >> >> + if (seq_ns != ns->parent)
> >> >> + return -EPERM;
> >
>
> In practice when playing around it is the difference between.
> unshare -U /bin/bash
> echo 0 1000 1 > /proc/self/uid_map
>
> And the need to pre-plan something. You can set the uid_map from the
> parent in a shell script but it is a real pain. So for just messing
> around allowing seq_ns == ns is a real advantage.

Heh, ok - I almost always want >1 uid mapped, but I can see the
advantage.

Thanks.

I don't recall whether I put this in originally, but

Acked-by: Serge E. Hallyn <[email protected]>

> > I also wonder if -EINVAL would be a more appropriate choice here.
> > We're trying to keep things sane, rather than saying "not allowed"
> > for its own sake.
>
> A different error code might be better.

I suppose strictly speaking (looking at errno-base.h) it would be EBADF?

-serge

2012-11-19 21:27:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH review 12/16] userns: For /proc/self/{uid, gid}_map derive the lower userns from the struct file

Serge Hallyn <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>>
>> In practice when playing around it is the difference between.
>> unshare -U /bin/bash
>> echo 0 1000 1 > /proc/self/uid_map
>>
>> And the need to pre-plan something. You can set the uid_map from the
>> parent in a shell script but it is a real pain. So for just messing
>> around allowing seq_ns == ns is a real advantage.
>
> Heh, ok - I almost always want >1 uid mapped, but I can see the
> advantage.

The original plan called for an upcall and >1 uid mapped. But yeah
that is something else again.

> Thanks.
>
> I don't recall whether I put this in originally, but
>
> Acked-by: Serge E. Hallyn <[email protected]>
>
>> > I also wonder if -EINVAL would be a more appropriate choice here.
>> > We're trying to keep things sane, rather than saying "not allowed"
>> > for its own sake.
>>
>> A different error code might be better.
>
> I suppose strictly speaking (looking at errno-base.h) it would be
> EBADF?

Definitely not EBADF. EBADF is the error code for operating on a closed
file descriptor.

I want a ENOTALLOWED. Anyway.

Eric

2012-11-19 22:34:53

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH review 08/16] userns: Kill task_user_ns

On Mon, Nov 19, 2012 at 7:12 AM, Eric W. Biederman
<[email protected]> wrote:
> From: "Eric W. Biederman" <[email protected]>
>
> The task_user_ns function hides the fact that it is getting the user
> namespace from struct cred on the task. struct cred may go away as
> soon as the rcu lock is released. This leads to a race where we
> can dereference a stale user namespace pointer.
>
> To make it obvious a struct cred is involved kill task_user_ns.
>
> To kill the race modify the users of task_user_ns to only
> reference the user namespace while the rcu lock is held.
>
> Cc: Kees Cook <[email protected]>
> Cc: James Morris <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Nice catch! This is disappointingly messy looking, but I do not see
any sensible way to clean it up better than you've already done.

Acked-by: Kees Cook <[email protected]>

-Kees

--
Kees Cook
Chrome OS Security