2010-12-17 15:22:27

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC 0/5] user namespaces: start clamping down

Following is the next version of the user namespace patchset.

The core of the set is patch 2, originally conceived of and
implemented by Eric Biederman. The concept is to target
capabilities at user namespaces. A task's capabilities are
now contextualized as follows (previously, capabilities had
no context):

1. For a task in the initial user namespace, the calculated
capabilities (pi, pe, pp) are available to act upon any
user namespace.

2. For a task in a child user namespace, the calculated
capabilities are available to act only on its own or any
descendent user namespace. It has no capabilities to any
parent or unrelated user namespaces.

3. If a user A creates a new user namespace, that user has
all capabilities to that new user namespace and any of its
descendents. (Contrast this with a user namespace created
by another user B in the same user namespace, to which this
user A has only his calculated capabilities)

All existing 'capable' checks are automatically converted to
checks against the initial user namespace. This means that
by default, root in a child user namespace is powerless.
Patches 3-5 begin to enable capabilities in child user
namespaces to set hostnames, kill tasks, and do ptrace.

My near-term next goals will be to enable setuid and setgid,
and to provide a way for the filesystem to be usable in child
user namespaces. At the very least I'd like a fresh loopback
or LVM mount and proc mounts to be supported.

thanks,
-serge


2010-12-17 15:24:39

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

copy_process() handles CLONE_NEWUSER before the rest of the
namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
the new uts namespace will have the new user namespace as its
owner. That is what we want, since we want root in that new
userns to be able to have privilege over it.

Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/utsname.h | 3 +++
init/version.c | 2 ++
kernel/nsproxy.c | 3 +++
kernel/user.c | 8 ++++++--
kernel/utsname.c | 4 ++++
5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 69f3997..85171be 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -37,9 +37,12 @@ struct new_utsname {
#include <linux/nsproxy.h>
#include <linux/err.h>

+struct user_namespace;
+
struct uts_namespace {
struct kref kref;
struct new_utsname name;
+ struct user_namespace *user_ns;
};
extern struct uts_namespace init_uts_ns;

diff --git a/init/version.c b/init/version.c
index 79fb8c2..9eb19fb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
int version_string(LINUX_VERSION_CODE);
#endif

+extern struct user_namespace init_user_ns;
struct uts_namespace init_uts_ns = {
.kref = {
.refcount = ATOMIC_INIT(2),
@@ -33,6 +34,7 @@ struct uts_namespace init_uts_ns = {
.machine = UTS_MACHINE,
.domainname = UTS_DOMAINNAME,
},
+ .user_ns = &init_user_ns,
};
EXPORT_SYMBOL_GPL(init_uts_ns);

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f74e6c0..5a22dcf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -74,6 +74,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}
+ put_user_ns(new_nsp->uts_ns->user_ns);
+ new_nsp->uts_ns->user_ns = task_cred_xxx(tsk, user)->user_ns;
+ get_user_ns(new_nsp->uts_ns->user_ns);

new_nsp->ipc_ns = copy_ipcs(flags, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
diff --git a/kernel/user.c b/kernel/user.c
index 2c7d8d5..5125681 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,9 +17,13 @@
#include <linux/module.h>
#include <linux/user_namespace.h>

+/*
+ * userns count is 1 for root user, 1 for init_uts_ns,
+ * and 1 for... ?
+ */
struct user_namespace init_user_ns = {
.kref = {
- .refcount = ATOMIC_INIT(2),
+ .refcount = ATOMIC_INIT(3),
},
.creator = &root_user,
};
@@ -47,7 +51,7 @@ static struct kmem_cache *uid_cachep;
*/
static DEFINE_SPINLOCK(uidhash_lock);

-/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->creator */
+/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */
struct user_struct root_user = {
.__count = ATOMIC_INIT(2),
.processes = ATOMIC_INIT(1),
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..a7b3a8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,6 +14,7 @@
#include <linux/utsname.h>
#include <linux/err.h>
#include <linux/slab.h>
+#include <linux/user_namespace.h>

static struct uts_namespace *create_uts_ns(void)
{
@@ -40,6 +41,8 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)

down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
+ ns->user_ns = old_ns->user_ns;
+ get_user_ns(ns->user_ns);
up_read(&uts_sem);
return ns;
}
@@ -71,5 +74,6 @@ void free_uts_ns(struct kref *kref)
struct uts_namespace *ns;

ns = container_of(kref, struct uts_namespace, kref);
+ put_user_ns(ns->user_ns);
kfree(ns);
}
--
1.7.0.4

2010-12-17 15:25:29

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC 2/5] user namespaces: make capabilities relative to the user namespace.

- Introduce ns_capable to test for a capability in a non-default
user namespace.
- Teach cap_capable to handle capabilities in a non-default
user namespace.

The motivation is to get to the unprivileged creation of new
namespaces. It looks like this gets us 90% of the way there, with
only potential uid confusion issues left.

I still need to handle getting all caps after creation but otherwise I
think I have a good starter patch that achieves all of your goals.

Changelog:
11/05/2010: [serge] add apparmor
12/14/2010: [serge] fix capabilities to created user namespaces
Without this, if user serge creates a user_ns, he won't have
capabilities to the user_ns he created. THis is because we
were first checking whether his effective caps had the caps
he needed and returning -EPERM if not, and THEN checking whether
he was the creator. Reverse those checks.
12/16/2010: [serge] security_real_capable needs ns argument in !security case

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 7 +++++--
include/linux/security.h | 22 ++++++++++++----------
kernel/capability.c | 22 ++++++++++++++++++++--
security/apparmor/lsm.c | 5 +++--
security/commoncap.c | 40 +++++++++++++++++++++++++++++++++-------
security/security.c | 12 ++++++------
security/selinux/hooks.c | 14 +++++++++-----
7 files changed, 88 insertions(+), 34 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 90012b9..cc3e976 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -541,7 +541,7 @@ extern const kernel_cap_t __cap_init_eff_set;
*
* Note that this does not set PF_SUPERPRIV on the task.
*/
-#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0)
+#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)

/**
* has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
@@ -555,9 +555,12 @@ extern const kernel_cap_t __cap_init_eff_set;
* Note that this does not set PF_SUPERPRIV on the task.
*/
#define has_capability_noaudit(t, cap) \
- (security_real_capable_noaudit((t), (cap)) == 0)
+ (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0)

+struct user_namespace;
+extern struct user_namespace init_user_ns;
extern int capable(int cap);
+extern int ns_capable(struct user_namespace *ns, int cap);

/* audit system wants to get cap info from files as well */
struct dentry;
diff --git a/include/linux/security.h b/include/linux/security.h
index 39f5b7e..2141f5a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -46,13 +46,14 @@

struct ctl_table;
struct audit_krule;
+struct user_namespace;

/*
* These functions are in security/capability.c and are used
* as the default capabilities functions
*/
extern int cap_capable(struct task_struct *tsk, const struct cred *cred,
- int cap, int audit);
+ struct user_namespace *ns, int cap, int audit);
extern int cap_settime(struct timespec *ts, struct timezone *tz);
extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -1258,6 +1259,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* credentials.
* @tsk contains the task_struct for the process.
* @cred contains the credentials to use.
+ * @ns contains the user namespace we want the capability in
* @cap contains the capability <include/linux/capability.h>.
* @audit: Whether to write an audit message or not
* Return 0 if the capability is granted for @tsk.
@@ -1386,7 +1388,7 @@ struct security_operations {
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted);
int (*capable) (struct task_struct *tsk, const struct cred *cred,
- int cap, int audit);
+ struct user_namespace *ns, int cap, int audit);
int (*sysctl) (struct ctl_table *table, int op);
int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
int (*quota_on) (struct dentry *dentry);
@@ -1668,9 +1670,9 @@ int security_capset(struct cred *new, const struct cred *old,
const kernel_cap_t *effective,
const kernel_cap_t *inheritable,
const kernel_cap_t *permitted);
-int security_capable(int cap);
-int security_real_capable(struct task_struct *tsk, int cap);
-int security_real_capable_noaudit(struct task_struct *tsk, int cap);
+int security_capable(struct user_namespace *ns, int cap);
+int security_real_capable(struct task_struct *tsk, struct user_namespace *ns, int cap);
+int security_real_capable_noaudit(struct task_struct *tsk, struct user_namespace *ns, int cap);
int security_sysctl(struct ctl_table *table, int op);
int security_quotactl(int cmds, int type, int id, struct super_block *sb);
int security_quota_on(struct dentry *dentry);
@@ -1864,26 +1866,26 @@ static inline int security_capset(struct cred *new,

static inline int security_capable(int cap)
{
- return cap_capable(current, current_cred(), cap, SECURITY_CAP_AUDIT);
+ return cap_capable(current, current_cred(), &init_user_ns, cap, SECURITY_CAP_AUDIT);
}

-static inline int security_real_capable(struct task_struct *tsk, int cap)
+static inline int security_real_capable(struct task_struct *tsk, struct user_namespace *ns, int cap)
{
int ret;

rcu_read_lock();
- ret = cap_capable(tsk, __task_cred(tsk), cap, SECURITY_CAP_AUDIT);
+ ret = cap_capable(tsk, __task_cred(tsk), ns, cap, SECURITY_CAP_AUDIT);
rcu_read_unlock();
return ret;
}

static inline
-int security_real_capable_noaudit(struct task_struct *tsk, int cap)
+int security_real_capable_noaudit(struct task_struct *tsk, struct user_namespace *ns, int cap)
{
int ret;

rcu_read_lock();
- ret = cap_capable(tsk, __task_cred(tsk), cap,
+ ret = cap_capable(tsk, __task_cred(tsk), ns, cap,
SECURITY_CAP_NOAUDIT);
rcu_read_unlock();
return ret;
diff --git a/kernel/capability.c b/kernel/capability.c
index 2f05303..744dd6e 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -14,6 +14,7 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
#include <asm/uaccess.h>

/*
@@ -301,15 +302,32 @@ error:
*/
int capable(int cap)
{
+ return ns_capable(&init_user_ns, cap);
+}
+EXPORT_SYMBOL(capable);
+
+/**
+ * ns_capable - Determine if the current task has a superior capability in effect
+ * @ns: The usernamespace we want the capability in
+ * @cap: The capability to be tested for
+ *
+ * Return true if the current task has the given superior capability currently
+ * available for use, false if not.
+ *
+ * This sets PF_SUPERPRIV on the task if the capability is available on the
+ * assumption that it's about to be used.
+ */
+int ns_capable(struct user_namespace *ns, int cap)
+{
if (unlikely(!cap_valid(cap))) {
printk(KERN_CRIT "capable() called with invalid cap=%u\n", cap);
BUG();
}

- if (security_capable(cap) == 0) {
+ if (security_capable(ns, cap) == 0) {
current->flags |= PF_SUPERPRIV;
return 1;
}
return 0;
}
-EXPORT_SYMBOL(capable);
+EXPORT_SYMBOL(ns_capable);
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index fa778a7..00d227f 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -22,6 +22,7 @@
#include <linux/ctype.h>
#include <linux/sysctl.h>
#include <linux/audit.h>
+#include <linux/user_namespace.h>
#include <net/sock.h>

#include "include/apparmor.h"
@@ -137,11 +138,11 @@ static int apparmor_capget(struct task_struct *target, kernel_cap_t *effective,
}

static int apparmor_capable(struct task_struct *task, const struct cred *cred,
- int cap, int audit)
+ struct user_namespace *ns, int cap, int audit)
{
struct aa_profile *profile;
/* cap_capable returns 0 on success, else -EPERM */
- int error = cap_capable(task, cred, cap, audit);
+ int error = cap_capable(task, cred, ns, cap, audit);
if (!error) {
profile = aa_cred_profile(cred);
if (!unconfined(profile))
diff --git a/security/commoncap.c b/security/commoncap.c
index e58b5d8..9d910e6 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/prctl.h>
#include <linux/securebits.h>
+#include <linux/user_namespace.h>

/*
* If a non-root user executes a setuid-root binary in
@@ -68,6 +69,7 @@ EXPORT_SYMBOL(cap_netlink_recv);
* cap_capable - Determine whether a task has a particular effective capability
* @tsk: The task to query
* @cred: The credentials to use
+ * @ns: The user namespace in which we need the capability
* @cap: The capability to check for
* @audit: Whether to write an audit message or not
*
@@ -79,10 +81,32 @@ EXPORT_SYMBOL(cap_netlink_recv);
* cap_has_capability() returns 0 when a task has a capability, but the
* kernel's capable() and has_capability() returns 1 for this case.
*/
-int cap_capable(struct task_struct *tsk, const struct cred *cred, int cap,
- int audit)
+int cap_capable(struct task_struct *tsk, const struct cred *cred,
+ struct user_namespace *targ_ns, int cap, int audit)
{
- return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
+ for (;;) {
+ /* The creator of the user namespace has all caps. */
+ if (targ_ns->creator == cred->user)
+ return 0;
+
+ /* Do we have the necessary capabilities? */
+ if (targ_ns == cred->user->user_ns)
+ return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;
+
+ /* Have we tried all of the parent namespaces? */
+ if (targ_ns == &init_user_ns)
+ return -EPERM;
+
+ /* If you have the capability in a parent user ns you have it
+ * in the over all children user namespaces as well, so see
+ * if this process has the capability in the parent user
+ * namespace.
+ */
+ targ_ns = targ_ns->creator->user_ns;
+ }
+
+ /* We never get here */
+ return -EPERM;
}

/**
@@ -177,7 +201,8 @@ static inline int cap_inh_is_capped(void)
/* they are so limited unless the current task has the CAP_SETPCAP
* capability
*/
- if (cap_capable(current, current_cred(), CAP_SETPCAP,
+ if (cap_capable(current, current_cred(),
+ current_cred()->user->user_ns, CAP_SETPCAP,
SECURITY_CAP_AUDIT) == 0)
return 0;
return 1;
@@ -829,7 +854,8 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
& (new->securebits ^ arg2)) /*[1]*/
|| ((new->securebits & SECURE_ALL_LOCKS & ~arg2)) /*[2]*/
|| (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/
- || (cap_capable(current, current_cred(), CAP_SETPCAP,
+ || (cap_capable(current, current_cred(),
+ current_cred()->user->user_ns, CAP_SETPCAP,
SECURITY_CAP_AUDIT) != 0) /*[4]*/
/*
* [1] no changing of bits that are locked
@@ -894,7 +920,7 @@ int cap_vm_enough_memory(struct mm_struct *mm, long pages)
{
int cap_sys_admin = 0;

- if (cap_capable(current, current_cred(), CAP_SYS_ADMIN,
+ if (cap_capable(current, current_cred(), &init_user_ns, CAP_SYS_ADMIN,
SECURITY_CAP_NOAUDIT) == 0)
cap_sys_admin = 1;
return __vm_enough_memory(mm, pages, cap_sys_admin);
@@ -921,7 +947,7 @@ int cap_file_mmap(struct file *file, unsigned long reqprot,
int ret = 0;

if (addr < dac_mmap_min_addr) {
- ret = cap_capable(current, current_cred(), CAP_SYS_RAWIO,
+ ret = cap_capable(current, current_cred(), &init_user_ns, CAP_SYS_RAWIO,
SECURITY_CAP_AUDIT);
/* set PF_SUPERPRIV if it turns out we allow the low mmap */
if (ret == 0)
diff --git a/security/security.c b/security/security.c
index a774256..a7c1a1f 100644
--- a/security/security.c
+++ b/security/security.c
@@ -172,30 +172,30 @@ int security_capset(struct cred *new, const struct cred *old,
effective, inheritable, permitted);
}

-int security_capable(int cap)
+int security_capable(struct user_namespace *ns, int cap)
{
- return security_ops->capable(current, current_cred(), cap,
+ return security_ops->capable(current, current_cred(), ns, cap,
SECURITY_CAP_AUDIT);
}

-int security_real_capable(struct task_struct *tsk, int cap)
+int security_real_capable(struct task_struct *tsk, struct user_namespace *ns, int cap)
{
const struct cred *cred;
int ret;

cred = get_task_cred(tsk);
- ret = security_ops->capable(tsk, cred, cap, SECURITY_CAP_AUDIT);
+ ret = security_ops->capable(tsk, cred, ns, cap, SECURITY_CAP_AUDIT);
put_cred(cred);
return ret;
}

-int security_real_capable_noaudit(struct task_struct *tsk, int cap)
+int security_real_capable_noaudit(struct task_struct *tsk, struct user_namespace *ns, int cap)
{
const struct cred *cred;
int ret;

cred = get_task_cred(tsk);
- ret = security_ops->capable(tsk, cred, cap, SECURITY_CAP_NOAUDIT);
+ ret = security_ops->capable(tsk, cred, ns, cap, SECURITY_CAP_NOAUDIT);
put_cred(cred);
return ret;
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 65fa8bf..b9a6a53 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -77,6 +77,7 @@
#include <linux/mutex.h>
#include <linux/posix-timers.h>
#include <linux/syslog.h>
+#include <linux/user_namespace.h>

#include "avc.h"
#include "objsec.h"
@@ -1423,6 +1424,7 @@ static int current_has_perm(const struct task_struct *tsk,
/* Check whether a task is allowed to use a capability. */
static int task_has_capability(struct task_struct *tsk,
const struct cred *cred,
+ struct user_namespace *ns,
int cap, int audit)
{
struct common_audit_data ad;
@@ -1851,15 +1853,15 @@ static int selinux_capset(struct cred *new, const struct cred *old,
*/

static int selinux_capable(struct task_struct *tsk, const struct cred *cred,
- int cap, int audit)
+ struct user_namespace *ns, int cap, int audit)
{
int rc;

- rc = cap_capable(tsk, cred, cap, audit);
+ rc = cap_capable(tsk, cred, ns, cap, audit);
if (rc)
return rc;

- return task_has_capability(tsk, cred, cap, audit);
+ return task_has_capability(tsk, cred, ns, cap, audit);
}

static int selinux_sysctl_get_sid(ctl_table *table, u16 tclass, u32 *sid)
@@ -2012,7 +2014,8 @@ static int selinux_vm_enough_memory(struct mm_struct *mm, long pages)
{
int rc, cap_sys_admin = 0;

- rc = selinux_capable(current, current_cred(), CAP_SYS_ADMIN,
+ rc = selinux_capable(current, current_cred(),
+ &init_user_ns, CAP_SYS_ADMIN,
SECURITY_CAP_NOAUDIT);
if (rc == 0)
cap_sys_admin = 1;
@@ -2826,7 +2829,8 @@ static int selinux_inode_getsecurity(const struct inode *inode, const char *name
* and lack of permission just means that we fall back to the
* in-core context value, not a denial.
*/
- error = selinux_capable(current, current_cred(), CAP_MAC_ADMIN,
+ error = selinux_capable(current, current_cred(),
+ &init_user_ns, CAP_MAC_ADMIN,
SECURITY_CAP_NOAUDIT);
if (!error)
error = security_sid_to_context_force(isec->sid, &context,
--
1.7.0.4

2010-12-17 15:26:06

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC 3/5] user namespaces: allow sethostname in a container

Signed-off-by: Serge E. Hallyn <[email protected]>
---
kernel/sys.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2745dcd..9b9b03b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1171,7 +1171,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
--
1.7.0.4

2010-12-17 15:26:40

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
for clarity

Signed-off-by: Serge E. Hallyn <[email protected]>
---
kernel/signal.c | 33 ++++++++++++++++++++++++++++-----
1 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..499bd36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,6 +636,33 @@ static inline bool si_fromuser(const struct siginfo *info)
}

/*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+ struct cred *cred = current_cred();
+ struct cred *tcred = __task_cred(t);
+
+ if (cred->user->user_ns != tcred->user->user_ns) {
+ /* userids are not equivalent - either you have the
+ capability to the target user ns or you don't */
+ if (ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 1;
+ return 0;
+ }
+
+ /* same user namespace - usual credentials checks apply */
+ if ((cred->euid ^ tcred->suid) &&
+ (cred->euid ^ tcred->uid) &&
+ (cred->uid ^ tcred->suid) &&
+ (cred->uid ^ tcred->uid) &&
+ !ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 0;
+
+ return 1;
+}
+
+/*
* Bad permissions for sending the signal
* - the caller must hold the RCU read lock
*/
@@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
cred = current_cred();
tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
- (cred->euid ^ tcred->suid) &&
- (cred->euid ^ tcred->uid) &&
- (cred->uid ^ tcred->suid) &&
- (cred->uid ^ tcred->uid) &&
- !capable(CAP_KILL)) {
+ !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
--
1.7.0.4

2010-12-17 15:27:19

by Serge E. Hallyn

[permalink] [raw]
Subject: [RFC 5/5] user namespaces: Allow ptrace from non-init user namespaces

ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace). ptrace is also allowed to a user namespace to
which the current task the has CAP_SYS_PTRACE capability.

Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 2 ++
kernel/ptrace.c | 40 ++++++++++++++++++++++++++++------------
security/commoncap.c | 26 +++++++++++++++++++++-----
3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
*/
#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)

+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
/**
* has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
* @t: The task in question
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..aed24eb 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
return ret;
}

+static inline int may_ptrace_ns(struct task_struct *t)
+{
+ struct user_namespace *ns;
+ int ret;
+
+ rcu_read_lock();
+ ns = task_cred_xxx(t, user)->user_ns;
+ ret = ns_capable(ns, CAP_SYS_PTRACE);
+ rcu_read_unlock();
+
+ return ret;
+}
+
int __ptrace_may_access(struct task_struct *task, unsigned int mode)
{
const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
- if ((cred->uid != tcred->euid ||
- cred->uid != tcred->suid ||
- cred->uid != tcred->uid ||
- cred->gid != tcred->egid ||
- cred->gid != tcred->sgid ||
- cred->gid != tcred->gid) &&
- !capable(CAP_SYS_PTRACE)) {
- rcu_read_unlock();
- return -EPERM;
- }
+ if (cred->user->user_ns == tcred->user->user_ns &&
+ (cred->uid == tcred->euid ||
+ cred->uid == tcred->suid ||
+ cred->uid == tcred->uid ||
+ cred->gid == tcred->egid ||
+ cred->gid == tcred->sgid ||
+ cred->gid == tcred->gid))
+ goto ok;
+ if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+ goto ok;
+ rcu_read_unlock();
+ return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
- if (!dumpable && !capable(CAP_SYS_PTRACE))
+ if (!dumpable && !may_ptrace_ns(task))
return -EPERM;

return security_ptrace_access_check(task, mode);
@@ -198,7 +214,7 @@ int ptrace_attach(struct task_struct *task)
goto unlock_tasklist;

task->ptrace = PT_PTRACED;
- if (capable(CAP_SYS_PTRACE))
+ if (may_ptrace_ns(task))
task->ptrace |= PT_PTRACE_CAP;

__ptrace_link(task, current);
diff --git a/security/commoncap.c b/security/commoncap.c
index 9d910e6..bd0bcc6 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -136,12 +136,20 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
{
int ret = 0;
+ struct cred *cred, *tcred;

rcu_read_lock();
- if (!cap_issubset(__task_cred(child)->cap_permitted,
- current_cred()->cap_permitted) &&
+ cred = current_cred();
+ tcred = __task_cred(child);
+ if (cred->user->user_ns != tcred->user->user_ns) {
+ if (!ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+ ret = -EPERM;
+ goto out;
+ }
+ if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
!capable(CAP_SYS_PTRACE))
ret = -EPERM;
+out:
rcu_read_unlock();
return ret;
}
@@ -156,12 +164,20 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
int cap_ptrace_traceme(struct task_struct *parent)
{
int ret = 0;
+ struct cred *cred, *tcred;

rcu_read_lock();
- if (!cap_issubset(current_cred()->cap_permitted,
- __task_cred(parent)->cap_permitted) &&
- !has_capability(parent, CAP_SYS_PTRACE))
+ cred = __task_cred(parent);
+ tcred = current_cred();
+ if (cred->user->user_ns != tcred->user->user_ns) {
+ if (!has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
+ ret = -EPERM;
+ goto out;
+ }
+ if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
+ !has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
ret = -EPERM;
+out:
rcu_read_unlock();
return ret;
}
--
1.7.0.4

2010-12-17 15:56:53

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 5:24 PM, Serge E. Hallyn <[email protected]> wrote:
> +/*
> + * userns count is 1 for root user, 1 for init_uts_ns,
> + * and 1 for... ?
> + */
> ?struct user_namespace init_user_ns = {
> ? ? ? ?.kref = {
> - ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(2),
> + ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(3),

+1 is for init_nsproxy ;-)

2010-12-17 16:00:19

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 5:56 PM, Alexey Dobriyan <[email protected]> wrote:
> On Fri, Dec 17, 2010 at 5:24 PM, Serge E. Hallyn <[email protected]> wrote:
>> +/*
>> + * userns count is 1 for root user, 1 for init_uts_ns,
>> + * and 1 for... ?
>> + */
>> ?struct user_namespace init_user_ns = {
>> ? ? ? ?.kref = {
>> - ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(2),
>> + ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(3),
>
> +1 is for init_nsproxy ;-)

Err, no.

+1 is to ensure it's never freed, but since allocator will BUG_ON (?) if fed
with static object, maybe it's completely bogus.

2010-12-17 16:12:13

by Serge Hallyn

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

Quoting Alexey Dobriyan ([email protected]):
> On Fri, Dec 17, 2010 at 5:24 PM, Serge E. Hallyn <[email protected]> wrote:
> > +/*
> > + * userns count is 1 for root user, 1 for init_uts_ns,
> > + * and 1 for... ?
> > + */
> > ?struct user_namespace init_user_ns = {
> > ? ? ? ?.kref = {
> > - ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(2),
> > + ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(3),
>
> +1 is for init_nsproxy ;-)

Hmm, actually user_namespace isn't in struct nsproxy any more,
since 18b6e0414e42d95183f07d8177e3ff0241abd825.

I think my original thought was that init_user_ns is pinned twice by
root_user - once as creator and once as the root user in it. In
which case it isn't right - the user_ns pins the task which created
it, the creator does not pin the user_ns it creates.

-serge

2010-12-17 16:17:11

by Serge Hallyn

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

Quoting Alexey Dobriyan ([email protected]):
> On Fri, Dec 17, 2010 at 5:56 PM, Alexey Dobriyan <[email protected]> wrote:
> > On Fri, Dec 17, 2010 at 5:24 PM, Serge E. Hallyn <[email protected]> wrote:
> >> +/*
> >> + * userns count is 1 for root user, 1 for init_uts_ns,
> >> + * and 1 for... ?
> >> + */
> >> ?struct user_namespace init_user_ns = {
> >> ? ? ? ?.kref = {
> >> - ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(2),
> >> + ? ? ? ? ? ? ? .refcount ? ? ? = ATOMIC_INIT(3),
> >
> > +1 is for init_nsproxy ;-)
>
> Err, no.
>
> +1 is to ensure it's never freed, but since allocator will BUG_ON (?) if fed
> with static object, maybe it's completely bogus.

Right. On the one hand so long as it's high enough it doesn't really matter.
But I think it's worth getting the # exactly right in order to help document
what's going on. So I'll try dropping it back down to 2.

thanks,
-serge

2010-12-17 17:31:36

by Greg KH

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 03:24:58PM +0000, Serge E. Hallyn wrote:
> copy_process() handles CLONE_NEWUSER before the rest of the
> namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
> the new uts namespace will have the new user namespace as its
> owner. That is what we want, since we want root in that new
> userns to be able to have privilege over it.
>
> Signed-off-by: Serge E. Hallyn <[email protected]>
> ---
> include/linux/utsname.h | 3 +++
> init/version.c | 2 ++
> kernel/nsproxy.c | 3 +++
> kernel/user.c | 8 ++++++--
> kernel/utsname.c | 4 ++++
> 5 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/utsname.h b/include/linux/utsname.h
> index 69f3997..85171be 100644
> --- a/include/linux/utsname.h
> +++ b/include/linux/utsname.h
> @@ -37,9 +37,12 @@ struct new_utsname {
> #include <linux/nsproxy.h>
> #include <linux/err.h>
>
> +struct user_namespace;
> +
> struct uts_namespace {
> struct kref kref;
> struct new_utsname name;
> + struct user_namespace *user_ns;
> };
> extern struct uts_namespace init_uts_ns;
>
> diff --git a/init/version.c b/init/version.c
> index 79fb8c2..9eb19fb 100644
> --- a/init/version.c
> +++ b/init/version.c
> @@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
> int version_string(LINUX_VERSION_CODE);
> #endif
>
> +extern struct user_namespace init_user_ns;
> struct uts_namespace init_uts_ns = {
> .kref = {
> .refcount = ATOMIC_INIT(2),

Wait, WTF?

You have a static kref and you try to automatically instanciate it here?
As it's static, why are you even having a kref at all, what good does it
do you, you can't delete the thing, it's always around, so just remove
it entirely please.

Or, dynamically create it properly. In other words, this is majorly
broken.

thanks,

greg k-h

2010-12-17 19:26:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

Greg KH <[email protected]> writes:

> On Fri, Dec 17, 2010 at 03:24:58PM +0000, Serge E. Hallyn wrote:
>> copy_process() handles CLONE_NEWUSER before the rest of the
>> namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
>> the new uts namespace will have the new user namespace as its
>> owner. That is what we want, since we want root in that new
>> userns to be able to have privilege over it.
>>
>> Signed-off-by: Serge E. Hallyn <[email protected]>
>> ---
>> include/linux/utsname.h | 3 +++
>> init/version.c | 2 ++
>> kernel/nsproxy.c | 3 +++
>> kernel/user.c | 8 ++++++--
>> kernel/utsname.c | 4 ++++
>> 5 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/utsname.h b/include/linux/utsname.h
>> index 69f3997..85171be 100644
>> --- a/include/linux/utsname.h
>> +++ b/include/linux/utsname.h
>> @@ -37,9 +37,12 @@ struct new_utsname {
>> #include <linux/nsproxy.h>
>> #include <linux/err.h>
>>
>> +struct user_namespace;
>> +
>> struct uts_namespace {
>> struct kref kref;
>> struct new_utsname name;
>> + struct user_namespace *user_ns;
>> };
>> extern struct uts_namespace init_uts_ns;
>>
>> diff --git a/init/version.c b/init/version.c
>> index 79fb8c2..9eb19fb 100644
>> --- a/init/version.c
>> +++ b/init/version.c
>> @@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
>> int version_string(LINUX_VERSION_CODE);
>> #endif
>>
>> +extern struct user_namespace init_user_ns;
>> struct uts_namespace init_uts_ns = {
>> .kref = {
>> .refcount = ATOMIC_INIT(2),
>
> Wait, WTF?
>
> You have a static kref and you try to automatically instanciate it here?
> As it's static, why are you even having a kref at all, what good does it
> do you, you can't delete the thing, it's always around, so just remove
> it entirely please.
>
> Or, dynamically create it properly. In other words, this is majorly
> broken.

There is a very weird case for the data structures the initial task has
references to. The initial task never goes away and so those data
structure never go away. Furthermore we need many of those data
structures before we have a memory allocator ready. So we statically
allocate a single data structure and up it's reference count to ensure
that the count never goes to zero.

There are also major benefits to have the version of something that is
never freed never going away, because it means you can just reference it
in code. So while I would be happy to say this is special don't use a
kref and roll the reference counting logic by hand, we aren't
dynamically allocating init_uts_ns any time soon.

Eric

2010-12-17 19:31:47

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

"Serge E. Hallyn" <[email protected]> writes:

> Changelog:
> Dec 8: Fixed bug in my check_kill_permission pointed out by
> Eric Biederman.
> Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
> for clarity
>
> Signed-off-by: Serge E. Hallyn <[email protected]>
> ---
> kernel/signal.c | 33 ++++++++++++++++++++++++++++-----
> 1 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 4e3cff1..499bd36 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
> cred = current_cred();
> tcred = __task_cred(t);
Nit pick you don't need to compute cred and tcred here now.

> if (!same_thread_group(current, t) &&
> - (cred->euid ^ tcred->suid) &&
> - (cred->euid ^ tcred->uid) &&
> - (cred->uid ^ tcred->suid) &&
> - (cred->uid ^ tcred->uid) &&
> - !capable(CAP_KILL)) {
> + !kill_ok_by_cred(t)) {
> switch (sig) {
> case SIGCONT:
> sid = task_session(t);

2010-12-17 19:45:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 5/5] user namespaces: Allow ptrace from non-init user namespaces

"Serge E. Hallyn" <[email protected]> writes:

> ptrace is allowed to tasks in the same user namespace according to
> the usual rules (i.e. the same rules as for two tasks in the init
> user namespace). ptrace is also allowed to a user namespace to
> which the current task the has CAP_SYS_PTRACE capability.

The uid equality check below is broken.

Eric


> Signed-off-by: Serge E. Hallyn <[email protected]>
> ---
> include/linux/capability.h | 2 ++
> kernel/ptrace.c | 40 ++++++++++++++++++++++++++++------------
> security/commoncap.c | 26 +++++++++++++++++++++-----
> 3 files changed, 51 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index cc3e976..777a166 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
> */
> #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)
>
> +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
> +
> /**
> * has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
> * @t: The task in question
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 99bbaa3..aed24eb 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
> return ret;
> }
>
> +static inline int may_ptrace_ns(struct task_struct *t)

Can we name this ptrace_capable? Since you are only
wrapping the capability check? With a name like may_ptrace_ns
I imagine very different semantics.

> +{
> + struct user_namespace *ns;
> + int ret;
> +
> + rcu_read_lock();
> + ns = task_cred_xxx(t, user)->user_ns;
> + ret = ns_capable(ns, CAP_SYS_PTRACE);
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> int __ptrace_may_access(struct task_struct *task, unsigned int mode)
> {
> const struct cred *cred = current_cred(), *tcred;
> @@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
> return 0;
> rcu_read_lock();
> tcred = __task_cred(task);
> - if ((cred->uid != tcred->euid ||
> - cred->uid != tcred->suid ||
> - cred->uid != tcred->uid ||
> - cred->gid != tcred->egid ||
> - cred->gid != tcred->sgid ||
> - cred->gid != tcred->gid) &&
> - !capable(CAP_SYS_PTRACE)) {
> - rcu_read_unlock();
> - return -EPERM;
> - }
> + if (cred->user->user_ns == tcred->user->user_ns &&
> + (cred->uid == tcred->euid ||
> + cred->uid == tcred->suid ||
> + cred->uid == tcred->uid ||
> + cred->gid == tcred->egid ||
> + cred->gid == tcred->sgid ||
> + cred->gid == tcred->gid))
> + goto ok;

This needs to be:
> + if (cred->user->user_ns == tcred->user->user_ns &&
> + (cred->uid == tcred->euid &&
> + cred->uid == tcred->suid &&
> + cred->uid == tcred->uid &&
> + cred->gid == tcred->egid &&
> + cred->gid == tcred->sgid &&
> + cred->gid == tcred->gid))
> + goto ok;



> + if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> + goto ok;
> + rcu_read_unlock();
> + return -EPERM;
> +ok:
> rcu_read_unlock();
> smp_rmb();
> if (task->mm)
> dumpable = get_dumpable(task->mm);
> - if (!dumpable && !capable(CAP_SYS_PTRACE))
> + if (!dumpable && !may_ptrace_ns(task))
> return -EPERM;
>
> return security_ptrace_access_check(task, mode);
> @@ -198,7 +214,7 @@ int ptrace_attach(struct task_struct *task)
> goto unlock_tasklist;
>
> task->ptrace = PT_PTRACED;
> - if (capable(CAP_SYS_PTRACE))
> + if (may_ptrace_ns(task))
> task->ptrace |= PT_PTRACE_CAP;
>
> __ptrace_link(task, current);
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 9d910e6..bd0bcc6 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -136,12 +136,20 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
> int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
> {
> int ret = 0;
> + struct cred *cred, *tcred;
>
> rcu_read_lock();
> - if (!cap_issubset(__task_cred(child)->cap_permitted,
> - current_cred()->cap_permitted) &&
> + cred = current_cred();
> + tcred = __task_cred(child);
> + if (cred->user->user_ns != tcred->user->user_ns) {

This probably deserves a comment about why cap_issubset isn't
needed here. Aka we implicitly have all caps in child user namespaces
so if we have CAP_SYS_PTRACE we know we have them all.

> + if (!ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> + ret = -EPERM;
> + goto out;
> + }
> + if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
> !capable(CAP_SYS_PTRACE))
> ret = -EPERM;
> +out:
> rcu_read_unlock();
> return ret;
> }
> @@ -156,12 +164,20 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
> int cap_ptrace_traceme(struct task_struct *parent)
> {
> int ret = 0;
> + struct cred *cred, *tcred;
>
> rcu_read_lock();
> - if (!cap_issubset(current_cred()->cap_permitted,
> - __task_cred(parent)->cap_permitted) &&
> - !has_capability(parent, CAP_SYS_PTRACE))
> + cred = __task_cred(parent);
> + tcred = current_cred();
> + if (cred->user->user_ns != tcred->user->user_ns) {
> + if (!has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
> + ret = -EPERM;
> + goto out;
> + }
> + if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
> + !has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
> ret = -EPERM;
> +out:
> rcu_read_unlock();
> return ret;
> }

2010-12-17 19:47:00

by Serge Hallyn

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

Quoting Greg KH ([email protected]):
> > +extern struct user_namespace init_user_ns;
> > struct uts_namespace init_uts_ns = {
> > .kref = {
> > .refcount = ATOMIC_INIT(2),
>
> Wait, WTF?
>
> You have a static kref and you try to automatically instanciate it here?

You're complaining about the pre-existing init_uts_ns right?

> As it's static, why are you even having a kref at all, what good does it
> do you, you can't delete the thing,

Can't delete this one, but can delete all the uts namespaces, obviously.
As with init_tgcred in kernel/cred.c.

It's initialized with a refcount which will keep it from ever getting
freed.

> it's always around, so just remove
> it entirely please.
>
> Or, dynamically create it properly. In other words, this is majorly
> broken.

If we create it dynamically, then I don't think we can use it the way we
do in kernel/utsname_sysctl.c for instance.

thanks,
-serge

2010-12-17 20:05:02

by Serge Hallyn

[permalink] [raw]
Subject: Re: [RFC 5/5] user namespaces: Allow ptrace from non-init user namespaces

Thanks for reviewing, Eric.

Quoting Eric W. Biederman ([email protected]):
> > +static inline int may_ptrace_ns(struct task_struct *t)
>
> Can we name this ptrace_capable? Since you are only
> wrapping the capability check? With a name like may_ptrace_ns
> I imagine very different semantics.

Hm, the whole structure here could probably stand to be improved
anyway. I just can't quite think how. I'll rename it as you
suggest for starters, just not sure if it'll continue to exist.

>
> > +{
> > + struct user_namespace *ns;
> > + int ret;
> > +
> > + rcu_read_lock();
> > + ns = task_cred_xxx(t, user)->user_ns;
> > + ret = ns_capable(ns, CAP_SYS_PTRACE);
> > + rcu_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > int __ptrace_may_access(struct task_struct *task, unsigned int mode)
> > {
> > const struct cred *cred = current_cred(), *tcred;
> > @@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
> > return 0;
> > rcu_read_lock();
> > tcred = __task_cred(task);
> > - if ((cred->uid != tcred->euid ||
> > - cred->uid != tcred->suid ||
> > - cred->uid != tcred->uid ||
> > - cred->gid != tcred->egid ||
> > - cred->gid != tcred->sgid ||
> > - cred->gid != tcred->gid) &&
> > - !capable(CAP_SYS_PTRACE)) {
> > - rcu_read_unlock();
> > - return -EPERM;
> > - }
> > + if (cred->user->user_ns == tcred->user->user_ns &&
> > + (cred->uid == tcred->euid ||
> > + cred->uid == tcred->suid ||
> > + cred->uid == tcred->uid ||
> > + cred->gid == tcred->egid ||
> > + cred->gid == tcred->sgid ||
> > + cred->gid == tcred->gid))
> > + goto ok;
>
> This needs to be:
> > + if (cred->user->user_ns == tcred->user->user_ns &&
> > + (cred->uid == tcred->euid &&
> > + cred->uid == tcred->suid &&
> > + cred->uid == tcred->uid &&
> > + cred->gid == tcred->egid &&
> > + cred->gid == tcred->sgid &&
> > + cred->gid == tcred->gid))
> > + goto ok;

Hm, I started to explain why it doesn't, but you're right.
If any of the uids are different, then you must have
CAP_SYS_PTRACE or be denied.

> > --- a/security/commoncap.c
> > +++ b/security/commoncap.c
> > @@ -136,12 +136,20 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
> > int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
> > {
> > int ret = 0;
> > + struct cred *cred, *tcred;
> >
> > rcu_read_lock();
> > - if (!cap_issubset(__task_cred(child)->cap_permitted,
> > - current_cred()->cap_permitted) &&
> > + cred = current_cred();
> > + tcred = __task_cred(child);
> > + if (cred->user->user_ns != tcred->user->user_ns) {
>
> This probably deserves a comment about why cap_issubset isn't
> needed here. Aka we implicitly have all caps in child user namespaces
> so if we have CAP_SYS_PTRACE we know we have them all.

(going strictly by the rules which fall out from the original intent
of ns_capable) :

There is a case where that isn't true - if I'm user B in userns 3, and
user A in userns 3 created the userns 4 in which this target task, owned
by user C, sits. Then user B does not have all capabilities to userns 4,
but any calculated capabilities which B has, are also valid in userns 4.

I'd still claim that capabilities aren't really comparable (because
they are targeted at different user namespaces), and therefore the
CAP_SYS_PTRACE should be sufficient for this case. But maybe that's
not as practical. Maybe the cap_issubset check should be there after
all.

> > + if (!ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
> > + ret = -EPERM;
> > + goto out;
> > + }
> > + if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
> > !capable(CAP_SYS_PTRACE))
> > ret = -EPERM;
> > +out:
> > rcu_read_unlock();
> > return ret;
> > }
> > @@ -156,12 +164,20 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
> > int cap_ptrace_traceme(struct task_struct *parent)
> > {
> > int ret = 0;
> > + struct cred *cred, *tcred;
> >
> > rcu_read_lock();
> > - if (!cap_issubset(current_cred()->cap_permitted,
> > - __task_cred(parent)->cap_permitted) &&
> > - !has_capability(parent, CAP_SYS_PTRACE))
> > + cred = __task_cred(parent);
> > + tcred = current_cred();
> > + if (cred->user->user_ns != tcred->user->user_ns) {
> > + if (!has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
> > + ret = -EPERM;
> > + goto out;
> > + }
> > + if (!cap_issubset(tcred->cap_permitted, cred->cap_permitted) &&
> > + !has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
> > ret = -EPERM;
> > +out:
> > rcu_read_unlock();
> > return ret;
> > }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-12-17 20:07:48

by Greg KH

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 01:46:45PM -0600, Serge Hallyn wrote:
> Quoting Greg KH ([email protected]):
> > > +extern struct user_namespace init_user_ns;
> > > struct uts_namespace init_uts_ns = {
> > > .kref = {
> > > .refcount = ATOMIC_INIT(2),
> >
> > Wait, WTF?
> >
> > You have a static kref and you try to automatically instanciate it here?
>
> You're complaining about the pre-existing init_uts_ns right?

Yup.

> > As it's static, why are you even having a kref at all, what good does it
> > do you, you can't delete the thing,
>
> Can't delete this one, but can delete all the uts namespaces, obviously.
> As with init_tgcred in kernel/cred.c.
>
> It's initialized with a refcount which will keep it from ever getting
> freed.

That's my point. You are getting "lucky" by creating a static kref.
Which is a totally pointless thing to do, right?

> > it's always around, so just remove
> > it entirely please.
> >
> > Or, dynamically create it properly. In other words, this is majorly
> > broken.
>
> If we create it dynamically, then I don't think we can use it the way we
> do in kernel/utsname_sysctl.c for instance.

Why not? It's just a pointer to the structure instead of the structure
itself, right?

thanks,

greg k-h

2010-12-17 20:08:00

by Greg KH

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 11:26:50AM -0800, Eric W. Biederman wrote:
> Greg KH <[email protected]> writes:
>
> > On Fri, Dec 17, 2010 at 03:24:58PM +0000, Serge E. Hallyn wrote:
> >> copy_process() handles CLONE_NEWUSER before the rest of the
> >> namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
> >> the new uts namespace will have the new user namespace as its
> >> owner. That is what we want, since we want root in that new
> >> userns to be able to have privilege over it.
> >>
> >> Signed-off-by: Serge E. Hallyn <[email protected]>
> >> ---
> >> include/linux/utsname.h | 3 +++
> >> init/version.c | 2 ++
> >> kernel/nsproxy.c | 3 +++
> >> kernel/user.c | 8 ++++++--
> >> kernel/utsname.c | 4 ++++
> >> 5 files changed, 18 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/utsname.h b/include/linux/utsname.h
> >> index 69f3997..85171be 100644
> >> --- a/include/linux/utsname.h
> >> +++ b/include/linux/utsname.h
> >> @@ -37,9 +37,12 @@ struct new_utsname {
> >> #include <linux/nsproxy.h>
> >> #include <linux/err.h>
> >>
> >> +struct user_namespace;
> >> +
> >> struct uts_namespace {
> >> struct kref kref;
> >> struct new_utsname name;
> >> + struct user_namespace *user_ns;
> >> };
> >> extern struct uts_namespace init_uts_ns;
> >>
> >> diff --git a/init/version.c b/init/version.c
> >> index 79fb8c2..9eb19fb 100644
> >> --- a/init/version.c
> >> +++ b/init/version.c
> >> @@ -21,6 +21,7 @@ extern int version_string(LINUX_VERSION_CODE);
> >> int version_string(LINUX_VERSION_CODE);
> >> #endif
> >>
> >> +extern struct user_namespace init_user_ns;
> >> struct uts_namespace init_uts_ns = {
> >> .kref = {
> >> .refcount = ATOMIC_INIT(2),
> >
> > Wait, WTF?
> >
> > You have a static kref and you try to automatically instanciate it here?
> > As it's static, why are you even having a kref at all, what good does it
> > do you, you can't delete the thing, it's always around, so just remove
> > it entirely please.
> >
> > Or, dynamically create it properly. In other words, this is majorly
> > broken.
>
> There is a very weird case for the data structures the initial task has
> references to. The initial task never goes away and so those data
> structure never go away. Furthermore we need many of those data
> structures before we have a memory allocator ready. So we statically
> allocate a single data structure and up it's reference count to ensure
> that the count never goes to zero.

Why not just dynamically create this structure once and then, if what
you say is really true, you never have to worry about the structure
going away, right?

> There are also major benefits to have the version of something that is
> never freed never going away, because it means you can just reference it
> in code. So while I would be happy to say this is special don't use a
> kref and roll the reference counting logic by hand, we aren't
> dynamically allocating init_uts_ns any time soon.

Why have a reference count at all if it's not needed or used here?

confused,

greg k-h

2010-12-17 20:08:43

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

Quoting Eric W. Biederman ([email protected]):
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
> > cred = current_cred();
> > tcred = __task_cred(t);
> Nit pick you don't need to compute cred and tcred here now.

Just to make sure I understand right: you mean wait until after the
same_thread_group() check to save calculation in that case, right?

> > if (!same_thread_group(current, t) &&
> > - (cred->euid ^ tcred->suid) &&
> > - (cred->euid ^ tcred->uid) &&
> > - (cred->uid ^ tcred->suid) &&
> > - (cred->uid ^ tcred->uid) &&
> > - !capable(CAP_KILL)) {
> > + !kill_ok_by_cred(t)) {
> > switch (sig) {
> > case SIGCONT:
> > sid = task_session(t);

2010-12-17 20:18:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

"Serge E. Hallyn" <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>> > --- a/kernel/signal.c
>> > +++ b/kernel/signal.c
>> > @@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
>> > cred = current_cred();
>> > tcred = __task_cred(t);
>> Nit pick you don't need to compute cred and tcred here now.
>
> Just to make sure I understand right: you mean wait until after the
> same_thread_group() check to save calculation in that case, right?

I mean cred and tcred are only use in kill_ok_by_cred.
So we can eliminate those two variables from check_kill_permission.

>> > if (!same_thread_group(current, t) &&
>> > - (cred->euid ^ tcred->suid) &&
>> > - (cred->euid ^ tcred->uid) &&
>> > - (cred->uid ^ tcred->suid) &&
>> > - (cred->uid ^ tcred->uid) &&
>> > - !capable(CAP_KILL)) {
>> > + !kill_ok_by_cred(t)) {
>> > switch (sig) {
>> > case SIGCONT:
>> > sid = task_session(t);


Eric

2010-12-17 20:22:01

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

Quoting Eric W. Biederman ([email protected]):
> >> > cred = current_cred();
> >> > tcred = __task_cred(t);
> >> Nit pick you don't need to compute cred and tcred here now.
> >
> > Just to make sure I understand right: you mean wait until after the
> > same_thread_group() check to save calculation in that case, right?
>
> I mean cred and tcred are only use in kill_ok_by_cred.
> So we can eliminate those two variables from check_kill_permission.

D'oh. Should've looked at the original tree, not the context. Got it,
thanks.

-serge

2010-12-17 22:41:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

Greg KH <[email protected]> writes:

>> There are also major benefits to have the version of something that is
>> never freed never going away, because it means you can just reference it
>> in code. So while I would be happy to say this is special don't use a
>> kref and roll the reference counting logic by hand, we aren't
>> dynamically allocating init_uts_ns any time soon.
>
> Why have a reference count at all if it's not needed or used here?

We have to reference count every other uts namespace.

Eric

2010-12-17 23:15:39

by Greg KH

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 12:40:11PM -0800, Eric W. Biederman wrote:
> Greg KH <[email protected]> writes:
>
> >> There are also major benefits to have the version of something that is
> >> never freed never going away, because it means you can just reference it
> >> in code. So while I would be happy to say this is special don't use a
> >> kref and roll the reference counting logic by hand, we aren't
> >> dynamically allocating init_uts_ns any time soon.
> >
> > Why have a reference count at all if it's not needed or used here?
>
> We have to reference count every other uts namespace.

Ok, that makes sense, then also please dynamically create this one, do
not create a static kref.

thanks,

greg k-h

2010-12-18 06:33:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

Greg KH <[email protected]> writes:

> On Fri, Dec 17, 2010 at 12:40:11PM -0800, Eric W. Biederman wrote:
>> Greg KH <[email protected]> writes:
>>
>> >> There are also major benefits to have the version of something that is
>> >> never freed never going away, because it means you can just reference it
>> >> in code. So while I would be happy to say this is special don't use a
>> >> kref and roll the reference counting logic by hand, we aren't
>> >> dynamically allocating init_uts_ns any time soon.
>> >
>> > Why have a reference count at all if it's not needed or used here?
>>
>> We have to reference count every other uts namespace.
>
> Ok, that makes sense, then also please dynamically create this one, do
> not create a static kref.

Nope. It's a bad idea. It messes up the kernel bootstrap if you do
that, and it makes this one structure different from every other
structure init_task uses.

Eric

2010-12-18 17:55:57

by Greg KH

[permalink] [raw]
Subject: Re: [RFC 1/5] user namespaces: Add a user_namespace as creator/owner of uts_namespace

On Fri, Dec 17, 2010 at 10:32:52PM -0800, Eric W. Biederman wrote:
> Greg KH <[email protected]> writes:
>
> > On Fri, Dec 17, 2010 at 12:40:11PM -0800, Eric W. Biederman wrote:
> >> Greg KH <[email protected]> writes:
> >>
> >> >> There are also major benefits to have the version of something that is
> >> >> never freed never going away, because it means you can just reference it
> >> >> in code. So while I would be happy to say this is special don't use a
> >> >> kref and roll the reference counting logic by hand, we aren't
> >> >> dynamically allocating init_uts_ns any time soon.
> >> >
> >> > Why have a reference count at all if it's not needed or used here?
> >>
> >> We have to reference count every other uts namespace.
> >
> > Ok, that makes sense, then also please dynamically create this one, do
> > not create a static kref.
>
> Nope. It's a bad idea. It messes up the kernel bootstrap if you do
> that, and it makes this one structure different from every other
> structure init_task uses.

{sigh}

Ok, but I really don't like this use.

Also, don't go messing with that ATOMIC_INIT() to be a higher value, as
this patch series did, as that really implies that it is being used
incorrectly, right?

thanks,

greg k-h

2011-01-01 04:44:29

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC 4/5] user namespaces: allow killing tasks in your own or child userns

Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Quoting Eric W. Biederman ([email protected]):
> >> > --- a/kernel/signal.c
> >> > +++ b/kernel/signal.c
> >> > @@ -659,11 +686,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
> >> > cred = current_cred();
> >> > tcred = __task_cred(t);
> >> Nit pick you don't need to compute cred and tcred here now.
> >
> > Just to make sure I understand right: you mean wait until after the
> > same_thread_group() check to save calculation in that case, right?
>
> I mean cred and tcred are only use in kill_ok_by_cred.
> So we can eliminate those two variables from check_kill_permission.

Thanks for the review. Here is an updated version.

Subject: [PATCH 4/5] allow killing tasks in your own or child userns

Changelog:
Dec 8: Fixed bug in my check_kill_permission pointed out by
Eric Biederman.
Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred()
for clarity
Dec 31: address comment by Eric Biederman:
don't need cred/tcred in check_kill_permission.

Signed-off-by: Serge E. Hallyn <[email protected]>
---
kernel/signal.c | 36 ++++++++++++++++++++++++++++--------
1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..d890c99 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -636,13 +636,39 @@ static inline bool si_fromuser(const struct siginfo *info)
}

/*
+ * called with RCU read lock from check_kill_permission()
+ */
+static inline int kill_ok_by_cred(struct task_struct *t)
+{
+ struct cred *cred = current_cred();
+ struct cred *tcred = __task_cred(t);
+
+ if (cred->user->user_ns != tcred->user->user_ns) {
+ /* userids are not equivalent - either you have the
+ capability to the target user ns or you don't */
+ if (ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 1;
+ return 0;
+ }
+
+ /* same user namespace - usual credentials checks apply */
+ if ((cred->euid ^ tcred->suid) &&
+ (cred->euid ^ tcred->uid) &&
+ (cred->uid ^ tcred->suid) &&
+ (cred->uid ^ tcred->uid) &&
+ !ns_capable(tcred->user->user_ns, CAP_KILL))
+ return 0;
+
+ return 1;
+}
+
+/*
* Bad permissions for sending the signal
* - the caller must hold the RCU read lock
*/
static int check_kill_permission(int sig, struct siginfo *info,
struct task_struct *t)
{
- const struct cred *cred, *tcred;
struct pid *sid;
int error;

@@ -656,14 +682,8 @@ static int check_kill_permission(int sig, struct siginfo *info,
if (error)
return error;

- cred = current_cred();
- tcred = __task_cred(t);
if (!same_thread_group(current, t) &&
- (cred->euid ^ tcred->suid) &&
- (cred->euid ^ tcred->uid) &&
- (cred->uid ^ tcred->suid) &&
- (cred->uid ^ tcred->uid) &&
- !capable(CAP_KILL)) {
+ !kill_ok_by_cred(t)) {
switch (sig) {
case SIGCONT:
sid = task_session(t);
--
1.7.0.4

2011-01-01 04:46:14

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC 5/5] user namespaces: Allow ptrace from non-init user namespaces

Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > ptrace is allowed to tasks in the same user namespace according to
> > the usual rules (i.e. the same rules as for two tasks in the init
> > user namespace). ptrace is also allowed to a user namespace to
> > which the current task the has CAP_SYS_PTRACE capability.
>
> The uid equality check below is broken.

Thanks for the review, Eric. Updated version appended. Assuming there
are no big problems with this version, I hope to do setuid/setgid and
start the simplest vfs access controls next.

Subject: [PATCH 5/5] Allow ptrace from non-init user namespaces

ptrace is allowed to tasks in the same user namespace according to
the usual rules (i.e. the same rules as for two tasks in the init
user namespace). ptrace is also allowed to a user namespace to
which the current task the has CAP_SYS_PTRACE capability.

Changelog:
Dec 31: Address feedback by Eric:
. Correct ptrace uid check
. Rename may_ptrace_ns to ptrace_capable
. Also fix the cap_ptrace checks.

Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 2 +
include/linux/user_namespace.h | 9 +++++++
kernel/ptrace.c | 40 +++++++++++++++++++++++----------
kernel/user_namespace.c | 16 +++++++++++++
security/commoncap.c | 48 +++++++++++++++++++++++++++++++++------
5 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cc3e976..777a166 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -543,6 +543,8 @@ extern const kernel_cap_t __cap_init_eff_set;
*/
#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0)

+#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0)
+
/**
* has_capability_noaudit - Determine if a task has a superior capability available (unaudited)
* @t: The task in question
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8178156..91c4f10 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid);
gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid);

+int same_or_ancestor_user_ns(struct task_struct *task,
+ struct task_struct *victim);
+
#else

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -66,6 +69,12 @@ static inline gid_t user_ns_map_gid(struct user_namespace *to,
return gid;
}

+static inline int same_or_ancestor_user_ns(struct task_struct *task,
+ struct task_struct *victim)
+{
+ return 1;
+}
+
#endif

#endif /* _LINUX_USER_H */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..88e3fb3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -116,6 +116,19 @@ int ptrace_check_attach(struct task_struct *child, int kill)
return ret;
}

+static inline int ptrace_capable(struct task_struct *t)
+{
+ struct user_namespace *ns;
+ int ret;
+
+ rcu_read_lock();
+ ns = task_cred_xxx(t, user)->user_ns;
+ ret = ns_capable(ns, CAP_SYS_PTRACE);
+ rcu_read_unlock();
+
+ return ret;
+}
+
int __ptrace_may_access(struct task_struct *task, unsigned int mode)
{
const struct cred *cred = current_cred(), *tcred;
@@ -134,21 +147,24 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode)
return 0;
rcu_read_lock();
tcred = __task_cred(task);
- if ((cred->uid != tcred->euid ||
- cred->uid != tcred->suid ||
- cred->uid != tcred->uid ||
- cred->gid != tcred->egid ||
- cred->gid != tcred->sgid ||
- cred->gid != tcred->gid) &&
- !capable(CAP_SYS_PTRACE)) {
- rcu_read_unlock();
- return -EPERM;
- }
+ if (cred->user->user_ns == tcred->user->user_ns &&
+ (cred->uid == tcred->euid &&
+ cred->uid == tcred->suid &&
+ cred->uid == tcred->uid &&
+ cred->gid == tcred->egid &&
+ cred->gid == tcred->sgid &&
+ cred->gid == tcred->gid))
+ goto ok;
+ if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+ goto ok;
+ rcu_read_unlock();
+ return -EPERM;
+ok:
rcu_read_unlock();
smp_rmb();
if (task->mm)
dumpable = get_dumpable(task->mm);
- if (!dumpable && !capable(CAP_SYS_PTRACE))
+ if (!dumpable && !ptrace_capable(task))
return -EPERM;

return security_ptrace_access_check(task, mode);
@@ -198,7 +214,7 @@ int ptrace_attach(struct task_struct *task)
goto unlock_tasklist;

task->ptrace = PT_PTRACED;
- if (capable(CAP_SYS_PTRACE))
+ if (ptrace_capable(task))
task->ptrace |= PT_PTRACE_CAP;

__ptrace_link(task, current);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2591583..4b70999 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -126,3 +126,19 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t
/* No useful relationship so no mapping */
return overflowgid;
}
+
+int same_or_ancestor_user_ns(struct task_struct *task,
+ struct task_struct *victim)
+{
+ struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns;
+ struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns;
+ for (;;) {
+ if (u1 == u2)
+ return 1;
+ if (u1 == &init_user_ns)
+ return 0;
+ u1 = u1->creator->user_ns;
+ }
+ /* We never get here */
+ return 0;
+}
diff --git a/security/commoncap.c b/security/commoncap.c
index 9d910e6..8fa4285 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -130,18 +130,34 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
* @child: The process to be accessed
* @mode: The mode of attachment.
*
+ * If we are in the same or an ancestor user_ns and have all the target
+ * task's capabilities, then ptrace access is allowed.
+ * If we have the ptrace capability to the target user_ns, then ptrace
+ * access is allowed.
+ * Else denied.
+ *
* Determine whether a process may access another, returning 0 if permission
* granted, -ve if denied.
*/
int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
{
int ret = 0;
+ struct cred *cred, *tcred;

rcu_read_lock();
- if (!cap_issubset(__task_cred(child)->cap_permitted,
- current_cred()->cap_permitted) &&
- !capable(CAP_SYS_PTRACE))
- ret = -EPERM;
+ cred = current_cred();
+ tcred = __task_cred(child);
+ /*
+ * The ancestor user_ns check may be gratuitous, as I think
+ * we've already guaranteed that in kernel/ptrace.c.
+ */
+ if (same_or_ancestor_user_ns(current, child) &&
+ cap_issubset(tcred->cap_permitted, cred->cap_permitted))
+ goto out;
+ if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE))
+ goto out;
+ ret = -EPERM;
+out:
rcu_read_unlock();
return ret;
}
@@ -150,18 +166,34 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
* cap_ptrace_traceme - Determine whether another process may trace the current
* @parent: The task proposed to be the tracer
*
+ * If parent is in the same or an ancestor user_ns and has all current's
+ * capabilities, then ptrace access is allowed.
+ * If parent has the ptrace capability to current's user_ns, then ptrace
+ * access is allowed.
+ * Else denied.
+ *
* Determine whether the nominated task is permitted to trace the current
* process, returning 0 if permission is granted, -ve if denied.
*/
int cap_ptrace_traceme(struct task_struct *parent)
{
int ret = 0;
+ struct cred *cred, *tcred;

rcu_read_lock();
- if (!cap_issubset(current_cred()->cap_permitted,
- __task_cred(parent)->cap_permitted) &&
- !has_capability(parent, CAP_SYS_PTRACE))
- ret = -EPERM;
+ cred = __task_cred(parent);
+ tcred = current_cred();
+ /*
+ * The ancestor user_ns check may be gratuitous, as I think
+ * we've already guaranteed that in kernel/ptrace.c.
+ */
+ if (same_or_ancestor_user_ns(parent, current) &&
+ cap_issubset(tcred->cap_permitted, cred->cap_permitted))
+ goto out;
+ if (has_ns_capability(parent, tcred->user->user_ns, CAP_SYS_PTRACE))
+ goto out;
+ ret = -EPERM;
+out:
rcu_read_unlock();
return ret;
}
--
1.7.0.4