LinuxLists.cc - [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns

2013-07-23 18:16:14

Subject: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

We allow a task to change its own devices cgroup, or to change other tasks'
cgroups if it has CAP_SYS_ADMIN.

Also allow task A to change task B's cgroup if task A has CAP_SYS_ADMIN
with respect to task B - meaning A is root in the same userns, or A
created B's userns.

Signed-off-by: Serge Hallyn <[email protected]>
---
security/device_cgroup.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 9760ecb6..8f5386ea 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -74,11 +74,14 @@ struct cgroup_subsys devices_subsys;
static int devcgroup_can_attach(struct cgroup *new_cgrp,
struct cgroup_taskset *set)
{
- struct task_struct *task = cgroup_taskset_first(set);
+ struct task_struct *t = cgroup_taskset_first(set);
+ int ret = 0;

- if (current != task && !capable(CAP_SYS_ADMIN))
- return -EPERM;
- return 0;
+ rcu_read_lock();
+ if (current != t && !ns_capable(__task_cred(t)->user_ns, CAP_SYS_ADMIN))
+ ret = -EPERM;
+ rcu_read_unlock();
+ return ret;
}

/*
--
1.7.9.5

2013-07-23 18:19:08

by Serge Hallyn

[permalink] [raw]

Subject: [RFC PATCH 2/2] capabilities: allow nice if we are privileged

We allow task A to change B's nice level if it has a supserset of
B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has
CAP_SYS_NICE with respect to B - meaning it is root in the same
namespace, or it created B's namespace.

Signed-off-by: Serge Hallyn <[email protected]>
---
security/commoncap.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index d78b003..ef98b56 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -768,16 +768,16 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
*/
static int cap_safe_nice(struct task_struct *p)
{
- int is_subset;
+ int is_subset, ret = 0;

rcu_read_lock();
is_subset = cap_issubset(__task_cred(p)->cap_permitted,
current_cred()->cap_permitted);
+ if (!is_subset && !ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE))
+ ret = -EPERM;
rcu_read_unlock();

- if (!is_subset && !capable(CAP_SYS_NICE))
- return -EPERM;
- return 0;
+ return ret;
}

/**
--
1.7.9.5

2013-07-23 18:30:28

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

On Tue, Jul 23, 2013 at 01:16:06PM -0500, Serge Hallyn wrote:
> We allow a task to change its own devices cgroup, or to change other tasks'
> cgroups if it has CAP_SYS_ADMIN.
>
> Also allow task A to change task B's cgroup if task A has CAP_SYS_ADMIN
> with respect to task B - meaning A is root in the same userns, or A
> created B's userns.

As discussed multpile times, cgroup isn't gonna support delegating
cgroup management directly into containers, so this doesn't really
jive with where we're heading.

Thanks.

--
tejun

2013-07-23 18:39:02

by Serge Hallyn

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

Quoting Tejun Heo ([email protected]):
> On Tue, Jul 23, 2013 at 01:16:06PM -0500, Serge Hallyn wrote:
> > We allow a task to change its own devices cgroup, or to change other tasks'
> > cgroups if it has CAP_SYS_ADMIN.
> >
> > Also allow task A to change task B's cgroup if task A has CAP_SYS_ADMIN
> > with respect to task B - meaning A is root in the same userns, or A
> > created B's userns.
>
> As discussed multpile times, cgroup isn't gonna support delegating
> cgroup management directly into containers, so this doesn't really
> jive with where we're heading.

This doesn't delegate it into the container. It allows me, on the host,
to set the cgroup for a container.

thanks,
-serge

2013-07-23 18:50:24

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <[email protected]> wrote:
> This doesn't delegate it into the container. It allows me, on the host,
> to set the cgroup for a container.

Hmmm? I'm a bit confused. Isn't the description saying that the patch
allows pseudo-root in userns to change cgroup membership even if it
isn't actually root?

Besides, I find the whole check rather bogus and would actually much
prefer just nuking the check and just follow the standard permission
checks.

Thanks.

--
tejun

2013-07-23 19:04:36

by Serge Hallyn

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

2013-07-23 19:12:51

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

Hello,

On Tue, Jul 23, 2013 at 02:04:26PM -0500, Serge Hallyn wrote:
> If task A is uid 1000 on the host, and creates task B as uid X in a new
> user namespace, then task A, still being uid 1000 on the host, is
> privileged with respect to B and his namespace - i.e.
> ns_capable(B->userns, CAP_SYS_ADMIN) is true.

Well, that also is the exact type of priv delegation we're moving away
from, so....

> > Besides, I find the whole check rather bogus and would actually much
> > prefer just nuking the check and just follow the standard permission
> > checks.
>
> I'd be ok with that - but there's one case I'm not sure about: If PAM
> sets me up with /sys/fs/cgroup/devices/serge owned by me, then if I'm
> thinking right, removing can_attach would mean I could move init into
> /sys/fs/cgroup/devices/serge...
>
> Is there something else stopping that from happening?

If PAM is giving out perms on cgroup directory, the whole system is
prone to DoS in various ways anyway. It's already utterly broken, so
kinda moot point. If there are people actually doing that in the
wild, we can conditionalize it on cgroup_sane_behavior().

Thanks.

--
tejun

2013-07-23 19:28:11

by Serge Hallyn

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

Quoting Tejun Heo ([email protected]):
> Hello,
>
> On Tue, Jul 23, 2013 at 02:04:26PM -0500, Serge Hallyn wrote:
> > If task A is uid 1000 on the host, and creates task B as uid X in a new
> > user namespace, then task A, still being uid 1000 on the host, is
> > privileged with respect to B and his namespace - i.e.
> > ns_capable(B->userns, CAP_SYS_ADMIN) is true.
>
> Well, that also is the exact type of priv delegation we're moving away
> from, so....

I think that's unreasonable, but I guess I'll have to go reread the
old thread.

> > > Besides, I find the whole check rather bogus and would actually much
> > > prefer just nuking the check and just follow the standard permission
> > > checks.
> >
> > I'd be ok with that - but there's one case I'm not sure about: If PAM
> > sets me up with /sys/fs/cgroup/devices/serge owned by me, then if I'm
> > thinking right, removing can_attach would mean I could move init into
> > /sys/fs/cgroup/devices/serge...
> >
> > Is there something else stopping that from happening?
>
> If PAM is giving out perms on cgroup directory, the whole system is
> prone to DoS in various ways anyway. It's already utterly broken, so

If we have decent enforcement of hierarchy for devices.{allow,deny},
which we now do, then I don't see why this has to be the case.

> kinda moot point. If there are people actually doing that in the
> wild, we can conditionalize it on cgroup_sane_behavior().

Guess we'll stop using cgroups for now.

-serge

2013-07-23 19:39:15

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

Hello, Serge.

On Tue, Jul 23, 2013 at 3:28 PM, Serge Hallyn <[email protected]> wrote:
>> Well, that also is the exact type of priv delegation we're moving away
>> from, so....
>
> I think that's unreasonable, but I guess I'll have to go reread the
> old thread.

Yeah, please do. I think the case is pretty strong for disallowing
delegation of cgroup directories to !root (or whatever CAP it should
be) users. It's inherently unsafe for some controllers and ends up
leaking kernel implementation details into regular binaries thus
cementing those leaked details as APIs, which is a giant no-no.

> If we have decent enforcement of hierarchy for devices.{allow,deny},
> which we now do, then I don't see why this has to be the case.

If you think about devcg in isolation, maybe, but please keep in mind
that devcg itself is already somewhat abusing cgroup and many other
controllers shouldn't be allowed to delegate and we're headed for one
unified hierarchy.

>> kinda moot point. If there are people actually doing that in the
>> wild, we can conditionalize it on cgroup_sane_behavior().
>
> Guess we'll stop using cgroups for now.

If you're delegating cgroup accesses to !root users, yes, STOP,
please. It'd be doing a lot more harm to the whole kernel and its
maintainability than whatever extra features / benefits it may be
bringing.

Thanks.

--
tejun

2013-07-24 08:08:15

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] capabilities: allow nice if we are privileged

Serge Hallyn <[email protected]> writes:

> We allow task A to change B's nice level if it has a supserset of
> B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has
> CAP_SYS_NICE with respect to B - meaning it is root in the same
> namespace, or it created B's namespace.

This patch looks good to me.

We already have this logic elsewhere in the kernel so I don't expect
this will make things worse.

Reviewed-by: "Eric W. Biederman" <[email protected]>

> Signed-off-by: Serge Hallyn <[email protected]>
> ---
> security/commoncap.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index d78b003..ef98b56 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -768,16 +768,16 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
> */
> static int cap_safe_nice(struct task_struct *p)
> {
> - int is_subset;
> + int is_subset, ret = 0;
>
> rcu_read_lock();
> is_subset = cap_issubset(__task_cred(p)->cap_permitted,
> current_cred()->cap_permitted);
> + if (!is_subset && !ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE))
> + ret = -EPERM;
> rcu_read_unlock();
>
> - if (!is_subset && !capable(CAP_SYS_NICE))
> - return -EPERM;
> - return 0;
> + return ret;
> }
>
> /**

2013-10-23 00:41:37

by Serge E. Hallyn

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

Quoting Tejun Heo ([email protected]):
> On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <[email protected]> wrote:
> > This doesn't delegate it into the container. It allows me, on the host,
> > to set the cgroup for a container.
>
> Hmmm? I'm a bit confused. Isn't the description saying that the patch
> allows pseudo-root in userns to change cgroup membership even if it
> isn't actually root?
>
> Besides, I find the whole check rather bogus and would actually much
> prefer just nuking the check and just follow the standard permission
> checks.

Can we please nuke it like this then?

>From b840083ec8fa1f0645ae925c79db3dc51edd019c Mon Sep 17 00:00:00 2001
From: Serge Hallyn <[email protected]>
Date: Wed, 23 Oct 2013 01:34:00 +0200
Subject: [PATCH 1/1] device_cgroup: remove can_attach

It is really only wanting to duplicate a check which is already done by the
cgroup subsystem.

With this patch, user jdoe still cannot move pid 1 into a devices cgroup
he owns, but now he can move his own other tasks into devices cgroups.

Signed-off-by: Serge Hallyn <[email protected]>
Cc: Aristeu Rozanski <[email protected]>
Cc: Tejun Heo <[email protected]>
---
security/device_cgroup.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index efc6f68..a37c054 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -64,16 +64,6 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)

struct cgroup_subsys devices_subsys;

-static int devcgroup_can_attach(struct cgroup_subsys_state *new_css,
- struct cgroup_taskset *set)
-{
- struct task_struct *task = cgroup_taskset_first(set);
-
- if (current != task && !capable(CAP_SYS_ADMIN))
- return -EPERM;
- return 0;
-}
-
/*
* called under devcgroup_mutex
*/
@@ -698,7 +688,6 @@ static struct cftype dev_cgroup_files[] = {

struct cgroup_subsys devices_subsys = {
.name = "devices",
- .can_attach = devcgroup_can_attach,
.css_alloc = devcgroup_css_alloc,
.css_free = devcgroup_css_free,
.css_online = devcgroup_online,
--
1.8.3.2

2013-10-24 10:57:52

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] devices cgroup: allow can_attach() if ns_capable

On Wed, Oct 23, 2013 at 12:41:30AM +0000, Serge E. Hallyn wrote:
> Quoting Tejun Heo ([email protected]):
> > On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <[email protected]> wrote:
> > > This doesn't delegate it into the container. It allows me, on the host,
> > > to set the cgroup for a container.
> >
> > Hmmm? I'm a bit confused. Isn't the description saying that the patch
> > allows pseudo-root in userns to change cgroup membership even if it
> > isn't actually root?
> >
> > Besides, I find the whole check rather bogus and would actually much
> > prefer just nuking the check and just follow the standard permission
> > checks.
>
> Can we please nuke it like this then?
>
> From b840083ec8fa1f0645ae925c79db3dc51edd019c Mon Sep 17 00:00:00 2001
> From: Serge Hallyn <[email protected]>
> Date: Wed, 23 Oct 2013 01:34:00 +0200
> Subject: [PATCH 1/1] device_cgroup: remove can_attach
>
> It is really only wanting to duplicate a check which is already done by the
> cgroup subsystem.
>
> With this patch, user jdoe still cannot move pid 1 into a devices cgroup
> he owns, but now he can move his own other tasks into devices cgroups.
>
> Signed-off-by: Serge Hallyn <[email protected]>
> Cc: Aristeu Rozanski <[email protected]>
> Cc: Tejun Heo <[email protected]>

Applied to cgroup/for-3.13. Thanks.

--
tejun