2016-10-04 04:41:37

by John Stultz

[permalink] [raw]
Subject: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

As a heads up, this is just a first RFC and not a submission.

I wanted to send this out again, as the last time I submitted this
(https://marc.info/?l=linux-kernel&m=143217972215192&w=2) the
discussion got out into the separate issue of how Android at one
time abused memcg (but I believe now memcg is no longer used).

So for this revision, I've removed any memcg usage so we can try
to focus on just the actively used cpuset and cpuctrl cgroups.

Android currently loosens the cgroup attchment permissions, allowing
tasks with CAP_SYS_NICE to be able to allow tasks to move arbitrary
tasks across cgroups.

Android currently uses cgroups to bound tasks in various states (ie:
foreground applications, background applications, audio application,
system audio, and system tasks), to specific cpus as well as to limit
cpu time. This allows for things like audio applications to be
SCHED_FIFO but not run-away hogging infinite cpu, and background task
cpu usage to be similarly cputime limited, and kept to only low-power
cpus.

The migration of a task from the foreground to background, or to
elevate a task to audio priority, may be done by system service that
does not run as root. So this patch allows processes with CAP_SYS_NICE
to be able to migrate tasks between cgroups. I suspect if there was a
specific cap (CAP_SYS_CHANGE_CGROUP) for this, it would be usable here,
but in its absence, they've overloaded CAP_SYS_NICE for this use.

At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
shows that there is a active and widely deployed use for different cgroup
attachment rules then what is currently available.

I've tried to rework the patches so this attachment policy is build
time configurable, and wanted to send them out for review so folks
might give their thoughts on this implementation and what they might
see as a better way to go about achieving the same goal.

Thoughts and feedback would be appriciated!

thanks
-john

Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Android Kernel Team <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Christian Poetzsch <[email protected]>
Cc: Amit Pundir <[email protected]>

Colin Cross (1):
cgroup: Add generic cgroup subsystem permission checks

Rom Lemarchand (1):
cgroup: Add a allow_attach policy for Android

Documentation/cgroup-v1/cgroups.txt | 9 ++++++
include/linux/cgroup-defs.h | 1 +
include/linux/cgroup.h | 16 ++++++++++
init/Kconfig | 7 +++++
kernel/cgroup.c | 61 +++++++++++++++++++++++++++++++++++--
kernel/cpuset.c | 3 ++
kernel/sched/core.c | 3 ++
7 files changed, 98 insertions(+), 2 deletions(-)

--
1.9.1


2016-10-04 04:41:40

by John Stultz

[permalink] [raw]
Subject: [PATCH 2/2] cgroup: Add a allow_attach policy for Android

From: Rom Lemarchand <[email protected]>

If CONFIG_CGROUP_NICE_ATTACH is enabled, this implements an
allow_attach policy for Android, which allows any process with
CAP_SYS_NICE to move tasks across cpuset and cpuctrl cgroups.

This includes folded down fixes from:
Dmitry Shmidt <[email protected]>
Riley Andrews <[email protected]>
Amit Pundir <[email protected]>

Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Android Kernel Team <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Christian Poetzsch <[email protected]>
Cc: Amit Pundir <[email protected]>
Signed-off-by: Rom Lemarchand <[email protected]>
[jstultz: Majorly reworked to make this policy function
configurable, and folded in fixes]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/cgroup.h | 16 ++++++++++++++++
init/Kconfig | 8 ++++++++
kernel/cgroup.c | 22 ++++++++++++++++++++++
kernel/cpuset.c | 3 +++
kernel/sched/core.c | 3 +++
5 files changed, 52 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 984f73b..eab4311 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -554,6 +554,17 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}

+#ifdef CONFIG_CGROUP_NICE_ATTACH
+/*
+ * Default Android check for whether the current process is allowed to move a
+ * task across cgroups, either because CAP_SYS_NICE is set or because the uid
+ * of the calling process is the same as the moved task or because we are
+ * running as root.
+ * Returns 0 if this is allowed, or -EACCES otherwise.
+ */
+int cgroup_nice_allow_attach(struct cgroup_taskset *tset);
+#endif
+
#else /* !CONFIG_CGROUPS */

struct cgroup_subsys_state;
@@ -647,6 +658,11 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
return old_ns;
}

+static inline int subsys_cgroup_allow_attach(void *tset)
+{
+ return -EINVAL;
+}
+
#endif /* !CONFIG_CGROUPS */

static inline void get_cgroup_ns(struct cgroup_namespace *ns)
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..c000734 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1021,6 +1021,14 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

+config CGROUP_NICE_ATTACH
+ bool "Enabled Android-style loosening of perm checks for attachment"
+ default n
+ ---help---
+ Allows non-root processes to add arbitrary processes to cpuset and
+ cpuctrl cgroups if they have CAP_SYS_NICE set. This is useful for
+ Android.
+
config CGROUP_WRITEBACK
bool
depends on MEMCG && BLK_CGROUP
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e6afe2d..a53f0be 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2833,6 +2833,28 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
return ret;
}

+#ifdef CONFIG_CGROUP_NICE_ATTACH
+int cgroup_nice_allow_attach(struct cgroup_taskset *tset)
+{
+ const struct cred *cred = current_cred(), *tcred;
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+
+ if (capable(CAP_SYS_NICE))
+ return 0;
+
+ cgroup_taskset_for_each(task, css, tset) {
+ tcred = __task_cred(task);
+
+ if (current != task && !uid_eq(cred->euid, tcred->uid) &&
+ !uid_eq(cred->euid, tcred->suid))
+ return -EACCES;
+ }
+
+ return 0;
+}
+#endif
+
static int cgroup_allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2b4c20a..87aede4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2100,6 +2100,9 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.css_offline = cpuset_css_offline,
.css_free = cpuset_css_free,
.can_attach = cpuset_can_attach,
+#ifdef CONFIG_CGROUP_NICE_ATTACH
+ .allow_attach = cgroup_nice_allow_attach,
+#endif
.cancel_attach = cpuset_cancel_attach,
.attach = cpuset_attach,
.post_attach = cpuset_post_attach,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44817c6..5573505 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8657,6 +8657,9 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
+#ifdef CONFIG_CGROUP_NICE_ATTACH
+ .allow_attach = cgroup_nice_allow_attach,
+#endif
.legacy_cftypes = cpu_files,
.early_init = true,
};
--
1.9.1

2016-10-04 04:42:01

by John Stultz

[permalink] [raw]
Subject: [PATCH 1/2] cgroup: Add generic cgroup subsystem permission checks

From: Colin Cross <[email protected]>

Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup, move permission checks into each specific
cgroup subsystem. If a subsystem does not specify a 'allow_attach'
handler, then we fall back to doing the checks the old way.

This patch adds a 'allow_attach' handler instead of reusing the
'can_attach' handler, since if the 'can_attach' handler was
reused, a new cgroup that implements 'can_attach' but not the
permission checks could end up with no permission checks at all.

This also includes folded down fixes from:
Christian Poetzsch <[email protected]>
Amit Pundir <[email protected]>

Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Android Kernel Team <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Dmitry Shmidt <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Christian Poetzsch <[email protected]>
Cc: Amit Pundir <[email protected]>
Original-Author: San Mehat <[email protected]>
Signed-off-by: Colin Cross <[email protected]>
[jstultz: Rewording of commit message, folded down fixes]
Signed-off-by: John Stultz <[email protected]>
---
Documentation/cgroup-v1/cgroups.txt | 9 +++++++++
include/linux/cgroup-defs.h | 1 +
kernel/cgroup.c | 40 +++++++++++++++++++++++++++++++++++--
3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v1/cgroups.txt b/Documentation/cgroup-v1/cgroups.txt
index 308e5ff..295f026 100644
--- a/Documentation/cgroup-v1/cgroups.txt
+++ b/Documentation/cgroup-v1/cgroups.txt
@@ -578,6 +578,15 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
be called for a newly-created cgroup if an error occurs after this
subsystem's create() method has been called for the new cgroup).

+int allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+(cgroup_mutex held by caller)
+
+Called prior to moving a task into a cgroup; if the subsystem
+returns an error, this will abort the attach operation. Used
+to extend the permission checks - if all subsystems in a cgroup
+return 0, the attach will be allowed to proceed, even if the
+default permission check (root or same user) fails.
+
int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..0f4548c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -441,6 +441,7 @@ struct cgroup_subsys {
void (*css_free)(struct cgroup_subsys_state *css);
void (*css_reset)(struct cgroup_subsys_state *css);

+ int (*allow_attach)(struct cgroup_taskset *tset);
int (*can_attach)(struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup_taskset *tset);
void (*attach)(struct cgroup_taskset *tset);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d6b729b..e6afe2d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2833,6 +2833,25 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
return ret;
}

+static int cgroup_allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ int i;
+ int ret;
+
+ for_each_css(css, i, cgrp) {
+ if (css->ss->allow_attach) {
+ ret = css->ss->allow_attach(tset);
+ if (ret)
+ return ret;
+ } else {
+ return -EACCES;
+ }
+ }
+
+ return 0;
+}
+
static int cgroup_procs_write_permission(struct task_struct *task,
struct cgroup *dst_cgrp,
struct kernfs_open_file *of)
@@ -2847,8 +2866,25 @@ static int cgroup_procs_write_permission(struct task_struct *task,
*/
if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
!uid_eq(cred->euid, tcred->uid) &&
- !uid_eq(cred->euid, tcred->suid))
- ret = -EACCES;
+ !uid_eq(cred->euid, tcred->suid)) {
+ /*
+ * if the default permission check fails, give each
+ * cgroup a chance to extend the permission check
+ */
+ struct cgroup_taskset tset = {
+ .src_csets = LIST_HEAD_INIT(tset.src_csets),
+ .dst_csets = LIST_HEAD_INIT(tset.dst_csets),
+ .csets = &tset.src_csets,
+ };
+ struct css_set *cset;
+
+ cset = task_css_set(task);
+ list_add(&cset->mg_node, &tset.src_csets);
+ ret = cgroup_allow_attach(dst_cgrp, &tset);
+ list_del(&tset.src_csets);
+ if (ret)
+ ret = -EACCES;
+ }

if (!ret && cgroup_on_dfl(dst_cgrp)) {
struct super_block *sb = of->file->f_path.dentry->d_sb;
--
1.9.1

2016-10-04 16:16:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

Hello, John.

On Mon, Oct 03, 2016 at 09:41:28PM -0700, John Stultz wrote:
> The migration of a task from the foreground to background, or to
> elevate a task to audio priority, may be done by system service that
> does not run as root. So this patch allows processes with CAP_SYS_NICE
> to be able to migrate tasks between cgroups. I suspect if there was a
> specific cap (CAP_SYS_CHANGE_CGROUP) for this, it would be usable here,
> but in its absence, they've overloaded CAP_SYS_NICE for this use.

CAP_SYS_RESOURCE won't do?

> At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
> shows that there is a active and widely deployed use for different cgroup
> attachment rules then what is currently available.

I'm curious who issues these migrations. Is that restricted to
certain uids? If so, would it work for android if cgroupfs supports
ACL so that those uids can be approved via setfacl? That'd be an a
lot more generic approach.

Thanks.

--
tejun

2016-10-04 18:01:15

by John Stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

On Tue, Oct 4, 2016 at 9:16 AM, Tejun Heo <[email protected]> wrote:
> On Mon, Oct 03, 2016 at 09:41:28PM -0700, John Stultz wrote:
>> The migration of a task from the foreground to background, or to
>> elevate a task to audio priority, may be done by system service that
>> does not run as root. So this patch allows processes with CAP_SYS_NICE
>> to be able to migrate tasks between cgroups. I suspect if there was a
>> specific cap (CAP_SYS_CHANGE_CGROUP) for this, it would be usable here,
>> but in its absence, they've overloaded CAP_SYS_NICE for this use.
>
> CAP_SYS_RESOURCE won't do?
>
>> At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
>> shows that there is a active and widely deployed use for different cgroup
>> attachment rules then what is currently available.
>
> I'm curious who issues these migrations.

The system_server process via the sched_policy logic:
http://androidxref.com/7.0.0_r1/xref/system/core/libcutils/sched_policy.c#274

See set_cpuset_policy() and set_sched_policy().

> Is that restricted to
> certain uids? If so, would it work for android if cgroupfs supports
> ACL so that those uids can be approved via setfacl? That'd be an a
> lot more generic approach.

So tasks might move themselves in some cases to specific groups, but
mostly its controlled by the system_server.

So to make sure I understand your suggestion, you're suggesting the
cgroupfs files like:
cpuctrl/tasks,
cpuctrl/bg_non_interactive/tasks,
cpuset/foreground/tasks,
cpuset/background/tasks,
etc
use ACL permissions to specify the specific uids that can write to
them? I guess this would be conceptually similar to just setting the
owner to the system task, no? Though I'm not sure that would be
sufficient since it would still fail the
cgroup_procs_write_permission() checks. Or are you suggesting we add
extra logic to make the file owner uid as sufficient to change other
tasks?

thanks
-john

2016-10-04 18:03:46

by John Stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

On Tue, Oct 4, 2016 at 9:16 AM, Tejun Heo <[email protected]> wrote:
> On Mon, Oct 03, 2016 at 09:41:28PM -0700, John Stultz wrote:
>> The migration of a task from the foreground to background, or to
>> elevate a task to audio priority, may be done by system service that
>> does not run as root. So this patch allows processes with CAP_SYS_NICE
>> to be able to migrate tasks between cgroups. I suspect if there was a
>> specific cap (CAP_SYS_CHANGE_CGROUP) for this, it would be usable here,
>> but in its absence, they've overloaded CAP_SYS_NICE for this use.
>
> CAP_SYS_RESOURCE won't do?

Oh, and I'll have to look into this one to see what CAP_SYS_RESOURCE
actually allows. We ran into trouble in the past changing the CAPs
required to something higher-level, which then causes trouble because
they want to avoid elevating permissions on system tasks and keep them
as restricted as possible.

thans
-john

2016-10-04 19:38:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

Hello, John.

On Tue, Oct 04, 2016 at 11:01:12AM -0700, John Stultz wrote:
> So to make sure I understand your suggestion, you're suggesting the
> cgroupfs files like:
> cpuctrl/tasks,
> cpuctrl/bg_non_interactive/tasks,
> cpuset/foreground/tasks,
> cpuset/background/tasks,
> etc
> use ACL permissions to specify the specific uids that can write to
> them? I guess this would be conceptually similar to just setting the
> owner to the system task, no? Though I'm not sure that would be

Yeah, finer grained but essentially just giving write perms.

> sufficient since it would still fail the
> cgroup_procs_write_permission() checks. Or are you suggesting we add
> extra logic to make the file owner uid as sufficient to change other
> tasks?

Hah, now I'm not sure how this is supposed to work inside a userns as
it's checking against GLOBAL_ROOT_UID. cc'ing Serge. Serge, can you
please have a look?

But back on subject, yeah, I think a capability based approach is
better here too. No idea how difficult it is to add a new CAP but I
think it's worth trying. Can you please spin up a patch?

Thanks!

--
tejun

2016-10-04 19:46:28

by John Stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

On Tue, Oct 4, 2016 at 12:38 PM, Tejun Heo <[email protected]> wrote:
> Hello, John.
>
> On Tue, Oct 04, 2016 at 11:01:12AM -0700, John Stultz wrote:
>> So to make sure I understand your suggestion, you're suggesting the
>> cgroupfs files like:
>> cpuctrl/tasks,
>> cpuctrl/bg_non_interactive/tasks,
>> cpuset/foreground/tasks,
>> cpuset/background/tasks,
>> etc
>> use ACL permissions to specify the specific uids that can write to
>> them? I guess this would be conceptually similar to just setting the
>> owner to the system task, no? Though I'm not sure that would be
>
> Yeah, finer grained but essentially just giving write perms.
>
>> sufficient since it would still fail the
>> cgroup_procs_write_permission() checks. Or are you suggesting we add
>> extra logic to make the file owner uid as sufficient to change other
>> tasks?
>
> Hah, now I'm not sure how this is supposed to work inside a userns as
> it's checking against GLOBAL_ROOT_UID. cc'ing Serge. Serge, can you
> please have a look?
>
> But back on subject, yeah, I think a capability based approach is
> better here too. No idea how difficult it is to add a new CAP but I
> think it's worth trying. Can you please spin up a patch?

Ok. I'll respin this introducing and using a new CAP value.

That said, while CAP_SYS_NICE seems a bit overloaded here, it doesn't
conceptually have that much friction for use with cpuset and cpuctrl
cgroups:

(from the man page: http://man7.org/linux/man-pages/man7/capabilities.7.html )
CAP_SYS_NICE
* Raise process nice value (nice(2), setpriority(2)) and
change the nice value for arbitrary processes;
* set real-time scheduling policies for calling process, and
set scheduling policies and priorities for arbitrary
processes (sched_setscheduler(2), sched_setparam(2),
shed_setattr(2));
* set CPU affinity for arbitrary processes
(sched_setaffinity(2));
* set I/O scheduling class and priority for arbitrary
processes (ioprio_set(2));
* apply migrate_pages(2) to arbitrary processes and allow
processes to be migrated to arbitrary nodes;
* apply move_pages(2) to arbitrary processes;
* use the MPOL_MF_MOVE_ALL flag with mbind(2) and
move_pages(2).

If you can tweak nice value and set realtime scheduling policy that
really seems to me just as invasive as what moving tasks between the
cpuctrl and cpuset cgroups could do.

thanks
-john

2016-10-04 19:49:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

Hello,

On Tue, Oct 04, 2016 at 12:46:24PM -0700, John Stultz wrote:
> Ok. I'll respin this introducing and using a new CAP value.
>
> That said, while CAP_SYS_NICE seems a bit overloaded here, it doesn't
> conceptually have that much friction for use with cpuset and cpuctrl
> cgroups:

We need to solve it for userns too and I wanna avoid pushing
permission logic into specific controllers. The logical extensions of
that would be protecting control interface files by different CAPs
too. It might work for some knobs and then there will all but
certainly unclear corner cases and so on. Let's please not go there.

Thanks.

--
tejun

2016-10-04 20:18:44

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

Quoting Tejun Heo ([email protected]):
> Hello, John.
>
> On Tue, Oct 04, 2016 at 11:01:12AM -0700, John Stultz wrote:
> > So to make sure I understand your suggestion, you're suggesting the
> > cgroupfs files like:
> > cpuctrl/tasks,
> > cpuctrl/bg_non_interactive/tasks,
> > cpuset/foreground/tasks,
> > cpuset/background/tasks,
> > etc
> > use ACL permissions to specify the specific uids that can write to
> > them? I guess this would be conceptually similar to just setting the
> > owner to the system task, no? Though I'm not sure that would be
>
> Yeah, finer grained but essentially just giving write perms.
>
> > sufficient since it would still fail the
> > cgroup_procs_write_permission() checks. Or are you suggesting we add
> > extra logic to make the file owner uid as sufficient to change other
> > tasks?
>
> Hah, now I'm not sure how this is supposed to work inside a userns as
> it's checking against GLOBAL_ROOT_UID. cc'ing Serge. Serge, can you
> please have a look?

Hi, thanks for the cc,

how about changing the GLOBAL_ROOT_UID check with a targeted
capability check, like

if (!ns_capable(tcred->user_ns, CAP_SYS_NICE) &&
!uid_eq(cred->euid, tcred->uid) &&
!uid_eq(cred->euid, tcred->suid))
ret = -EACCES;

where the actual capability to use may require some thought.


> But back on subject, yeah, I think a capability based approach is
> better here too. No idea how difficult it is to add a new CAP but I
> think it's worth trying. Can you please spin up a patch?
>
> Thanks!
>
> --
> tejun

2016-10-04 20:33:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

Hello, Serge.

On Tue, Oct 04, 2016 at 03:18:40PM -0500, Serge E. Hallyn wrote:
> how about changing the GLOBAL_ROOT_UID check with a targeted
> capability check, like
>
> if (!ns_capable(tcred->user_ns, CAP_SYS_NICE) &&
> !uid_eq(cred->euid, tcred->uid) &&
> !uid_eq(cred->euid, tcred->suid))
> ret = -EACCES;
>
> where the actual capability to use may require some thought.

Yeah, that's the direction I'm thinking too. We can't use
CAP_SYS_NICE in general tho. Let's see if a dedicated CAP sticks.

Thanks.

--
tejun

2016-10-04 21:26:46

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

Quoting Tejun Heo ([email protected]):
> Hello, Serge.
>
> On Tue, Oct 04, 2016 at 03:18:40PM -0500, Serge E. Hallyn wrote:
> > how about changing the GLOBAL_ROOT_UID check with a targeted
> > capability check, like
> >
> > if (!ns_capable(tcred->user_ns, CAP_SYS_NICE) &&
> > !uid_eq(cred->euid, tcred->uid) &&
> > !uid_eq(cred->euid, tcred->suid))
> > ret = -EACCES;
> >
> > where the actual capability to use may require some thought.
>
> Yeah, that's the direction I'm thinking too. We can't use
> CAP_SYS_NICE in general tho. Let's see if a dedicated CAP sticks.

One possibility would be to let each cgroup subsystem define
a move_caps capability mask which is required over the target
task. And add a new CAP_CGROUP which always suffices?

2016-10-04 21:29:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Another pass at Android style loosening of cgroup attach permissions

On Tue, Oct 04, 2016 at 04:26:43PM -0500, Serge E. Hallyn wrote:
> Quoting Tejun Heo ([email protected]):
> > Hello, Serge.
> >
> > On Tue, Oct 04, 2016 at 03:18:40PM -0500, Serge E. Hallyn wrote:
> > > how about changing the GLOBAL_ROOT_UID check with a targeted
> > > capability check, like
> > >
> > > if (!ns_capable(tcred->user_ns, CAP_SYS_NICE) &&
> > > !uid_eq(cred->euid, tcred->uid) &&
> > > !uid_eq(cred->euid, tcred->suid))
> > > ret = -EACCES;
> > >
> > > where the actual capability to use may require some thought.
> >
> > Yeah, that's the direction I'm thinking too. We can't use
> > CAP_SYS_NICE in general tho. Let's see if a dedicated CAP sticks.
>
> One possibility would be to let each cgroup subsystem define
> a move_caps capability mask which is required over the target
> task. And add a new CAP_CGROUP which always suffices?

As I wrote in another reply, I really don't wanna do that. It brings
in the question about control knob permissions too and makes the
permission checks a lot more difficult to predit. I'd much rather
just get rid of the extra checks, at least on the v2 hierarchy. The
extra checks are protecting against pulling in random processes into a
delegated subtree and v2 hierarchy already has a protection against
that.

Thanks.

--
tejun

2016-10-05 19:09:16

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 1/2] cgroup: Add generic cgroup subsystem permission checks

[ Some comments are form Ricky Zhou <[email protected]>, some from
myself ]

On Mon, Oct 03, 2016 at 09:41:29PM -0700, John Stultz wrote:
> From: Colin Cross <[email protected]>
>
> Rather than using explicit euid == 0 checks when trying to move
> tasks into a cgroup, move permission checks into each specific
> cgroup subsystem. If a subsystem does not specify a 'allow_attach'
> handler, then we fall back to doing the checks the old way.
>
> This patch adds a 'allow_attach' handler instead of reusing the
> 'can_attach' handler, since if the 'can_attach' handler was
> reused, a new cgroup that implements 'can_attach' but not the
> permission checks could end up with no permission checks at all.
>
> This also includes folded down fixes from:
> Christian Poetzsch <[email protected]>
> Amit Pundir <[email protected]>
>
> Cc: Tejun Heo <[email protected]>
> Cc: Li Zefan <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: [email protected]
> Cc: Android Kernel Team <[email protected]>
> Cc: Rom Lemarchand <[email protected]>
> Cc: Colin Cross <[email protected]>
> Cc: Dmitry Shmidt <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Christian Poetzsch <[email protected]>
> Cc: Amit Pundir <[email protected]>
> Original-Author: San Mehat <[email protected]>
> Signed-off-by: Colin Cross <[email protected]>
> [jstultz: Rewording of commit message, folded down fixes]
> Signed-off-by: John Stultz <[email protected]>
> ---
> Documentation/cgroup-v1/cgroups.txt | 9 +++++++++
> include/linux/cgroup-defs.h | 1 +
> kernel/cgroup.c | 40 +++++++++++++++++++++++++++++++++++--
> 3 files changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/cgroup-v1/cgroups.txt b/Documentation/cgroup-v1/cgroups.txt
> index 308e5ff..295f026 100644
> --- a/Documentation/cgroup-v1/cgroups.txt
> +++ b/Documentation/cgroup-v1/cgroups.txt
> @@ -578,6 +578,15 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
> be called for a newly-created cgroup if an error occurs after this
> subsystem's create() method has been called for the new cgroup).
>
> +int allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
> +(cgroup_mutex held by caller)
> +
> +Called prior to moving a task into a cgroup; if the subsystem
> +returns an error, this will abort the attach operation. Used
> +to extend the permission checks - if all subsystems in a cgroup
> +return 0, the attach will be allowed to proceed, even if the
> +default permission check (root or same user) fails.
> +
> int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
> (cgroup_mutex held by caller)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 5b17de6..0f4548c 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -441,6 +441,7 @@ struct cgroup_subsys {
> void (*css_free)(struct cgroup_subsys_state *css);
> void (*css_reset)(struct cgroup_subsys_state *css);
>
> + int (*allow_attach)(struct cgroup_taskset *tset);
> int (*can_attach)(struct cgroup_taskset *tset);
> void (*cancel_attach)(struct cgroup_taskset *tset);
> void (*attach)(struct cgroup_taskset *tset);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index d6b729b..e6afe2d 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2833,6 +2833,25 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
> return ret;
> }
>
> +static int cgroup_allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
> +{
> + struct cgroup_subsys_state *css;
> + int i;
> + int ret;
> +
> + for_each_css(css, i, cgrp) {
> + if (css->ss->allow_attach) {
> + ret = css->ss->allow_attach(tset);
> + if (ret)
> + return ret;
> + } else {
> + return -EACCES;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int cgroup_procs_write_permission(struct task_struct *task,
> struct cgroup *dst_cgrp,
> struct kernfs_open_file *of)
> @@ -2847,8 +2866,25 @@ static int cgroup_procs_write_permission(struct task_struct *task,
> */
> if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
> !uid_eq(cred->euid, tcred->uid) &&
> - !uid_eq(cred->euid, tcred->suid))
> - ret = -EACCES;
> + !uid_eq(cred->euid, tcred->suid)) {
> + /*
> + * if the default permission check fails, give each
> + * cgroup a chance to extend the permission check
> + */
> + struct cgroup_taskset tset = {
> + .src_csets = LIST_HEAD_INIT(tset.src_csets),
> + .dst_csets = LIST_HEAD_INIT(tset.dst_csets),
> + .csets = &tset.src_csets,
> + };
> + struct css_set *cset;
> +
> + cset = task_css_set(task);

Do we need to take css_set_lock here? If not, why?

> + list_add(&cset->mg_node, &tset.src_csets);
> + ret = cgroup_allow_attach(dst_cgrp, &tset);
> + list_del(&tset.src_csets);

This should be

list_del_init(&cset->mg_node);

since you are deleting task's cset from the tset list, not other way
around. It only happen to work because there is exactly 1 member in
tset.src_csets and list_del done on it is exactly list_del_init on the
node, so you are not leaving with uncorrupted mg_node in task's cset.

> + if (ret)
> + ret = -EACCES;
> + }
>
> if (!ret && cgroup_on_dfl(dst_cgrp)) {
> struct super_block *sb = of->file->f_path.dentry->d_sb;

Isn't this, generally speaking, racy? We take current task's cset and
check if we have rights to move it over. But we do not have any locking
between check and actual move, so can the cset change between these 2
operations?

And if cset can't really change and it is only 1 task, then why do we
bother with forming taskset at all? Can we make allow_attach take just
the target task argument?

Thanks.

--
Dmitry

2016-10-05 19:11:01

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Add a allow_attach policy for Android

On Mon, Oct 03, 2016 at 09:41:30PM -0700, John Stultz wrote:
> +#ifdef CONFIG_CGROUP_NICE_ATTACH
> +int cgroup_nice_allow_attach(struct cgroup_taskset *tset)
> +{
> + const struct cred *cred = current_cred(), *tcred;
> + struct task_struct *task;
> + struct cgroup_subsys_state *css;
> +
> + if (capable(CAP_SYS_NICE))
> + return 0;
> +
> + cgroup_taskset_for_each(task, css, tset) {
> + tcred = __task_cred(task);

__task_cred() requires RCU lock (courtesy Ricky Z).

> +
> + if (current != task && !uid_eq(cred->euid, tcred->uid) &&
> + !uid_eq(cred->euid, tcred->suid))
> + return -EACCES;
> + }
> +
> + return 0;
> +}
> +#endif

Thanks.

--
Dmitry

2016-10-05 19:16:51

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 1/2] cgroup: Add generic cgroup subsystem permission checks

On Wed, Oct 5, 2016 at 12:09 PM, Dmitry Torokhov
<[email protected]> wrote:
> [ Some comments are form Ricky Zhou <[email protected]>, some from
> myself ]
> On Mon, Oct 03, 2016 at 09:41:29PM -0700, John Stultz wrote:
>> From: Colin Cross <[email protected]>
>>
[snip]
>> +
>> + cset = task_css_set(task);
>
> Do we need to take css_set_lock here? If not, why?
>
>> + list_add(&cset->mg_node, &tset.src_csets);
>> + ret = cgroup_allow_attach(dst_cgrp, &tset);
>> + list_del(&tset.src_csets);
>
> This should be
>
> list_del_init(&cset->mg_node);
>
> since you are deleting task's cset from the tset list, not other way
> around. It only happen to work because there is exactly 1 member in
> tset.src_csets and list_del done on it is exactly list_del_init on the
> node, so you are not leaving with uncorrupted mg_node in task's cset.
>
>> + if (ret)
>> + ret = -EACCES;
>> + }
>>
>> if (!ret && cgroup_on_dfl(dst_cgrp)) {
>> struct super_block *sb = of->file->f_path.dentry->d_sb;
>
> Isn't this, generally speaking, racy? We take current task's cset and
> check if we have rights to move it over. But we do not have any locking
> between check and actual move, so can the cset change between these 2
> operations?
>
> And if cset can't really change and it is only 1 task, then why do we
> bother with forming taskset at all? Can we make allow_attach take just
> the target task argument?

After Tejun's feedback, I've tried reworking the same functionality in
a much simpler fashion by introducing a new capability bit.
https://lkml.org/lkml/2016/10/4/479

I believe that approach doesn't have the drawbacks you've pointed out
here, but would appreciate your input on it.

As for your feedback on this patch, I'll have to look into it a bit,
as I don't have good answers for you for you right off. But these do
seem like valid concerns and since the Android common.git kernels are
using the code I submitted here, this issues likely need to be fixed
there.

thanks
-john

2016-10-05 19:18:21

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Add a allow_attach policy for Android

On Wed, Oct 5, 2016 at 12:10 PM, Dmitry Torokhov
<[email protected]> wrote:
> On Mon, Oct 03, 2016 at 09:41:30PM -0700, John Stultz wrote:
>> +#ifdef CONFIG_CGROUP_NICE_ATTACH
>> +int cgroup_nice_allow_attach(struct cgroup_taskset *tset)
>> +{
>> + const struct cred *cred = current_cred(), *tcred;
>> + struct task_struct *task;
>> + struct cgroup_subsys_state *css;
>> +
>> + if (capable(CAP_SYS_NICE))
>> + return 0;
>> +
>> + cgroup_taskset_for_each(task, css, tset) {
>> + tcred = __task_cred(task);
>
> __task_cred() requires RCU lock (courtesy Ricky Z).

Again, hopefully this isn't an issue with the new approach, but for
the short term I'll see if we can get this fixed in the android tree.

Thanks again for your careful review!
-john

2016-10-05 19:23:38

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 1/2] cgroup: Add generic cgroup subsystem permission checks

On Wed, Oct 5, 2016 at 12:16 PM, John Stultz <[email protected]> wrote:
> On Wed, Oct 5, 2016 at 12:09 PM, Dmitry Torokhov
> <[email protected]> wrote:
>> [ Some comments are form Ricky Zhou <[email protected]>, some from
>> myself ]
>> On Mon, Oct 03, 2016 at 09:41:29PM -0700, John Stultz wrote:
>>> From: Colin Cross <[email protected]>
>>>
> [snip]
>>> +
>>> + cset = task_css_set(task);
>>
>> Do we need to take css_set_lock here? If not, why?
>>
>>> + list_add(&cset->mg_node, &tset.src_csets);
>>> + ret = cgroup_allow_attach(dst_cgrp, &tset);
>>> + list_del(&tset.src_csets);
>>
>> This should be
>>
>> list_del_init(&cset->mg_node);
>>
>> since you are deleting task's cset from the tset list, not other way
>> around. It only happen to work because there is exactly 1 member in
>> tset.src_csets and list_del done on it is exactly list_del_init on the
>> node, so you are not leaving with uncorrupted mg_node in task's cset.
>>
>>> + if (ret)
>>> + ret = -EACCES;
>>> + }
>>>
>>> if (!ret && cgroup_on_dfl(dst_cgrp)) {
>>> struct super_block *sb = of->file->f_path.dentry->d_sb;
>>
>> Isn't this, generally speaking, racy? We take current task's cset and
>> check if we have rights to move it over. But we do not have any locking
>> between check and actual move, so can the cset change between these 2
>> operations?
>>
>> And if cset can't really change and it is only 1 task, then why do we
>> bother with forming taskset at all? Can we make allow_attach take just
>> the target task argument?
>
> After Tejun's feedback, I've tried reworking the same functionality in
> a much simpler fashion by introducing a new capability bit.
> https://lkml.org/lkml/2016/10/4/479
>
> I believe that approach doesn't have the drawbacks you've pointed out
> here, but would appreciate your input on it.
>
> As for your feedback on this patch, I'll have to look into it a bit,
> as I don't have good answers for you for you right off. But these do
> seem like valid concerns and since the Android common.git kernels are
> using the code I submitted here, this issues likely need to be fixed
> there.

Yeah, we are looking into the same for ChromeOS, so we have this:

https://chromium-review.googlesource.com/c/393907/

Thanks.

--
Dmitry

2016-10-06 22:44:03

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Add a allow_attach policy for Android

On Wed, Oct 05, 2016 at 12:18:17PM -0700, John Stultz wrote:
> On Wed, Oct 5, 2016 at 12:10 PM, Dmitry Torokhov
> <[email protected]> wrote:
> > On Mon, Oct 03, 2016 at 09:41:30PM -0700, John Stultz wrote:
> >> +#ifdef CONFIG_CGROUP_NICE_ATTACH
> >> +int cgroup_nice_allow_attach(struct cgroup_taskset *tset)
> >> +{
> >> + const struct cred *cred = current_cred(), *tcred;
> >> + struct task_struct *task;
> >> + struct cgroup_subsys_state *css;
> >> +
> >> + if (capable(CAP_SYS_NICE))
> >> + return 0;
> >> +
> >> + cgroup_taskset_for_each(task, css, tset) {
> >> + tcred = __task_cred(task);
> >
> > __task_cred() requires RCU lock (courtesy Ricky Z).
>
> Again, hopefully this isn't an issue with the new approach, but for
> the short term I'll see if we can get this fixed in the android tree.
>

Actually, it should all be simply removed from there right away, as this
ends up being basically noop (but with all the locking violations and
races):

cgroup_taskset_for_each() needs tasks to be placed on cset->mg_tasks
list, but nobody does this in the ->allow_access() code path, so this
loops always executes exactly 0 times and the whole thing is exactly
equivalent of doing

return capable(CAP_SYS_NICE) ? 0 : -EACESS;

which can be done right in cgroup_procs_write_permission().

Thanks.

--
Dmitry

2016-10-06 22:53:02

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Add a allow_attach policy for Android

On Thu, Oct 06, 2016 at 03:43:51PM -0700, Dmitry Torokhov wrote:
> On Wed, Oct 05, 2016 at 12:18:17PM -0700, John Stultz wrote:
> > On Wed, Oct 5, 2016 at 12:10 PM, Dmitry Torokhov
> > <[email protected]> wrote:
> > > On Mon, Oct 03, 2016 at 09:41:30PM -0700, John Stultz wrote:
> > >> +#ifdef CONFIG_CGROUP_NICE_ATTACH
> > >> +int cgroup_nice_allow_attach(struct cgroup_taskset *tset)
> > >> +{
> > >> + const struct cred *cred = current_cred(), *tcred;
> > >> + struct task_struct *task;
> > >> + struct cgroup_subsys_state *css;
> > >> +
> > >> + if (capable(CAP_SYS_NICE))
> > >> + return 0;
> > >> +
> > >> + cgroup_taskset_for_each(task, css, tset) {
> > >> + tcred = __task_cred(task);
> > >
> > > __task_cred() requires RCU lock (courtesy Ricky Z).
> >
> > Again, hopefully this isn't an issue with the new approach, but for
> > the short term I'll see if we can get this fixed in the android tree.
> >
>
> Actually, it should all be simply removed from there right away, as this
> ends up being basically noop (but with all the locking violations and
> races):
>
> cgroup_taskset_for_each() needs tasks to be placed on cset->mg_tasks
> list, but nobody does this in the ->allow_access() code path, so this
> loops always executes exactly 0 times and the whole thing is exactly
> equivalent of doing
>
> return capable(CAP_SYS_NICE) ? 0 : -EACESS;
>
> which can be done right in cgroup_procs_write_permission().

Umm, sorry, no, it actually always returns 0, regardless even of
capabilities. Permissions are indeed being relaxed ;)

--
Dmitry