As a heads up, this is just a first RFC and not a submission.
Android currently loosens the cgroup attchment permissions, allowing
tasks with CAP_SYS_NICE to be able to allow tasks to move arbitrary
tasks across cgroups.
At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
shows that there is a active and widely deployed use for different cgroup
attachment rules then what is currently available.
I've tried to rework the patches so this attachment policy is build
time configurable, and wanted to send them out for review so folks
might give their thoughts on this implementation and what they might
see as a better way to go about achieving the same goal.
Thoughts and feedback would be appriciated!
thanks
-john
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Android Kernel Team <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Colin Cross <[email protected]>
Colin Cross (1):
cgroup: Add generic cgroup subsystem permission checks
Rom Lemarchand (1):
cgroup: Add a memcg and cpu cg allow_attach policy for Android
Documentation/cgroups/cgroups.txt | 9 +++++++++
include/linux/cgroup.h | 14 ++++++++++++++
init/Kconfig | 7 +++++++
kernel/Makefile | 1 +
kernel/cgroup.c | 39 ++++++++++++++++++++++++++++++++++++---
kernel/cgroup_nice_attach.c | 29 +++++++++++++++++++++++++++++
kernel/sched/core.c | 3 +++
mm/memcontrol.c | 3 +++
8 files changed, 102 insertions(+), 3 deletions(-)
create mode 100644 kernel/cgroup_nice_attach.c
--
1.9.1
From: Colin Cross <[email protected]>
Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup, move permission checks into each specific
cgroup subsystem. If a subsystem does not specify a 'allow_attach'
handler, then we fall back to doing the checks the old way.
This patch adds a 'allow_attach' handler instead of reusing the
'can_attach' handler, since if the 'can_attach' handler was
reused, a new cgroup that implements 'can_attach' but not the
permission checks could end up with no permission checks at all.
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Android Kernel Team <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Colin Cross <[email protected]>
Original-Author: San Mehat <[email protected]>
Signed-off-by: Colin Cross <[email protected]>
[jstultz: Rewording of commit message]
Signed-off-by: John Stultz <[email protected]>
---
Documentation/cgroups/cgroups.txt | 9 +++++++++
include/linux/cgroup.h | 2 ++
kernel/cgroup.c | 39 ++++++++++++++++++++++++++++++++++++---
3 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index f935fac..88af0f0 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -578,6 +578,15 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
be called for a newly-created cgroup if an error occurs after this
subsystem's create() method has been called for the new cgroup).
+int allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+(cgroup_mutex held by caller)
+
+Called prior to moving a task into a cgroup; if the subsystem
+returns an error, this will abort the attach operation. Used
+to extend the permission checks - if all subsystems in a cgroup
+return 0, the attach will be allowed to proceed, even if the
+default permission check (root or same user) fails.
+
int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..0ea785d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -643,6 +643,8 @@ struct cgroup_subsys {
void (*css_reset)(struct cgroup_subsys_state *css);
void (*css_e_css_changed)(struct cgroup_subsys_state *css);
+ int (*allow_attach)(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset);
int (*can_attach)(struct cgroup_subsys_state *css,
struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup_subsys_state *css,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 29a7b2c..57e0fd7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2376,6 +2376,25 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
return ret;
}
+static int cgroup_allow_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ int i;
+ int ret;
+
+ for_each_css(css, i, cgrp) {
+ if (css->ss->allow_attach) {
+ ret = css->ss->allow_attach(css, tset);
+ if (ret)
+ return ret;
+ } else {
+ return -EACCES;
+ }
+ }
+
+ return 0;
+}
+
/*
* Find the task_struct of the task to attach by vpid and pass it along to the
* function to attach either it or all tasks in its threadgroup. Will lock
@@ -2414,9 +2433,23 @@ retry_find_task:
if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
!uid_eq(cred->euid, tcred->uid) &&
!uid_eq(cred->euid, tcred->suid)) {
- rcu_read_unlock();
- ret = -EACCES;
- goto out_unlock_cgroup;
+ /*
+ * if the default permission check fails, give each
+ * cgroup a chance to extend the permission check
+ */
+ struct cgroup_taskset tset = {
+ .src_csets = LIST_HEAD_INIT(tset.src_csets),
+ .dst_csets = LIST_HEAD_INIT(tset.dst_csets),
+ .csets = &tset.src_csets,
+ };
+ struct css_set *cset;
+ cset = task_css_set(tsk);
+ list_add(&cset->mg_node, &tset.src_csets);
+ ret = cgroup_allow_attach(cgrp, &tset);
+ if (ret) {
+ rcu_read_unlock();
+ goto out_unlock_cgroup;
+ }
}
} else
tsk = current;
--
1.9.1
From: Rom Lemarchand <[email protected]>
If CONFIG_CGROUP_NICE_ATTACH is enabled, this implements an
allow_attach policy for Android, which allows any process with
CAP_SYS_NICE to move tasks across mem and cpu cgroups.
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Android Kernel Team <[email protected]>
Cc: Rom Lemarchand <[email protected]>
Cc: Colin Cross <[email protected]>
Signed-off-by: Rom Lemarchand <[email protected]>
[jstultz: Majorly reworked to make this policy function
configurable, also squished in cpu and mem cgroup enablement.]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/cgroup.h | 12 ++++++++++++
init/Kconfig | 7 +++++++
kernel/Makefile | 1 +
kernel/cgroup_nice_attach.c | 29 +++++++++++++++++++++++++++++
kernel/sched/core.c | 3 +++
mm/memcontrol.c | 3 +++
6 files changed, 55 insertions(+)
create mode 100644 kernel/cgroup_nice_attach.c
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0ea785d..d584d31 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -943,6 +943,18 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);
+#ifdef CONFIG_CGROUP_NICE_ATTACH
+/*
+ * Default Android check for whether the current process is allowed to move a
+ * task across cgroups, either because CAP_SYS_NICE is set or because the uid
+ * of the calling process is the same as the moved task or because we are
+ * running as root.
+ * Returns 0 if this is allowed, or -EACCES otherwise.
+ */
+int cgroup_nice_allow_attach(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset);
+#endif
+
#else /* !CONFIG_CGROUPS */
struct cgroup_subsys_state;
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..0e66e44 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1132,6 +1132,13 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
+config CGROUP_NICE_ATTACH
+ bool "Enabled Android-style loosening of perm checks for attachment"
+ default n
+ ---help---
+ Allows non-root processes to add arbitrary processes to mem and cpu
+ cgroups if they have CAP_SYS_NICE set. This is useful for Android.
+
endif # CGROUPS
config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..c81256b 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NICE_ATTACH) += cgroup_nice_attach.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup_nice_attach.c b/kernel/cgroup_nice_attach.c
new file mode 100644
index 0000000..b94c68e
--- /dev/null
+++ b/kernel/cgroup_nice_attach.c
@@ -0,0 +1,29 @@
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+
+/*
+ * Default Android check for whether the current process is allowed to move a
+ * task across cgroups, either because CAP_SYS_NICE is set or because the uid
+ * of the calling process is the same as the moved task or because we are
+ * running as root.
+ */
+int cgroup_nice_allow_attach(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset)
+{
+ const struct cred *cred = current_cred(), *tcred;
+ struct task_struct *task;
+
+ if (capable(CAP_SYS_NICE))
+ return 0;
+
+ cgroup_taskset_for_each(task, tset) {
+ tcred = __task_cred(task);
+
+ if (current != task && !uid_eq(cred->euid, tcred->uid) &&
+ !uid_eq(cred->euid, tcred->suid))
+ return -EACCES;
+ }
+
+ return 0;
+}
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f5..51dc86f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8368,6 +8368,9 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
+#ifdef CONFIG_CGROUP_NICE_ATTACH
+ .allow_attach = cgroup_nice_allow_attach,
+#endif
.exit = cpu_cgroup_exit,
.legacy_cftypes = cpu_files,
.early_init = 1,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b34ef4a..6287697 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5387,6 +5387,9 @@ struct cgroup_subsys memory_cgrp_subsys = {
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+#ifdef CONFIG_CGROUP_NICE_ATTACH
+ .allow_attach = cgroup_nice_allow_attach,
+#endif
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
--
1.9.1
On Wed, May 20, 2015 at 8:41 PM, John Stultz <[email protected]> wrote:
> As a heads up, this is just a first RFC and not a submission.
>
> Android currently loosens the cgroup attchment permissions, allowing
> tasks with CAP_SYS_NICE to be able to allow tasks to move arbitrary
> tasks across cgroups.
>
> At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
> shows that there is a active and widely deployed use for different cgroup
> attachment rules then what is currently available.
>
> I've tried to rework the patches so this attachment policy is build
> time configurable, and wanted to send them out for review so folks
> might give their thoughts on this implementation and what they might
> see as a better way to go about achieving the same goal.
>
> Thoughts and feedback would be appriciated!
Ping? Not sure if I hit folks at a busy time or if I didn't cc the right folks?
thanks
-john
Hello, John.
On Tue, Jun 02, 2015 at 12:07:23PM -0700, John Stultz wrote:
> On Wed, May 20, 2015 at 8:41 PM, John Stultz <[email protected]> wrote:
> > As a heads up, this is just a first RFC and not a submission.
> >
> > Android currently loosens the cgroup attchment permissions, allowing
> > tasks with CAP_SYS_NICE to be able to allow tasks to move arbitrary
> > tasks across cgroups.
> >
> > At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
> > shows that there is a active and widely deployed use for different cgroup
> > attachment rules then what is currently available.
> >
> > I've tried to rework the patches so this attachment policy is build
> > time configurable, and wanted to send them out for review so folks
> > might give their thoughts on this implementation and what they might
> > see as a better way to go about achieving the same goal.
> >
> > Thoughts and feedback would be appriciated!
>
> Ping? Not sure if I hit folks at a busy time or if I didn't cc the right folks?
Can you explain why this is necessary? android's cgroups usage is
pretty weird at this point. IIRC, it moves tasks between fg and bg
cgroups with memory immigration turned on which basically forces memcg
to walk every single page relabeling them on each fg/bg switch, which
is a pretty silly thing to do and for this android kernel also removes
a rcu sync operation which makes break things on removal of cgroups,
which android doesn't do.
memcg usage came up a while ago and there wasn't anything major which
can't be achieved (usually better) by following more standard cgroup
usage - changing knobs rather than moving tasks around.
Given that, I'm not sure this patchset makes sense for upstream.
Thanks.
--
tejun
On Tue, Jun 2, 2015 at 10:50 PM, Tejun Heo <[email protected]> wrote:
> Hello, John.
>
> On Tue, Jun 02, 2015 at 12:07:23PM -0700, John Stultz wrote:
>> On Wed, May 20, 2015 at 8:41 PM, John Stultz <[email protected]> wrote:
>> > As a heads up, this is just a first RFC and not a submission.
>> >
>> > Android currently loosens the cgroup attchment permissions, allowing
>> > tasks with CAP_SYS_NICE to be able to allow tasks to move arbitrary
>> > tasks across cgroups.
>> >
>> > At first glance, overloading CAP_SYS_NICE seems a bit hackish, but this
>> > shows that there is a active and widely deployed use for different cgroup
>> > attachment rules then what is currently available.
>> >
>> > I've tried to rework the patches so this attachment policy is build
>> > time configurable, and wanted to send them out for review so folks
>> > might give their thoughts on this implementation and what they might
>> > see as a better way to go about achieving the same goal.
>> >
>> > Thoughts and feedback would be appriciated!
>>
>> Ping? Not sure if I hit folks at a busy time or if I didn't cc the right folks?
>
> Can you explain why this is necessary? android's cgroups usage is
> pretty weird at this point. IIRC, it moves tasks between fg and bg
> cgroups with memory immigration turned on which basically forces memcg
> to walk every single page relabeling them on each fg/bg switch, which
> is a pretty silly thing to do and for this android kernel also removes
> a rcu sync operation which makes break things on removal of cgroups,
> which android doesn't do.
So Colin and Rom could likely do a better job explaining this then me,
but my understanding is that from the scheduling side, they use
cgroups to bound cpu usage of tasks in various states (ie: foreground
applications, background applications, audio application, system
audio, and system tasks). This allows for things like audio
applications to be SCHED_FIFO but not run-away hogging infinite cpu,
and background task cpu usage to be similarly limited.
The migration of a task from the foreground to background, or to
elevate a task to audio priority, may be done by system service that
does not run as root. So this patch allows processes with CAP_SYS_NICE
to be able to migrate tasks between cgroups. I suspect if there was a
specific cap (CAP_SYS_CHANGE_CGROUP) for this, it would be usable
here, but they just overloaded CAP_SYS_NICE.
One example of this:
https://android.googlesource.com/platform/frameworks/base/+/b267554/services/java/com/android/server/os/SchedulingPolicyService.java
As for the move_charge_at_immigrate, that does seem to be true, and
I'm not sure at this moment of the details (Rom, Colin?), but I
suspect it has to do with the memgc mempressure notifiers.
And at least in the current android-3.18 kernel, I don't see any rcu
sync modifications. But maybe I'm missing what you mean?
> memcg usage came up a while ago and there wasn't anything major which
> can't be achieved (usually better) by following more standard cgroup
> usage - changing knobs rather than moving tasks around.
Do you have a pointer to that discussion or maybe even just a sense of
who was involved so I can trawl the list and better understand it?
> Given that, I'm not sure this patchset makes sense for upstream.
Right, I'm not suggesting the patch go in "as-is". I just wanted to
show what Android is currently using, and start a discussion of what
would be a better approach for Android to use, or what the kernel
might need to be able to support Android's use case.
thanks
-john
On Thu, Jun 04, 2015 at 10:11:17AM -0700, John Stultz wrote:
> On Tue, Jun 2, 2015 at 10:50 PM, Tejun Heo <[email protected]> wrote:
> > memcg usage came up a while ago and there wasn't anything major which
> > can't be achieved (usually better) by following more standard cgroup
> > usage - changing knobs rather than moving tasks around.
>
> Do you have a pointer to that discussion or maybe even just a sense of
> who was involved so I can trawl the list and better understand it?
I wrote a lengthy explanation of why moving tasks between cgroups is
problematic from a memcg view: https://lkml.org/lkml/2014/12/19/358
Rather than creating super-cgroups as configuration domains and using
migration as a reconfiguration mechanism, it's much better to group
tasks per app and reconfigure the groups themselves using specific
presets for classes of apps, like foreground, background, audio.
Hello, John.
On Thu, Jun 04, 2015 at 10:11:17AM -0700, John Stultz wrote:
> And at least in the current android-3.18 kernel, I don't see any rcu
> sync modifications. But maybe I'm missing what you mean?
It was from years ago so maybe it's no longer needed anymore and
removed but there was a patch in the android kernel to remove
synchronize_rcu() somewhere to make things go faster, which was fine
for android's use case at the time but would lead to oops in other
cases. Hmmm... the upstream kernel dropped synchronize_rcu() quite a
while ago so maybe that's why it's no longer there.
> > Given that, I'm not sure this patchset makes sense for upstream.
>
> Right, I'm not suggesting the patch go in "as-is". I just wanted to
> show what Android is currently using, and start a discussion of what
> would be a better approach for Android to use, or what the kernel
> might need to be able to support Android's use case.
Sure, it'd be great if android's cgroup usage converges with other use
cases. The main problem I see is that for controllers with persistent
state (memcg), migration is either gonna be extremely expensive or
won't update existing charge state and this is a lot more of a
fundamental property rather than an incidental technical detail.
Each page requires tracking info and access to that tracking info must
be optimized for hot path memory operations - some of them get really
hot, so we can't pay much overhead there both in terms of locking and
indirection, which pretty much implies that updating the association
of each object is gonna be extremely expensive and likely fragile with
weird corner cases where things don't quite work as expected and so
on.
While currently this primarily matters only to memcg, I think it's
reasonable to assume that for any resource tracking regarding
persistent objects similar attributes are likely to be in play and as
such the general direction of cgroup development is towards mostly
static organizational structure with dynamic config update happening
through per-controller knobs.
Thanks.
--
tejun
On Thu, Jun 4, 2015 at 11:36 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, Jun 04, 2015 at 10:11:17AM -0700, John Stultz wrote:
>> On Tue, Jun 2, 2015 at 10:50 PM, Tejun Heo <[email protected]> wrote:
>> > memcg usage came up a while ago and there wasn't anything major which
>> > can't be achieved (usually better) by following more standard cgroup
>> > usage - changing knobs rather than moving tasks around.
>>
>> Do you have a pointer to that discussion or maybe even just a sense of
>> who was involved so I can trawl the list and better understand it?
>
> I wrote a lengthy explanation of why moving tasks between cgroups is
> problematic from a memcg view: https://lkml.org/lkml/2014/12/19/358
>
> Rather than creating super-cgroups as configuration domains and using
> migration as a reconfiguration mechanism, it's much better to group
> tasks per app and reconfigure the groups themselves using specific
> presets for classes of apps, like foreground, background, audio.
Hmm. So I've not heard from the Android guys on the viability of this,
but one downside to this approach seems like that for cpu-scheduling
at least, using a per-task allocation would make it harder to bound
"background" cpu processing as a whole, no?
For example, In order to keep all background tasks to less then 20% of
the cputime, you'd have to set each process to get .2/nr_tasks (which
you'd have to recalculate & re-assign every time a new task is
launched). But that would end up being wasteful since each task gets
limited by that amount and cannot make use of cputime not used by
other background tasks.
Or am I misunderstanding the suggestion? Or is your concern just
limited to the memcg and this sort of cgroup migration ok for
scheduling?
And I guess, if its ok for scheduling, is it reasonable to consider
the proposed patch to allow "privileged" but non-root tasks to migrate
processes between cgroups?
thanks
-john
On Thu, Jun 4, 2015 at 11:36 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, Jun 04, 2015 at 10:11:17AM -0700, John Stultz wrote:
>> On Tue, Jun 2, 2015 at 10:50 PM, Tejun Heo <[email protected]> wrote:
>> > memcg usage came up a while ago and there wasn't anything major which
>> > can't be achieved (usually better) by following more standard cgroup
>> > usage - changing knobs rather than moving tasks around.
>>
>> Do you have a pointer to that discussion or maybe even just a sense of
>> who was involved so I can trawl the list and better understand it?
>
> I wrote a lengthy explanation of why moving tasks between cgroups is
> problematic from a memcg view: https://lkml.org/lkml/2014/12/19/358
And many thanks for the discussion pointer here! Its very enlightening!
thanks
-john