In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
important work. Given that CPU shares of root cgroup can't be changed,
leaving the tasks inside root cgroup will give them higher share
compared to the other tasks inside important cgroups. This is mitigated
by moving all tasks inside root cgroup to a different cgroup after
Android is booted. However, there are many kernel tasks stuck in the
root cgroup after the boot.
We see all kworker threads are in the root cpu cgroup. This is because,
tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
This restriction is in place to avoid kworkers getting moved to a cpuset
which conflicts with kworker affinity. Relax this restriction by explicitly
checking if the task is moving out of a cpuset cgroup. This allows kworkers
to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
are mounted on different hierarchies.
We also see kthreadd_task and any kernel thread created after the Android boot
also stuck in the root cgroup. The current code prevents kthreadd_task moving
out root cgroup to avoid the possibility of creating new RT kernel threads
inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
kernel threads to be moved out of root cpu cgroup if the kernel does not
enable RT group scheduling.
[1] https://android.googlesource.com/kernel/common/+/f08f049de11c15a4251cb1db08cf0bee20bd9b59
Signed-off-by: Pavankumar Kondeti <[email protected]>
---
v2:
- Added cgroup_task_migration_allowed() wrapper function
kernel/cgroup/cgroup-internal.h | 28 +++++++++++++++++++++++++++-
kernel/cgroup/cgroup-v1.c | 2 +-
kernel/cgroup/cgroup.c | 13 ++++---------
3 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc..cd69302 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -202,6 +202,31 @@ static inline void get_css_set(struct css_set *cset)
refcount_inc(&cset->refcount);
}
+static inline bool cgroup_task_migration_allowed(struct task_struct *tsk,
+ struct cgroup *dst_cgrp)
+{
+ /*
+ * RT kthreads may be born in a cgroup with no rt_runtime allocated.
+ * Just say no.
+ */
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (tsk->no_cgroup_migration && (dst_cgrp->root->subsys_mask & (1U << cpu_cgrp_id)))
+ return false;
+#endif
+
+ /*
+ * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+ * If userland migrates such a kthread to a non-root cgroup, it can
+ * become trapped in a cpuset. Just say no.
+ */
+#ifdef CONFIG_CPUSETS
+ if ((tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) &&
+ (dst_cgrp->root->subsys_mask & (1U << cpuset_cgrp_id)))
+ return false;
+#endif
+ return true;
+}
+
bool cgroup_ssid_enabled(int ssid);
bool cgroup_on_dfl(const struct cgroup *cgrp);
bool cgroup_is_thread_root(struct cgroup *cgrp);
@@ -232,7 +257,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
bool threadgroup);
struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
- bool *locked)
+ bool *locked,
+ struct cgroup *dst_cgrp)
__acquires(&cgroup_threadgroup_rwsem);
void cgroup_procs_write_finish(struct task_struct *task, bool locked)
__releases(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index a575178..d674a6c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -497,7 +497,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
if (!cgrp)
return -ENODEV;
- task = cgroup_procs_write_start(buf, threadgroup, &locked);
+ task = cgroup_procs_write_start(buf, threadgroup, &locked, cgrp);
ret = PTR_ERR_OR_ZERO(task);
if (ret)
goto out_unlock;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20..44cc653 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2744,7 +2744,8 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
}
struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
- bool *locked)
+ bool *locked,
+ struct cgroup *dst_cgrp)
__acquires(&cgroup_threadgroup_rwsem)
{
struct task_struct *tsk;
@@ -2783,13 +2784,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
if (threadgroup)
tsk = tsk->group_leader;
- /*
- * kthreads may acquire PF_NO_SETAFFINITY during initialization.
- * If userland migrates such a kthread to a non-root cgroup, it can
- * become trapped in a cpuset, or RT kthread may be born in a
- * cgroup with no rt_runtime allocated. Just say no.
- */
- if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
+ if (!cgroup_task_migration_allowed(tsk, dst_cgrp)) {
tsk = ERR_PTR(-EINVAL);
goto out_unlock_threadgroup;
}
@@ -4740,7 +4735,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
if (!dst_cgrp)
return -ENODEV;
- task = cgroup_procs_write_start(buf, threadgroup, &locked);
+ task = cgroup_procs_write_start(buf, threadgroup, &locked, dst_cgrp);
ret = PTR_ERR_OR_ZERO(task);
if (ret)
goto out_unlock;
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
Hello,
On Tue, Apr 06, 2021 at 06:34:21PM +0530, Pavankumar Kondeti wrote:
> In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> important work. Given that CPU shares of root cgroup can't be changed,
> leaving the tasks inside root cgroup will give them higher share
> compared to the other tasks inside important cgroups. This is mitigated
> by moving all tasks inside root cgroup to a different cgroup after
> Android is booted. However, there are many kernel tasks stuck in the
> root cgroup after the boot.
>
> We see all kworker threads are in the root cpu cgroup. This is because,
> tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> This restriction is in place to avoid kworkers getting moved to a cpuset
> which conflicts with kworker affinity. Relax this restriction by explicitly
> checking if the task is moving out of a cpuset cgroup. This allows kworkers
> to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
> are mounted on different hierarchies.
>
> We also see kthreadd_task and any kernel thread created after the Android boot
> also stuck in the root cgroup. The current code prevents kthreadd_task moving
> out root cgroup to avoid the possibility of creating new RT kernel threads
> inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
> are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> kernel threads to be moved out of root cpu cgroup if the kernel does not
> enable RT group scheduling.
The fundamental reason why those kthreads are in the root cgroup is because
they're doing work on behalf of the entire system and their resource usages
can't be attributed to any specific cgroup. What we want to do is accounting
actual usages to the originating cgroups so that cpu cycles spent by kswapd
is charged to the originating cgroups, however well we can define them, and
then throttle the origin if the consumption is going over budget for that
cgroup's allocation. This is how we already handle shared IOs.
The problem with the proposed patch is that it breaks the logical
organization of resource hierarchy in a way which hinders proper future
solutions.
If all you want is deprioritizing certain kworkers, please use workqueue
attrs instead.
Thanks.
--
tejun
Hi Tejun,
On Tue, Apr 06, 2021 at 09:36:00AM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Apr 06, 2021 at 06:34:21PM +0530, Pavankumar Kondeti wrote:
> > In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> > important work. Given that CPU shares of root cgroup can't be changed,
> > leaving the tasks inside root cgroup will give them higher share
> > compared to the other tasks inside important cgroups. This is mitigated
> > by moving all tasks inside root cgroup to a different cgroup after
> > Android is booted. However, there are many kernel tasks stuck in the
> > root cgroup after the boot.
> >
> > We see all kworker threads are in the root cpu cgroup. This is because,
> > tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> > This restriction is in place to avoid kworkers getting moved to a cpuset
> > which conflicts with kworker affinity. Relax this restriction by explicitly
> > checking if the task is moving out of a cpuset cgroup. This allows kworkers
> > to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
> > are mounted on different hierarchies.
> >
> > We also see kthreadd_task and any kernel thread created after the Android boot
> > also stuck in the root cgroup. The current code prevents kthreadd_task moving
> > out root cgroup to avoid the possibility of creating new RT kernel threads
> > inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
> > are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> > kernel threads to be moved out of root cpu cgroup if the kernel does not
> > enable RT group scheduling.
>
> The fundamental reason why those kthreads are in the root cgroup is because
> they're doing work on behalf of the entire system and their resource usages
> can't be attributed to any specific cgroup. What we want to do is accounting
> actual usages to the originating cgroups so that cpu cycles spent by kswapd
> is charged to the originating cgroups, however well we can define them, and
> then throttle the origin if the consumption is going over budget for that
> cgroup's allocation. This is how we already handle shared IOs.
Thanks for your reply. I understand the reasoning on why we don't allow
kworkers to a custom cgroup.
>
> The problem with the proposed patch is that it breaks the logical
> organization of resource hierarchy in a way which hinders proper future
> solutions.
>
> If all you want is deprioritizing certain kworkers, please use workqueue
> attrs instead.
>
Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
background workqueue if we identify that it is cpu intensive. However, this
needs case by case analysis, tweaking etc. If there is no other alternative,
we might end up chasing the background workers and reduce their nice value.
The problem at our hand (which you might be knowing already) is that, lets say
we have 2 cgroups in our setup and we want to prioritize UX over background.
IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
can potentially have CPU share equal to the entire UX cgroup.
-kworker/0:0
-kworker/1:0
- UX
----Task-A
----Task-B
- background
----Task-X
----Task-Y
Hence we want to move all kernel threads to another cgroup so that this cgroup
will have CPU share equal to UX.
The patch presented here allows us to create the above setup. Any other
alternative approaches to achieve the same without impacting any future
designs/requirements?
Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
Hello,
On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> background workqueue if we identify that it is cpu intensive. However, this
> needs case by case analysis, tweaking etc. If there is no other alternative,
> we might end up chasing the background workers and reduce their nice value.
There shouldn't be that many workqueues that consume a lot of cpu cycles.
The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
shouldn't be a long list and we want them identified anyway.
> The problem at our hand (which you might be knowing already) is that, lets say
> we have 2 cgroups in our setup and we want to prioritize UX over background.
> IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> can potentially have CPU share equal to the entire UX cgroup.
>
> -kworker/0:0
> -kworker/1:0
> - UX
> ----Task-A
> ----Task-B
> - background
> ----Task-X
> ----Task-Y
> Hence we want to move all kernel threads to another cgroup so that this cgroup
> will have CPU share equal to UX.
>
> The patch presented here allows us to create the above setup. Any other
> alternative approaches to achieve the same without impacting any future
> designs/requirements?
Not quite the same but we already have
/sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
so maybe a global default priority knob would help here?
Thanks.
--
tejun
On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> > background workqueue if we identify that it is cpu intensive. However, this
> > needs case by case analysis, tweaking etc. If there is no other alternative,
> > we might end up chasing the background workers and reduce their nice value.
>
> There shouldn't be that many workqueues that consume a lot of cpu cycles.
> The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> shouldn't be a long list and we want them identified anyway.
>
Sure. I have not done a complete study on which workers in our system can
compete with important tasks in other cgroups. We will have to do that to
adjust the workqueue priority so that the impact can be minimized.
> > The problem at our hand (which you might be knowing already) is that, lets say
> > we have 2 cgroups in our setup and we want to prioritize UX over background.
> > IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> > Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> > can potentially have CPU share equal to the entire UX cgroup.
> >
> > -kworker/0:0
> > -kworker/1:0
> > - UX
> > ----Task-A
> > ----Task-B
> > - background
> > ----Task-X
> > ----Task-Y
> > Hence we want to move all kernel threads to another cgroup so that this cgroup
> > will have CPU share equal to UX.
> >
> > The patch presented here allows us to create the above setup. Any other
> > alternative approaches to achieve the same without impacting any future
> > designs/requirements?
>
> Not quite the same but we already have
> /sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
> so maybe a global default priority knob would help here?
>
yeah, not exactly what we are looking for. It gives us the abiility to restrict
the priority of all unbound workqueues at the expense of not being able to
prioritize one workqueue over another workqueue.
Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
On Tue, Apr 6, 2021 at 6:39 PM Pavan Kondeti <[email protected]> wrote:
>
> On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > > Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> > > background workqueue if we identify that it is cpu intensive. However, this
> > > needs case by case analysis, tweaking etc. If there is no other alternative,
> > > we might end up chasing the background workers and reduce their nice value.
> >
> > There shouldn't be that many workqueues that consume a lot of cpu cycles.
> > The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> > shouldn't be a long list and we want them identified anyway.
> >
> Sure. I have not done a complete study on which workers in our system can
> compete with important tasks in other cgroups. We will have to do that to
> adjust the workqueue priority so that the impact can be minimized.
>
kswapd0 is actually migratable to subgroup.
But echo what Pavan said, the real world is not ideal and the
problematice drivers are inactive when the use case is not activated
in Android, e.g. connectivity, camera, etc. It is tricky sometimes to
track all of those.
> > > The problem at our hand (which you might be knowing already) is that, lets say
> > > we have 2 cgroups in our setup and we want to prioritize UX over background.
> > > IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> > > Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> > > can potentially have CPU share equal to the entire UX cgroup.
> > >
> > > -kworker/0:0
> > > -kworker/1:0
> > > - UX
> > > ----Task-A
> > > ----Task-B
> > > - background
> > > ----Task-X
> > > ----Task-Y
> > > Hence we want to move all kernel threads to another cgroup so that this cgroup
> > > will have CPU share equal to UX.
> > >
> > > The patch presented here allows us to create the above setup. Any other
> > > alternative approaches to achieve the same without impacting any future
> > > designs/requirements?
> >
> > Not quite the same but we already have
> > /sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
> > so maybe a global default priority knob would help here?
> >
>
> yeah, not exactly what we are looking for. It gives us the abiility to restrict
> the priority of all unbound workqueues at the expense of not being able to
> prioritize one workqueue over another workqueue.
>
Same here, Android used to have its cgroup setup like this, where the
BG group can be starved too long potentially (and sometimes PI is not
inevitable of course, that's the reason why I removed BG cgroup in
Android (https://android-review.googlesource.com/q/topic:remove_bg_cgroup)).
-Task-A
-Task-B
-kworker/0:0
- background
----Task-X
----Task-Y
So having things left in root will post the same risk,
> Thanks,
> Pavan
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
>