2021-05-10 14:55:38

by Qais Yousef

[permalink] [raw]
Subject: [PATCH RESEND 0/2] Uclamp cgroup fixes

A couple of fixes for uclamp cgroup. The first one is addressing wrong
implementation details that is restricting usability of uclamp in cgroup. The
2nd is fixing locking issues.

This is a resend since I forgot to CC LKML...

Qais Yousef (2):
sched/uclamp: Fix wrong implementation of cpu.uclamp.min
sched/uclamp: Fix locking around cpu_util_update_eff()

kernel/sched/core.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

--
2.25.1


2021-05-10 14:57:34

by Qais Yousef

[permalink] [raw]
Subject: [PATCH RESEND 1/2] sched/uclamp: Fix wrong implementation of cpu.uclamp.min

cpu.uclamp.min is a protection as described in cgroup-v2 Resource
Distribution Model

Documentation/admin-guide/cgroup-v2.rst

which means we try our best to preserve the minimum performance point of
tasks in this group. See full description of cpu.uclamp.min in the
cgroup-v2.rst.

But the current implementation makes it a limit, which is not what was
intended.

For example:

tg->cpu.uclamp.min = 20%

p0->uclamp[UCLAMP_MIN] = 0
p1->uclamp[UCLAMP_MIN] = 50%

Previous Behavior (limit):

p0->effective_uclamp = 0
p1->effective_uclamp = 20%

New Behavior (Protection):

p0->effective_uclamp = 20%
p1->effective_uclamp = 50%

Which is inline with how protections should work.

With this change the cgroup and per-task behaviors are the same, as
expected.

Additionally, we remove the confusing relationship between cgroup and
!user_defined flag.

We don't want for example RT tasks that are boosted by default to max to
change their boost value when they attach to a cgroup. If a cgroup wants
to limit the max performance point of tasks attached to it, then
cpu.uclamp.max must be set accordingly.

Or if they want to set different boost value based on cgroup, then
sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
and set the right cpu.uclamp.min for each group to let the RT tasks
obtain the desired boost value when attached to that group.

As it stands the dependency on !user_defined flag adds an extra layer of
complexity that is not required now cpu.uclamp.min behaves properly as
a protection.

The propagation model of effective cpu.uclamp.min in child cgroups as
implemented by cpu_util_update_eff() is still correct. The parent
protection sets an upper limit of what the child cgroups will
effectively get.

Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/core.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95bd6ab8115d..b61329299379 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1057,7 +1057,6 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
#ifdef CONFIG_UCLAMP_TASK_GROUP
- struct uclamp_se uc_max;

/*
* Tasks in autogroups or root task group will be
@@ -1068,9 +1067,23 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
if (task_group(p) == &root_task_group)
return uc_req;

- uc_max = task_group(p)->uclamp[clamp_id];
- if (uc_req.value > uc_max.value || !uc_req.user_defined)
- return uc_max;
+ switch (clamp_id) {
+ case UCLAMP_MIN: {
+ struct uclamp_se uc_min = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value < uc_min.value)
+ return uc_min;
+ break;
+ }
+ case UCLAMP_MAX: {
+ struct uclamp_se uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value)
+ return uc_max;
+ break;
+ }
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
#endif

return uc_req;
--
2.25.1

2021-05-10 14:57:35

by Qais Yousef

[permalink] [raw]
Subject: [PATCH RESEND 2/2] sched/uclamp: Fix locking around cpu_util_update_eff()

cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the
uclamp_mutex or rcu_read_lock() like other call sites, which is
a mistake.

The uclamp_mutex is required to protect against concurrent reads and
writes that could update the cgroup hierarchy.

The rcu_read_lock() is required to traverse the cgroup data structures
in cpu_util_update_eff().

Surround the caller with the required locks and add some asserts to
better document the dependency in cpu_util_update_eff().

Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
Reported-by: Quentin Perret <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---

There was no real failure observed because of this. Quentin just observed the
oddity and reported it.

kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b61329299379..efa15287d09e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8690,7 +8690,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)

#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Propagate the effective uclamp value for the new group */
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
cpu_util_update_eff(css);
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
#endif

return 0;
@@ -8780,6 +8784,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
enum uclamp_id clamp_id;
unsigned int clamps;

+ lockdep_assert_held(&uclamp_mutex);
+ SCHED_WARN_ON(!rcu_read_lock_held());
+
css_for_each_descendant_pre(css, top_css) {
uc_parent = css_tg(css)->parent
? css_tg(css)->parent->uclamp : NULL;
--
2.25.1

2021-05-13 20:01:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RESEND 0/2] Uclamp cgroup fixes

On Mon, May 10, 2021 at 03:50:30PM +0100, Qais Yousef wrote:
> A couple of fixes for uclamp cgroup. The first one is addressing wrong
> implementation details that is restricting usability of uclamp in cgroup. The
> 2nd is fixing locking issues.
>
> This is a resend since I forgot to CC LKML...
>
> Qais Yousef (2):
> sched/uclamp: Fix wrong implementation of cpu.uclamp.min
> sched/uclamp: Fix locking around cpu_util_update_eff()
>
> kernel/sched/core.c | 28 ++++++++++++++++++++++++----
> 1 file changed, 24 insertions(+), 4 deletions(-)

Thanks!

2021-05-19 19:21:34

by tip-bot2 for Haifeng Xu

[permalink] [raw]
Subject: [tip: sched/core] sched/uclamp: Fix wrong implementation of cpu.uclamp.min

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 6938840392c89f0ef81e9efe51e2efcdd209fd83
Gitweb: https://git.kernel.org/tip/6938840392c89f0ef81e9efe51e2efcdd209fd83
Author: Qais Yousef <[email protected]>
AuthorDate: Mon, 10 May 2021 15:50:31 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 May 2021 12:53:54 +02:00

sched/uclamp: Fix wrong implementation of cpu.uclamp.min

cpu.uclamp.min is a protection as described in cgroup-v2 Resource
Distribution Model

Documentation/admin-guide/cgroup-v2.rst

which means we try our best to preserve the minimum performance point of
tasks in this group. See full description of cpu.uclamp.min in the
cgroup-v2.rst.

But the current implementation makes it a limit, which is not what was
intended.

For example:

tg->cpu.uclamp.min = 20%

p0->uclamp[UCLAMP_MIN] = 0
p1->uclamp[UCLAMP_MIN] = 50%

Previous Behavior (limit):

p0->effective_uclamp = 0
p1->effective_uclamp = 20%

New Behavior (Protection):

p0->effective_uclamp = 20%
p1->effective_uclamp = 50%

Which is inline with how protections should work.

With this change the cgroup and per-task behaviors are the same, as
expected.

Additionally, we remove the confusing relationship between cgroup and
!user_defined flag.

We don't want for example RT tasks that are boosted by default to max to
change their boost value when they attach to a cgroup. If a cgroup wants
to limit the max performance point of tasks attached to it, then
cpu.uclamp.max must be set accordingly.

Or if they want to set different boost value based on cgroup, then
sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
and set the right cpu.uclamp.min for each group to let the RT tasks
obtain the desired boost value when attached to that group.

As it stands the dependency on !user_defined flag adds an extra layer of
complexity that is not required now cpu.uclamp.min behaves properly as
a protection.

The propagation model of effective cpu.uclamp.min in child cgroups as
implemented by cpu_util_update_eff() is still correct. The parent
protection sets an upper limit of what the child cgroups will
effectively get.

Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a5124c..f97eb73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1405,7 +1405,6 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
#ifdef CONFIG_UCLAMP_TASK_GROUP
- struct uclamp_se uc_max;

/*
* Tasks in autogroups or root task group will be
@@ -1416,9 +1415,23 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
if (task_group(p) == &root_task_group)
return uc_req;

- uc_max = task_group(p)->uclamp[clamp_id];
- if (uc_req.value > uc_max.value || !uc_req.user_defined)
- return uc_max;
+ switch (clamp_id) {
+ case UCLAMP_MIN: {
+ struct uclamp_se uc_min = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value < uc_min.value)
+ return uc_min;
+ break;
+ }
+ case UCLAMP_MAX: {
+ struct uclamp_se uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value)
+ return uc_max;
+ break;
+ }
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
#endif

return uc_req;

2021-05-19 19:22:08

by tip-bot2 for Haifeng Xu

[permalink] [raw]
Subject: [tip: sched/core] sched/uclamp: Fix locking around cpu_util_update_eff()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: b837122e297fd0429555243c16ac3c341c39a7f5
Gitweb: https://git.kernel.org/tip/b837122e297fd0429555243c16ac3c341c39a7f5
Author: Qais Yousef <[email protected]>
AuthorDate: Mon, 10 May 2021 15:50:32 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 May 2021 12:53:54 +02:00

sched/uclamp: Fix locking around cpu_util_update_eff()

cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the
uclamp_mutex or rcu_read_lock() like other call sites, which is
a mistake.

The uclamp_mutex is required to protect against concurrent reads and
writes that could update the cgroup hierarchy.

The rcu_read_lock() is required to traverse the cgroup data structures
in cpu_util_update_eff().

Surround the caller with the required locks and add some asserts to
better document the dependency in cpu_util_update_eff().

Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
Reported-by: Quentin Perret <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f97eb73..3ec420c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9507,7 +9507,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)

#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Propagate the effective uclamp value for the new group */
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
cpu_util_update_eff(css);
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
#endif

return 0;
@@ -9597,6 +9601,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
enum uclamp_id clamp_id;
unsigned int clamps;

+ lockdep_assert_held(&uclamp_mutex);
+ SCHED_WARN_ON(!rcu_read_lock_held());
+
css_for_each_descendant_pre(css, top_css) {
uc_parent = css_tg(css)->parent
? css_tg(css)->parent->uclamp : NULL;

2021-05-19 19:24:28

by tip-bot2 for Haifeng Xu

[permalink] [raw]
Subject: [tip: sched/core] sched/uclamp: Fix wrong implementation of cpu.uclamp.min

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445
Gitweb: https://git.kernel.org/tip/0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445
Author: Qais Yousef <[email protected]>
AuthorDate: Mon, 10 May 2021 15:50:31 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 19 May 2021 10:53:02 +02:00

sched/uclamp: Fix wrong implementation of cpu.uclamp.min

cpu.uclamp.min is a protection as described in cgroup-v2 Resource
Distribution Model

Documentation/admin-guide/cgroup-v2.rst

which means we try our best to preserve the minimum performance point of
tasks in this group. See full description of cpu.uclamp.min in the
cgroup-v2.rst.

But the current implementation makes it a limit, which is not what was
intended.

For example:

tg->cpu.uclamp.min = 20%

p0->uclamp[UCLAMP_MIN] = 0
p1->uclamp[UCLAMP_MIN] = 50%

Previous Behavior (limit):

p0->effective_uclamp = 0
p1->effective_uclamp = 20%

New Behavior (Protection):

p0->effective_uclamp = 20%
p1->effective_uclamp = 50%

Which is inline with how protections should work.

With this change the cgroup and per-task behaviors are the same, as
expected.

Additionally, we remove the confusing relationship between cgroup and
!user_defined flag.

We don't want for example RT tasks that are boosted by default to max to
change their boost value when they attach to a cgroup. If a cgroup wants
to limit the max performance point of tasks attached to it, then
cpu.uclamp.max must be set accordingly.

Or if they want to set different boost value based on cgroup, then
sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
and set the right cpu.uclamp.min for each group to let the RT tasks
obtain the desired boost value when attached to that group.

As it stands the dependency on !user_defined flag adds an extra layer of
complexity that is not required now cpu.uclamp.min behaves properly as
a protection.

The propagation model of effective cpu.uclamp.min in child cgroups as
implemented by cpu_util_update_eff() is still correct. The parent
protection sets an upper limit of what the child cgroups will
effectively get.

Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a5124c..f97eb73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1405,7 +1405,6 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
#ifdef CONFIG_UCLAMP_TASK_GROUP
- struct uclamp_se uc_max;

/*
* Tasks in autogroups or root task group will be
@@ -1416,9 +1415,23 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
if (task_group(p) == &root_task_group)
return uc_req;

- uc_max = task_group(p)->uclamp[clamp_id];
- if (uc_req.value > uc_max.value || !uc_req.user_defined)
- return uc_max;
+ switch (clamp_id) {
+ case UCLAMP_MIN: {
+ struct uclamp_se uc_min = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value < uc_min.value)
+ return uc_min;
+ break;
+ }
+ case UCLAMP_MAX: {
+ struct uclamp_se uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value)
+ return uc_max;
+ break;
+ }
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
#endif

return uc_req;

2021-05-19 19:25:00

by tip-bot2 for Haifeng Xu

[permalink] [raw]
Subject: [tip: sched/core] sched/uclamp: Fix locking around cpu_util_update_eff()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 93b73858701fd01de26a4a874eb95f9b7156fd4b
Gitweb: https://git.kernel.org/tip/93b73858701fd01de26a4a874eb95f9b7156fd4b
Author: Qais Yousef <[email protected]>
AuthorDate: Mon, 10 May 2021 15:50:32 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 19 May 2021 10:53:02 +02:00

sched/uclamp: Fix locking around cpu_util_update_eff()

cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the
uclamp_mutex or rcu_read_lock() like other call sites, which is
a mistake.

The uclamp_mutex is required to protect against concurrent reads and
writes that could update the cgroup hierarchy.

The rcu_read_lock() is required to traverse the cgroup data structures
in cpu_util_update_eff().

Surround the caller with the required locks and add some asserts to
better document the dependency in cpu_util_update_eff().

Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
Reported-by: Quentin Perret <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f97eb73..3ec420c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9507,7 +9507,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)

#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Propagate the effective uclamp value for the new group */
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
cpu_util_update_eff(css);
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
#endif

return 0;
@@ -9597,6 +9601,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
enum uclamp_id clamp_id;
unsigned int clamps;

+ lockdep_assert_held(&uclamp_mutex);
+ SCHED_WARN_ON(!rcu_read_lock_held());
+
css_for_each_descendant_pre(css, top_css) {
uc_parent = css_tg(css)->parent
? css_tg(css)->parent->uclamp : NULL;