2023-07-07 20:13:32

by Phil Auld

[permalink] [raw]
Subject: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota

In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min. It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 11 +++++++----
kernel/sched/fair.c | 7 ++++---
kernel/sched/sched.h | 2 +-
3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..1b214e10c25d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9904,7 +9904,7 @@ void __init sched_init(void)
ptr += nr_cpu_ids * sizeof(void **);

root_task_group.shares = ROOT_TASK_GROUP_LOAD;
- init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
+ init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)

/*
* Ensure max(child_quota) <= parent_quota. On cgroup2,
- * always take the min. On cgroup1, only inherit when no
- * limit is set:
+ * always take the non-RUNTIME_INF min. On cgroup1, only
+ * inherit when no limit is set:
*/
if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
- quota = min(quota, parent_quota);
+ if (quota == RUNTIME_INF)
+ quota = parent_quota;
+ else if (parent_quota != RUNTIME_INF)
+ quota = min(quota, parent_quota);
} else {
if (quota == RUNTIME_INF)
quota = parent_quota;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..92381f9ecf37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
{
raw_spin_lock_init(&cfs_b->lock);
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
cfs_b->burst = 0;
+ cfs_b->hierarchical_quota = ((parent) ? parent->hierarchical_quota : RUNTIME_INF);

INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
@@ -6168,7 +6169,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
return 0;
}

-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}

#ifdef CONFIG_FAIR_GROUP_SCHED
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -12373,7 +12374,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

tg->shares = NICE_0_LOAD;

- init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+ init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));

for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..63822c9238cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -460,7 +460,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
struct sched_entity *parent);
-extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);

extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
--
2.31.1



2023-07-10 20:44:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota

Hello,

On Fri, Jul 07, 2023 at 03:57:47PM -0400, Phil Auld wrote:
> @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>
> /*
> * Ensure max(child_quota) <= parent_quota. On cgroup2,
> - * always take the min. On cgroup1, only inherit when no
> - * limit is set:
> + * always take the non-RUNTIME_INF min. On cgroup1, only
> + * inherit when no limit is set:
> */
> if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> - quota = min(quota, parent_quota);
> + if (quota == RUNTIME_INF)
> + quota = parent_quota;
> + else if (parent_quota != RUNTIME_INF)
> + quota = min(quota, parent_quota);

Can you please add a comment explaining how hierarchical_quota is used in
cgroup2? It seems like you're using it to detect whether bw limit is
enforced for a given cgroup, which is fine, but the above code in isolation
looks a bit curious because it doesn't serve any purpose by itself.

Thanks.

--
tejun

2023-07-10 21:48:12

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota

On Mon, Jul 10, 2023 at 10:17:08AM -1000 Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 07, 2023 at 03:57:47PM -0400, Phil Auld wrote:
> > @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >
> > /*
> > * Ensure max(child_quota) <= parent_quota. On cgroup2,
> > - * always take the min. On cgroup1, only inherit when no
> > - * limit is set:
> > + * always take the non-RUNTIME_INF min. On cgroup1, only
> > + * inherit when no limit is set:
> > */
> > if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> > - quota = min(quota, parent_quota);
> > + if (quota == RUNTIME_INF)
> > + quota = parent_quota;
> > + else if (parent_quota != RUNTIME_INF)
> > + quota = min(quota, parent_quota);
>
> Can you please add a comment explaining how hierarchical_quota is used in
> cgroup2? It seems like you're using it to detect whether bw limit is
> enforced for a given cgroup, which is fine, but the above code in isolation
> looks a bit curious because it doesn't serve any purpose by itself.

Sure, I can do that.



Cheers,
Phil

>
> Thanks.
>
> --
> tejun
>

--


2023-07-11 00:39:29

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota

Phil Auld <[email protected]> writes:

> In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> groups due to the previous fix simply taking the min. It should
> reflect a limit imposed at that level or by an ancestor. Even
> though cgroupv2 does not require child quota to be less than or
> equal to that of its ancestors the task group will still be
> constrained by such a quota so this should be shown here. Cgroupv1
> continues to set this correctly.
>
> In both cases, add initialization when a new task group is created
> based on the current parent's value (or RUNTIME_INF in the case of
> root_task_group). Otherwise, the field is wrong until a quota is
> changed after creation and __cfs_schedulable() is called.

Reviewed-by: Ben Segall <[email protected]>

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..1b214e10c25d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>
> /*
> * Ensure max(child_quota) <= parent_quota. On cgroup2,
> - * always take the min. On cgroup1, only inherit when no
> - * limit is set:
> + * always take the non-RUNTIME_INF min. On cgroup1, only
> + * inherit when no limit is set:
> */
> if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> - quota = min(quota, parent_quota);
> + if (quota == RUNTIME_INF)
> + quota = parent_quota;
> + else if (parent_quota != RUNTIME_INF)
> + quota = min(quota, parent_quota);
> } else {
> if (quota == RUNTIME_INF)
> quota = parent_quota;

I suppose you could also set RUNTIME_INF to be a positive value or
better yet just compare at unsigned, but it's not like config needs to
be fast, so no need to mess with that.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..92381f9ecf37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> }
>
> -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
> {
> raw_spin_lock_init(&cfs_b->lock);
> cfs_b->runtime = 0;
> cfs_b->quota = RUNTIME_INF;
> cfs_b->period = ns_to_ktime(default_cfs_period());
> cfs_b->burst = 0;
> + cfs_b->hierarchical_quota = ((parent) ? parent->hierarchical_quota : RUNTIME_INF);

Minor style nit: don't need any of these parens here.

2023-07-11 13:30:22

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] sched, cgroup: Restore meaning to hierarchical_quota

On Mon, Jul 10, 2023 at 05:00:11PM -0700 Benjamin Segall wrote:
> Phil Auld <[email protected]> writes:
>
> > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> > groups due to the previous fix simply taking the min. It should
> > reflect a limit imposed at that level or by an ancestor. Even
> > though cgroupv2 does not require child quota to be less than or
> > equal to that of its ancestors the task group will still be
> > constrained by such a quota so this should be shown here. Cgroupv1
> > continues to set this correctly.
> >
> > In both cases, add initialization when a new task group is created
> > based on the current parent's value (or RUNTIME_INF in the case of
> > root_task_group). Otherwise, the field is wrong until a quota is
> > changed after creation and __cfs_schedulable() is called.
>
> Reviewed-by: Ben Segall <[email protected]>
>

Thanks, I'll hold on to this for the next version where I update the comment
if that's okay. I was just going to send that but based on your comment
on patch 2 may just do a v6 of the whole thing.


Cheers,
Phil

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a68d1276bab0..1b214e10c25d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -11038,11 +11038,14 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >
> > /*
> > * Ensure max(child_quota) <= parent_quota. On cgroup2,
> > - * always take the min. On cgroup1, only inherit when no
> > - * limit is set:
> > + * always take the non-RUNTIME_INF min. On cgroup1, only
> > + * inherit when no limit is set:
> > */
> > if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> > - quota = min(quota, parent_quota);
> > + if (quota == RUNTIME_INF)
> > + quota = parent_quota;
> > + else if (parent_quota != RUNTIME_INF)
> > + quota = min(quota, parent_quota);
> > } else {
> > if (quota == RUNTIME_INF)
> > quota = parent_quota;
>
> I suppose you could also set RUNTIME_INF to be a positive value or
> better yet just compare at unsigned, but it's not like config needs to
> be fast, so no need to mess with that.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f55884..92381f9ecf37 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > }
> >
> > -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
> > {
> > raw_spin_lock_init(&cfs_b->lock);
> > cfs_b->runtime = 0;
> > cfs_b->quota = RUNTIME_INF;
> > cfs_b->period = ns_to_ktime(default_cfs_period());
> > cfs_b->burst = 0;
> > + cfs_b->hierarchical_quota = ((parent) ? parent->hierarchical_quota : RUNTIME_INF);
>
> Minor style nit: don't need any of these parens here.
>

--