2020-01-14 14:15:56

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH] sched/fair : prevent unlimited runtime on throttled group

When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.

We have the following sequence:

sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.

Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7b08d52db93..d0acc67336c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7062,8 +7062,15 @@ void sched_move_task(struct task_struct *tsk)

if (queued)
enqueue_task(rq, tsk, queue_flags);
- if (running)
+ if (running) {
set_next_task(rq, tsk);
+ /*
+ * After changing group, the running task may have joined a
+ * throttled one but it's still the running task. Trigger a
+ * resched to make sure that task can still run.
+ */
+ resched_curr(rq);
+ }

task_rq_unlock(rq, tsk, &rf);
}
--
2.7.4


2020-01-14 18:30:54

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair : prevent unlimited runtime on throttled group

Vincent Guittot <[email protected]> writes:

> When a running task is moved on a throttled task group and there is no
> other task enqueued on the CPU, the task can keep running using 100% CPU
> whatever the allocated bandwidth for the group and although its cfs rq is
> throttled. Furthermore, the group entity of the cfs_rq and its parents are
> not enqueued but only set as curr on their respective cfs_rqs.
>
> We have the following sequence:
>
> sched_move_task
> -dequeue_task: dequeue task and group_entities.
> -put_prev_task: put task and group entities.
> -sched_change_group: move task to new group.
> -enqueue_task: enqueue only task but not group entities because cfs_rq is
> throttled.
> -set_next_task : set task and group_entities as current sched_entity of
> their cfs_rq.
>
> Another impact is that the root cfs_rq runnable_load_avg at root rq stays
> null because the group_entities are not enqueued. This situation will stay
> the same until an "external" event triggers a reschedule. Let trigger it
> immediately instead.

Sounds reasonable to me, "moved group" being an explicit resched check
doesn't sound like a problem in general.

>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/core.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e7b08d52db93..d0acc67336c0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7062,8 +7062,15 @@ void sched_move_task(struct task_struct *tsk)
>
> if (queued)
> enqueue_task(rq, tsk, queue_flags);
> - if (running)
> + if (running) {
> set_next_task(rq, tsk);
> + /*
> + * After changing group, the running task may have joined a
> + * throttled one but it's still the running task. Trigger a
> + * resched to make sure that task can still run.
> + */
> + resched_curr(rq);
> + }
>
> task_rq_unlock(rq, tsk, &rf);
> }

2020-01-20 09:49:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair : prevent unlimited runtime on throttled group

On Tue, Jan 14, 2020 at 10:29:43AM -0800, [email protected] wrote:
> Vincent Guittot <[email protected]> writes:
>
> > When a running task is moved on a throttled task group and there is no
> > other task enqueued on the CPU, the task can keep running using 100% CPU
> > whatever the allocated bandwidth for the group and although its cfs rq is
> > throttled. Furthermore, the group entity of the cfs_rq and its parents are
> > not enqueued but only set as curr on their respective cfs_rqs.
> >
> > We have the following sequence:
> >
> > sched_move_task
> > -dequeue_task: dequeue task and group_entities.
> > -put_prev_task: put task and group entities.
> > -sched_change_group: move task to new group.
> > -enqueue_task: enqueue only task but not group entities because cfs_rq is
> > throttled.
> > -set_next_task : set task and group_entities as current sched_entity of
> > their cfs_rq.
> >
> > Another impact is that the root cfs_rq runnable_load_avg at root rq stays
> > null because the group_entities are not enqueued. This situation will stay
> > the same until an "external" event triggers a reschedule. Let trigger it
> > immediately instead.
>
> Sounds reasonable to me, "moved group" being an explicit resched check
> doesn't sound like a problem in general.

Do I read that as an Ack from you Ben? :-)

2020-01-21 18:28:06

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH] sched/fair : prevent unlimited runtime on throttled group

Peter Zijlstra <[email protected]> writes:

> On Tue, Jan 14, 2020 at 10:29:43AM -0800, [email protected] wrote:
>> Vincent Guittot <[email protected]> writes:
>>
>> > When a running task is moved on a throttled task group and there is no
>> > other task enqueued on the CPU, the task can keep running using 100% CPU
>> > whatever the allocated bandwidth for the group and although its cfs rq is
>> > throttled. Furthermore, the group entity of the cfs_rq and its parents are
>> > not enqueued but only set as curr on their respective cfs_rqs.
>> >
>> > We have the following sequence:
>> >
>> > sched_move_task
>> > -dequeue_task: dequeue task and group_entities.
>> > -put_prev_task: put task and group entities.
>> > -sched_change_group: move task to new group.
>> > -enqueue_task: enqueue only task but not group entities because cfs_rq is
>> > throttled.
>> > -set_next_task : set task and group_entities as current sched_entity of
>> > their cfs_rq.
>> >
>> > Another impact is that the root cfs_rq runnable_load_avg at root rq stays
>> > null because the group_entities are not enqueued. This situation will stay
>> > the same until an "external" event triggers a reschedule. Let trigger it
>> > immediately instead.
>>
>> Sounds reasonable to me, "moved group" being an explicit resched check
>> doesn't sound like a problem in general.
>
> Do I read that as an Ack from you Ben? :-)

Yeah,

Acked-by: Ben Segall <[email protected]>

The only question I see is if we care about avoiding the overhead for
non-cfsb cases, but cgroup attach is already slow enough that it's
probably not a real problem, and it's reasonable to check if it's still
right to run this task in general.

2020-01-29 11:35:28

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: sched/core] sched/fair: Prevent unlimited runtime on throttled group

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d
Gitweb: https://git.kernel.org/tip/2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 14 Jan 2020 15:13:56 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:36:58 +01:00

sched/fair: Prevent unlimited runtime on throttled group

When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.

We have the following sequence:

sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.

Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Ben Segall <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a8a5d5b..89e54f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7072,8 +7072,15 @@ void sched_move_task(struct task_struct *tsk)

if (queued)
enqueue_task(rq, tsk, queue_flags);
- if (running)
+ if (running) {
set_next_task(rq, tsk);
+ /*
+ * After changing group, the running task may have joined a
+ * throttled one but it's still the running task. Trigger a
+ * resched to make sure that task can still run.
+ */
+ resched_curr(rq);
+ }

task_rq_unlock(rq, tsk, &rf);
}