2014-06-24 07:53:58

by Kirill Tkhai

Subject: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

We kill rq->rd at the CPU_DOWN_PREPARE stage:

cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
-> cpu_attach_domain -> rq_attach_root -> set_rq_offline

This unthrottles all throttled cfs_rqs.
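
(For reference, the offline unthrottle path in question looks roughly
like this; an abridged sketch of the pre-patch kernel/sched/fair.c,
reconstructed from the diff context below, so treat it as approximate:)

static void unthrottle_offline_cfs_rqs(struct rq *rq)
{
        struct cfs_rq *cfs_rq;

        for_each_leaf_cfs_rq(rq, cfs_rq) {
                if (!cfs_rq->runtime_enabled)
                        continue;
                /*
                 * clock_task is not advancing, so make sure
                 * there's some valid quota amount
                 */
                cfs_rq->runtime_remaining = 1;
                if (cfs_rq_throttled(cfs_rq))
                        unthrottle_cfs_rq(cfs_rq);
        }
}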

But the cpu is still able to call schedule() until

take_cpu_down->__cpu_disable()

is called from stop_machine.

In this case the tasks from the just-unthrottled cfs_rqs are
pickable in the standard scheduler way, and they are picked by the
dying cpu. The cfs_rqs become throttled again, and migrate_tasks()
in migration_call() skips their tasks (one more unthrottle in
migrate_tasks()->CPU_DYING does not happen, because rq->rd is
already NULL).

The patch sets runtime_enabled to zero. This guarantees that the
runtime is not accounted, that the cfs_rqs won't use up the given
cfs_rq->runtime_remaining = 1 and become throttled again, and that
tasks will be pickable in migrate_tasks(). runtime_enabled is
recalculated when the rq becomes online again.
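
(This works because the bandwidth accounting fast path bails out when
runtime_enabled is clear; in kernels of this era the guard in
kernel/sched/fair.c looked essentially like this, quoted approximately:)

static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
        if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
                return;

        __account_cfs_rq_runtime(cfs_rq, delta_exec);
}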

Ben Segall also noticed that we always enable runtime in
tg_set_cfs_bandwidth(). Actually, we should do that for online
cpus only. To fix that, we check whether a cpu is online while
its rq is locked. This guarantees we do not race with
set_rq_offline(), which also requires rq->lock.

v2: Fix race with tg_set_cfs_bandwidth().
Move cfs_rq->runtime_enabled=0 above unthrottle_cfs_rq().

Signed-off-by: Kirill Tkhai <[email protected]>
CC: Konstantin Khorenko <[email protected]>
CC: Ben Segall <[email protected]>
CC: Paul Turner <[email protected]>
CC: Srikar Dronamraju <[email protected]>
CC: Mike Galbraith <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 15 +++++++++++----
kernel/sched/fair.c | 22 ++++++++++++++++++++++
2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f3063c..707a3c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7842,11 +7842,18 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
struct rq *rq = cfs_rq->rq;

raw_spin_lock_irq(&rq->lock);
- cfs_rq->runtime_enabled = runtime_enabled;
- cfs_rq->runtime_remaining = 0;
+ /*
+ * Do not enable runtime on offline runqueues. We specially
+ * make it disabled in unthrottle_offline_cfs_rqs().
+ */
+ if (cpu_online(i)) {
+ cfs_rq->runtime_enabled = runtime_enabled;
+ cfs_rq->runtime_remaining = 0;
+
+ if (cfs_rq->throttled)
+ unthrottle_cfs_rq(cfs_rq);
+ }

- if (cfs_rq->throttled)
- unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
if (runtime_was_enabled && !runtime_enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f9c457..5616d23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3776,6 +3776,19 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
hrtimer_cancel(&cfs_b->slack_timer);
}

+static void __maybe_unused update_runtime_enabled(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+
+ raw_spin_lock(&cfs_b->lock);
+ cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
+ raw_spin_unlock(&cfs_b->lock);
+ }
+}
+
static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
struct cfs_rq *cfs_rq;
@@ -3789,6 +3802,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
* there's some valid quota amount
*/
cfs_rq->runtime_remaining = 1;
+ /*
+ * Offline rq is schedulable till cpu is completely disabled
+ * in take_cpu_down(), so we prevent new cfs throttling here.
+ */
+ cfs_rq->runtime_enabled = 0;
+
if (cfs_rq_throttled(cfs_rq))
unthrottle_cfs_rq(cfs_rq);
}
@@ -3832,6 +3851,7 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
return NULL;
}
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static inline void update_runtime_enabled(struct rq *rq) {}
static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}

#endif /* CONFIG_CFS_BANDWIDTH */
@@ -7325,6 +7345,8 @@ void trigger_load_balance(struct rq *rq)
static void rq_online_fair(struct rq *rq)
{
update_sysctl();
+
+ update_runtime_enabled(rq);
}

static void rq_offline_fair(struct rq *rq)



2014-06-24 17:03:55

by Benjamin Segall

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

Kirill Tkhai <[email protected]> writes:

> [...]
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f3063c..707a3c5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7842,11 +7842,18 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> struct rq *rq = cfs_rq->rq;
>
> raw_spin_lock_irq(&rq->lock);
> - cfs_rq->runtime_enabled = runtime_enabled;
> - cfs_rq->runtime_remaining = 0;
> + /*
> + * Do not enable runtime on offline runqueues. We specially
> + * make it disabled in unthrottle_offline_cfs_rqs().
> + */
> + if (cpu_online(i)) {
> + cfs_rq->runtime_enabled = runtime_enabled;
> + cfs_rq->runtime_remaining = 0;
> +
> + if (cfs_rq->throttled)
> + unthrottle_cfs_rq(cfs_rq);
> + }

We can just do for_each_online_cpu, yes? Also we probably need
get_online_cpus/put_online_cpus, and/or want cpu_active_mask instead
right?

2014-06-24 19:02:05

by Kirill Tkhai

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

On 24.06.2014 21:03, [email protected] wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> [...]
>
> We can just do for_each_online_cpu, yes? Also we probably need
> get_online_cpus/put_online_cpus, and/or want cpu_active_mask instead
> right?
>

Yes, we can use for_each_online_cpu/for_each_active_cpu with
get_online_cpus() taken. But it adds one more lock dependency.
This looks worse to me.

2014-06-24 19:13:54

by Benjamin Segall

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

Kirill Tkhai <[email protected]> writes:

> On 24.06.2014 21:03, [email protected] wrote:
>> Kirill Tkhai <[email protected]> writes:
>>
>>> [...]
>>
>> We can just do for_each_online_cpu, yes? Also we probably need
>> get_online_cpus/put_online_cpus, and/or want cpu_active_mask instead
>> right?
>>
>
> Yes, we can use for_each_online_cpu/for_each_active_cpu with
> get_online_cpus() taken. But it adds one more lock dependency.
> This looks worse to me.

I mean, you need get_online_cpus anyway - cpu_online is just a test
against the same mask that for_each_online_cpu uses, and without taking
the lock you can still race with offlining and reset runtime_enabled.
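
(An illustrative sketch of that alternative, i.e. pinning the online
mask with the hotplug lock around the per-cpu loop in
tg_set_cfs_bandwidth(); a hypothetical variant for comparison, not
the fix that was ultimately applied:)

        get_online_cpus();
        for_each_online_cpu(i) {
                struct cfs_rq *cfs_rq = tg->cfs_rq[i];
                struct rq *rq = cfs_rq->rq;

                raw_spin_lock_irq(&rq->lock);
                cfs_rq->runtime_enabled = runtime_enabled;
                cfs_rq->runtime_remaining = 0;

                if (cfs_rq->throttled)
                        unthrottle_cfs_rq(cfs_rq);
                raw_spin_unlock_irq(&rq->lock);
        }
        put_online_cpus();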

2014-06-24 19:26:48

by Kirill Tkhai

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

On 24.06.2014 23:13, [email protected] wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 24.06.2014 21:03, [email protected] wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> [...]
>>>
>>> We can just do for_each_online_cpu, yes? Also we probably need
>>> get_online_cpus/put_online_cpus, and/or want cpu_active_mask instead
>>> right?
>>>
>>
>> Yes, we can use for_each_online_cpu/for_each_active_cpu with
>> get_online_cpus() taken. But it adds one more lock dependency.
>> This looks worse to me.
>
> I mean, you need get_online_cpus anyway - cpu_online is just a test
> against the same mask that for_each_online_cpu uses, and without taking
> the lock you can still race with offlining and reset runtime_enabled.
>

Oh, I see. Thanks.

2014-06-25 07:53:43

by Kirill Tkhai

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

On Tue, 24/06/2014 at 23:26 +0400, Kirill Tkhai wrote:
> On 24.06.2014 23:13, [email protected] wrote:
> > [...]
> > I mean, you need get_online_cpus anyway - cpu_online is just a test
> > against the same mask that for_each_online_cpu uses, and without taking
> > the lock you can still race with offlining and reset runtime_enabled.
> >
>
> Oh, I see. Thanks.

But we can check for rq->online, can't we? How about this?

sched/fair: Disable runtime_enabled on dying rq

We kill rq->rd at the CPU_DOWN_PREPARE stage:

cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
-> cpu_attach_domain -> rq_attach_root -> set_rq_offline

This unthrottles all throttled cfs_rqs.

But the cpu is still able to call schedule() until

take_cpu_down->__cpu_disable()

is called from stop_machine.

In this case the tasks from the just-unthrottled cfs_rqs are
pickable in the standard scheduler way, and they are picked by the
dying cpu. The cfs_rqs become throttled again, and migrate_tasks()
in migration_call() skips their tasks (one more unthrottle in
migrate_tasks()->CPU_DYING does not happen, because rq->rd is
already NULL).

The patch sets runtime_enabled to zero. This guarantees that the
runtime is not accounted, that the cfs_rqs won't use up the given
cfs_rq->runtime_remaining = 1 and become throttled again, and that
tasks will be pickable in migrate_tasks(). runtime_enabled is
recalculated when the rq becomes online again.

Ben Segall also noticed that we always enable runtime in
tg_set_cfs_bandwidth(). Actually, we should do that for online
cpus only. To fix that, we check whether a cpu is online while
its rq is locked. This guarantees we do not race with
set_rq_offline(), which also requires rq->lock.

v2: Fix race with tg_set_cfs_bandwidth().
Move cfs_rq->runtime_enabled=0 above unthrottle_cfs_rq().
v3: Check for rq->online instead of cpu_online.

Signed-off-by: Kirill Tkhai <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ceea8d0..4c163c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7837,11 +7837,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
struct rq *rq = cfs_rq->rq;

raw_spin_lock_irq(&rq->lock);
- cfs_rq->runtime_enabled = runtime_enabled;
- cfs_rq->runtime_remaining = 0;
+ /*
+ * Do not enable runtime on offline runqueues. We specially
+ * disable it in unthrottle_offline_cfs_rqs().
+ */
+ if (rq->online) {
+ cfs_rq->runtime_enabled = runtime_enabled;
+ cfs_rq->runtime_remaining = 0;

- if (cfs_rq->throttled)
- unthrottle_cfs_rq(cfs_rq);
+ if (cfs_rq->throttled)
+ unthrottle_cfs_rq(cfs_rq);
+ }
raw_spin_unlock_irq(&rq->lock);
}
if (runtime_was_enabled && !runtime_enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f9c457..5616d23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3776,6 +3776,19 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
hrtimer_cancel(&cfs_b->slack_timer);
}

+static void __maybe_unused update_runtime_enabled(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+
+ raw_spin_lock(&cfs_b->lock);
+ cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
+ raw_spin_unlock(&cfs_b->lock);
+ }
+}
+
static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
struct cfs_rq *cfs_rq;
@@ -3789,6 +3802,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
* there's some valid quota amount
*/
cfs_rq->runtime_remaining = 1;
+ /*
+ * Offline rq is schedulable till cpu is completely disabled
+ * in take_cpu_down(), so we prevent new cfs throttling here.
+ */
+ cfs_rq->runtime_enabled = 0;
+
if (cfs_rq_throttled(cfs_rq))
unthrottle_cfs_rq(cfs_rq);
}
@@ -3832,6 +3851,7 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
return NULL;
}
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static inline void update_runtime_enabled(struct rq *rq) {}
static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}

#endif /* CONFIG_CFS_BANDWIDTH */
@@ -7325,6 +7345,8 @@ void trigger_load_balance(struct rq *rq)
static void rq_online_fair(struct rq *rq)
{
update_sysctl();
+
+ update_runtime_enabled(rq);
}

static void rq_offline_fair(struct rq *rq)

2014-06-25 08:03:47

by Kirill Tkhai

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

On Wed, 25/06/2014 at 11:53 +0400, Kirill Tkhai wrote:
> On Tue, 24/06/2014 at 23:26 +0400, Kirill Tkhai wrote:
> > [...]
>
> But we can check for rq->online, can't we? How about this?

Oh, rq->online is present only in the SMP case... OK, let's go with
get_online_cpus()...
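
(The field is declared under CONFIG_SMP in kernel/sched/sched.h,
roughly like this, abridged:)

struct rq {
        /* ... */
#ifdef CONFIG_SMP
        struct root_domain *rd;
        struct sched_domain *sd;
        /* ... */
        int online;
#endif
        /* ... */
};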

> [...]

2014-06-25 16:52:17

by Benjamin Segall

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

Kirill Tkhai <[email protected]> writes:

> On Tue, 24/06/2014 at 23:26 +0400, Kirill Tkhai wrote:
>> [...]
>>
>> Oh, I see. Thanks.
>
> But we can check for rq->online, can't we? How about this?

Yeah, that should work.
>
> sched/fair: Disable runtime_enabled on dying rq
>
> [...]
>
> Signed-off-by: Kirill Tkhai <[email protected]>

Reviewed-by: Ben Segall <[email protected]>

2014-06-25 17:33:15

by Kirill Tkhai

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

On Wed, 25/06/2014 at 09:52 -0700, [email protected] wrote:
> Kirill Tkhai <[email protected]> writes:
>
> > [...]
> >
> > But we can check for rq->online, can't we? How about this?
>
> Yeah, that should work.

We can't rely on it because rq->online is not available in !SMP.
Could you review the message in the [PATCH v3 1/3] thread?


2014-06-25 17:40:52

by Benjamin Segall

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

Kirill Tkhai <[email protected]> writes:

> On Wed, 25/06/2014 at 09:52 -0700, [email protected] wrote:
>> > [...]
>> > But we can check for rq->online, can't we? How about this?
>>
>> Yeah, that should work.
>
> We can't rely on it because rq->online is not available in !SMP.
> Could you review the message in the [PATCH v3 1/3] thread?

I'm not sure what you mean here. The patch that just checks cpu_online
won't work; is there another version you want me to look at?

2014-06-25 17:44:40

by Kirill Tkhai

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

On Wed, 25/06/2014 at 10:40 -0700, [email protected] wrote:
> Kirill Tkhai <[email protected]> writes:
>
> >> > [...]
> >> > But we can check for rq->online, can't we? How about this?
> >>
> >> Yeah, that should work.
> >
> > We can't rely on it because rq->online is not available in !SMP.
> > Could you review the message in the [PATCH v3 1/3] thread?
>
> I'm not sure what you mean here. The patch that just checks cpu_online
> won't work; is there another version you want me to look at?

I mean this one: https://lkml.org/lkml/2014/6/25/123

Have you received it? My email client shows you are properly CCed.

2014-06-26 11:08:22

by Srikar Dronamraju

Subject: Re: [PATCH v2 1/3] sched/fair: Disable runtime_enabled on dying rq

* Kirill Tkhai <[email protected]> [2014-06-24 11:53:52]:

> [...]
>
> Signed-off-by: Kirill Tkhai <[email protected]>

Looks good to me.

Reviewed-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar Dronamraju