2014-06-17 13:24:18

by Kirill Tkhai

Subject: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq


We kill rq->rd at the CPU_DOWN_PREPARE stage:

cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains ->
-> cpu_attach_domain -> rq_attach_root -> set_rq_offline

This unthrottles all throttled cfs_rqs.

But the cpu is still able to call schedule() till

take_cpu_down->__cpu_disable()

is called from stop_machine.

In this case the tasks from the just-unthrottled cfs_rqs are pickable
in the standard scheduler way, and they are picked by the dying cpu.
The cfs_rqs become throttled again, and migrate_tasks() in
migration_call() skips their tasks (one more unthrottle in
migrate_tasks()->CPU_DYING does not happen, because rq->rd is
already NULL).

The patch sets runtime_enabled to zero. This guarantees that runtime
is not accounted, that the cfs_rqs can't spend the
cfs_rq->runtime_remaining = 1 they were given and get throttled again,
and that their tasks stay pickable in migrate_tasks(). runtime_enabled
is recalculated when the rq becomes online again.
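
Clearing the flag is sufficient because the bandwidth accounting fast
path bails out when runtime_enabled is zero, so runtime_remaining is
never decremented again on the dying rq. A simplified sketch (from
memory, not the exact fair.c source) of that guard:

	static __always_inline
	void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
	{
		if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
			return;	/* no accounting => no new throttling */

		__account_cfs_rq_runtime(cfs_rq, delta_exec);
	}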

Signed-off-by: Kirill Tkhai <[email protected]>
CC: Srikar Dronamraju <[email protected]>
CC: Mike Galbraith <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..f120525 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3775,6 +3775,19 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	hrtimer_cancel(&cfs_b->slack_timer);
 }
 
+static void __maybe_unused update_runtime_enabled(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+
+		raw_spin_lock(&cfs_b->lock);
+		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
+		raw_spin_unlock(&cfs_b->lock);
+	}
+}
+
 static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct cfs_rq *cfs_rq;
@@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 		cfs_rq->runtime_remaining = 1;
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
+
+		/*
+		 * Offline rq is schedulable till cpu is completely disabled
+		 * in take_cpu_down(), so we prevent new cfs throttling here.
+		 */
+		cfs_rq->runtime_enabled = 0;
 	}
 }
 
@@ -3831,6 +3850,7 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 	return NULL;
 }
 static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static inline void update_runtime_enabled(struct rq *rq) {}
 static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
 
 #endif /* CONFIG_CFS_BANDWIDTH */
@@ -7325,6 +7345,8 @@ void trigger_load_balance(struct rq *rq)
 static void rq_online_fair(struct rq *rq)
 {
 	update_sysctl();
+
+	update_runtime_enabled(rq);
 }
 
 static void rq_offline_fair(struct rq *rq)



2014-06-23 10:07:32

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq

On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>  		cfs_rq->runtime_remaining = 1;
>  		if (cfs_rq_throttled(cfs_rq))
>  			unthrottle_cfs_rq(cfs_rq);
> +
> +		/*
> +		 * Offline rq is schedulable till cpu is completely disabled
> +		 * in take_cpu_down(), so we prevent new cfs throttling here.
> +		 */
> +		cfs_rq->runtime_enabled = 0;

Does it make sense to clear this before calling unthrottle_cfs_rq()?
Just to make sure they're in the right order..

>  	}
>  }
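
For concreteness, a sketch of the loop body with that ordering
(hypothetical; the patch as posted clears the flag after the
unthrottle):

	for_each_leaf_cfs_rq(rq, cfs_rq) {
		if (!cfs_rq->runtime_enabled)
			continue;

		/* Clear first, so nothing below can re-throttle this cfs_rq. */
		cfs_rq->runtime_enabled = 0;

		cfs_rq->runtime_remaining = 1;
		if (cfs_rq_throttled(cfs_rq))
			unthrottle_cfs_rq(cfs_rq);
	}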

2014-06-23 10:58:31

by Kirill Tkhai

Subject: Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq

On Mon, 23/06/2014 at 12:07 +0200, Peter Zijlstra wrote:
> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
> > @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> >  		cfs_rq->runtime_remaining = 1;
> >  		if (cfs_rq_throttled(cfs_rq))
> >  			unthrottle_cfs_rq(cfs_rq);
> > +
> > +		/*
> > +		 * Offline rq is schedulable till cpu is completely disabled
> > +		 * in take_cpu_down(), so we prevent new cfs throttling here.
> > +		 */
> > +		cfs_rq->runtime_enabled = 0;
>
> Does it make sense to clear this before calling unthrottle_cfs_rq()?
> Just to make sure they're in the right order..

This looks good to me. I'll test and resend.

>
> >  	}
> >  }

2014-06-23 17:29:20

by Benjamin Segall

Subject: Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq

Peter Zijlstra <[email protected]> writes:

> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>  		cfs_rq->runtime_remaining = 1;
>>  		if (cfs_rq_throttled(cfs_rq))
>>  			unthrottle_cfs_rq(cfs_rq);
>> +
>> +		/*
>> +		 * Offline rq is schedulable till cpu is completely disabled
>> +		 * in take_cpu_down(), so we prevent new cfs throttling here.
>> +		 */
>> +		cfs_rq->runtime_enabled = 0;
>
> Does it make sense to clear this before calling unthrottle_cfs_rq()?
> Just to make sure they're in the right order..

I believe that order is irrelevant here - I do not believe that
unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
see any code that will check it at all from unthrottle, although I might
be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
but that should be fine as long as for_each_leaf_cfs_rq is sorted
correctly.

That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
wake another task onto this cpu, which could then enqueue_throttle its
cfs_rq (which previously had no tasks and thus wasn't on
leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
turn runtime_enabled on.
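
(For reference, a from-memory sketch of the window in question:
migrate_tasks() in kernel/sched/core.c has to drop rq->lock around
each migration, and that is where a remote wakeup could slip in if
the cpu were still reachable.)

	for (;;) {
		if (rq->nr_running == 1)
			break;	/* only the current (stopper) thread left */

		next = pick_next_task(rq, &fake_task);
		next->sched_class->put_prev_task(rq, next);

		dest_cpu = select_fallback_rq(dead_cpu, next);
		raw_spin_unlock(&rq->lock);	/* <-- the window */

		__migrate_task(next, dead_cpu, dest_cpu);

		raw_spin_lock(&rq->lock);
	}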

I think the general idea of turning runtime_enabled off during offline
is fine, ccing pjt to double check.

2014-06-23 20:49:50

by Kirill Tkhai

Subject: Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq

On 23.06.2014 21:29, [email protected] wrote:
> Peter Zijlstra <[email protected]> writes:
>
>> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>>  		cfs_rq->runtime_remaining = 1;
>>>  		if (cfs_rq_throttled(cfs_rq))
>>>  			unthrottle_cfs_rq(cfs_rq);
>>> +
>>> +		/*
>>> +		 * Offline rq is schedulable till cpu is completely disabled
>>> +		 * in take_cpu_down(), so we prevent new cfs throttling here.
>>> +		 */
>>> +		cfs_rq->runtime_enabled = 0;
>>
>> Does it make sense to clear this before calling unthrottle_cfs_rq()?
>> Just to make sure they're in the right order..
>
> I believe that order is irrelevant here - I do not believe that
> unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
> see any code that will check it at all from unthrottle, although I might
> be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
> but that should be fine as long as for_each_leaf_cfs_rq is sorted
> correctly.

I think this is correct. We may still change the order in the hope
that somebody who works on this code in the future won't be confused
by this opaque place. OK, I don't know which way is better :)

> That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
> wake another task onto this cpu, which could then enqueue_throttle its
> cfs_rq (which previously had no tasks and thus wasn't on
> leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
> turn runtime_enabled on.

We mark the cpu inactive at the CPU_DOWN_PREPARE stage (in
sched_cpu_inactive()). No other cpu is able to wake a task onto it
after that.

The rq is marked offline in cpuset_cpu_inactive() (during the same
stage), but the priority of sched_cpu_inactive() is higher than the
priority of cpuset_cpu_inactive():

CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,
CPU_PRI_CPUSET_INACTIVE = INT_MIN,

This guarantees that nobody can use the dying cpu even before we
start unthrottling: another cpu will see that it is inactive.

So, it looks like we are free of this problem.
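
A rough sketch (from memory, not verbatim kernel/sched/core.c) of the
notifier that provides this guarantee; all it has to do at
CPU_DOWN_PREPARE is drop the cpu from the active mask:

	static int sched_cpu_inactive(struct notifier_block *nfb,
				      unsigned long action, void *hcpu)
	{
		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_DOWN_PREPARE:
			/* Runs before cpuset_cpu_inactive() kills rq->rd. */
			set_cpu_active((long)hcpu, false);
			return NOTIFY_OK;
		default:
			return NOTIFY_DONE;
		}
	}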

> I think the general idea of turning runtime_enabled off during offline
> is fine, ccing pjt to double check.

Thanks,
Kirill

2014-06-23 21:05:29

by Benjamin Segall

Subject: Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq

Kirill Tkhai <[email protected]> writes:

> On 23.06.2014 21:29, [email protected] wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>>> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>>>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>>>  		cfs_rq->runtime_remaining = 1;
>>>>  		if (cfs_rq_throttled(cfs_rq))
>>>>  			unthrottle_cfs_rq(cfs_rq);
>>>> +
>>>> +		/*
>>>> +		 * Offline rq is schedulable till cpu is completely disabled
>>>> +		 * in take_cpu_down(), so we prevent new cfs throttling here.
>>>> +		 */
>>>> +		cfs_rq->runtime_enabled = 0;
>>>
>>> Does it make sense to clear this before calling unthrottle_cfs_rq()?
>>> Just to make sure they're in the right order..
>>
>> I believe that order is irrelevant here - I do not believe that
>> unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
>> see any code that will check it at all from unthrottle, although I might
>> be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
>> but that should be fine as long as for_each_leaf_cfs_rq is sorted
>> correctly.
>
> I think this is correct. We may still change the order in the hope
> that somebody who works on this code in the future won't be confused
> by this opaque place. OK, I don't know which way is better :)
>
>> That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
>> wake another task onto this cpu, which could then enqueue_throttle its
>> cfs_rq (which previously had no tasks and thus wasn't on
>> leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
>> turn runtime_enabled on.
>
> We mark the cpu inactive at the CPU_DOWN_PREPARE stage (in
> sched_cpu_inactive()). No other cpu is able to wake a task onto it
> after that.
>
> The rq is marked offline in cpuset_cpu_inactive() (during the same
> stage), but the priority of sched_cpu_inactive() is higher than the
> priority of cpuset_cpu_inactive():
>
> CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,
> CPU_PRI_CPUSET_INACTIVE = INT_MIN,
>
> This guarantees that nobody can use the dying cpu even before we
> start unthrottling: another cpu will see that it is inactive.
>
> So, it looks like we are free of this problem.

Ah, ok, I haven't looked that hard at hotplug, and wasn't sure of the
ordering there. We still have the tg_set_cfs_bandwidth issue, because
that uses for_each_possible_cpu. However, with the addition of
rq_online_fair, that can probably be changed to for_each_active_cpu, and
then I think we would be fine.
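
For reference, a from-memory sketch of the per-cpu loop in
tg_set_cfs_bandwidth() with the change suggested above. There is no
for_each_active_cpu() helper, so it is spelled via cpu_active_mask
here; whether skipping inactive cpus is safe is exactly the question:

	for_each_cpu(i, cpu_active_mask) {	/* was: for_each_possible_cpu(i) */
		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
		struct rq *rq = cfs_rq->rq;

		raw_spin_lock_irq(&rq->lock);
		cfs_rq->runtime_enabled = runtime_enabled;
		cfs_rq->runtime_remaining = 0;

		if (cfs_rq->throttled)
			unthrottle_cfs_rq(cfs_rq);
		raw_spin_unlock_irq(&rq->lock);
	}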

2014-06-23 21:15:13

by Kirill Tkhai

Subject: Re: [PATCH 1/3] sched/fair: Disable runtime_enabled on dying rq

On 24.06.2014 01:05, [email protected] wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 23.06.2014 21:29, [email protected] wrote:
>>> Peter Zijlstra <[email protected]> writes:
>>>
>>>> On Tue, Jun 17, 2014 at 05:24:10PM +0400, Kirill Tkhai wrote:
>>>>> @@ -3790,6 +3803,12 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>>>>>  		cfs_rq->runtime_remaining = 1;
>>>>>  		if (cfs_rq_throttled(cfs_rq))
>>>>>  			unthrottle_cfs_rq(cfs_rq);
>>>>> +
>>>>> +		/*
>>>>> +		 * Offline rq is schedulable till cpu is completely disabled
>>>>> +		 * in take_cpu_down(), so we prevent new cfs throttling here.
>>>>> +		 */
>>>>> +		cfs_rq->runtime_enabled = 0;
>>>>
>>>> Does it make sense to clear this before calling unthrottle_cfs_rq()?
>>>> Just to make sure they're in the right order..
>>>
>>> I believe that order is irrelevant here - I do not believe that
>>> unthrottle_cfs_rq(a) can cause a throttle_cfs_rq(a). In fact, I don't
>>> see any code that will check it at all from unthrottle, although I might
>>> be missing something. It _can_ cause a throttle_cfs_rq(parent_cfs_rq(a)),
>>> but that should be fine as long as for_each_leaf_cfs_rq is sorted
>>> correctly.
>>
>> I think this is correct. We may still change the order in the hope
>> that somebody who works on this code in the future won't be confused
>> by this opaque place. OK, I don't know which way is better :)
>>
>>> That said, migrate_tasks drops rq->lock, and I /think/ another cpu could
>>> wake another task onto this cpu, which could then enqueue_throttle its
>>> cfs_rq (which previously had no tasks and thus wasn't on
>>> leaf_cfs_rq_list). You certainly could have tg_set_bandwidth come in and
>>> turn runtime_enabled on.
>>
>> We mark the cpu inactive at the CPU_DOWN_PREPARE stage (in
>> sched_cpu_inactive()). No other cpu is able to wake a task onto it
>> after that.
>>
>> The rq is marked offline in cpuset_cpu_inactive() (during the same
>> stage), but the priority of sched_cpu_inactive() is higher than the
>> priority of cpuset_cpu_inactive():
>>
>> CPU_PRI_SCHED_INACTIVE = INT_MIN + 1,
>> CPU_PRI_CPUSET_INACTIVE = INT_MIN,
>>
>> This guarantees that nobody can use the dying cpu even before we
>> start unthrottling: another cpu will see that it is inactive.
>>
>> So, it looks like we are free of this problem.
>
> Ah, ok, I haven't looked that hard at hotplug, and wasn't sure of the
> ordering there. We still have the tg_set_cfs_bandwidth issue, because
> that uses for_each_possible_cpu. However, with the addition of
> rq_online_fair, that can probably be changed to for_each_active_cpu, and
> then I think we would be fine.

Ok, now I see. Thanks, Ben.