I observe that a dl task can't be migrated to other cpus during cpu hotplug; in
addition, the task may or may not run again if the cpu is added back. The root
cause I found is that the dl task is throttled and removed from the dl rq after
consuming all of its budget, which means the stop task can't pick it up from the
dl rq and migrate it to other cpus during hotplug.
The method to reproduce:
schedtool -E -t 50000:100000 -e ./test
The test program is actually just a simple for loop. Then observe which cpu the
test task is on.
echo 0 > /sys/devices/system/cpu/cpuN/online
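For reference, the test binary above could be as simple as the following sketch
(hypothetical source; the actual program is not part of this thread):

/*
 * Busy loop so the SCHED_DEADLINE task keeps consuming its runtime
 * and gets throttled by the bandwidth enforcement timer.
 */
int main(void)
{
	volatile unsigned long counter = 0;

	for (;;)
		counter++;

	return 0;
}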
This patch fixes it by pushing the task to another cpu in dl_task_timer() if
the rq is offline.
Note: with this patch the dl task can now be migrated successfully if its rq is
offline; however, I'm still not sure why task_rq(task)->rd->span only includes
the cpu the dl task was previously running on, so cpu_active_mask is used in
the patch.
Peterz, Juri?
Signed-off-by: Wanpeng Li <[email protected]>
---
v1 -> v2:
* push the task to another cpu in dl_task_timer() if rq is offline.
kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++---
1 file changed, 36 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 04c2cbb..233e482 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -487,6 +487,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
return hrtimer_active(&dl_se->dl_timer);
}
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
/*
* This is the bandwidth enforcement timer callback. If here, we know
* a task is not on its dl_rq, since the fact that the timer was running
@@ -538,6 +539,39 @@ again:
update_rq_clock(rq);
dl_se->dl_throttled = 0;
dl_se->dl_yielded = 0;
+
+ /*
+ * So if we find that the rq the task was on is no longer
+ * available, we need to select a new rq.
+ */
+ if (!rq->online) {
+ struct rq *later_rq = NULL;
+
+ /* We will release rq lock */
+ get_task_struct(p);
+
+ raw_spin_unlock(&rq->lock);
+
+ later_rq = find_lock_later_rq(p, rq);
+
+ if (!later_rq) {
+ put_task_struct(p);
+ goto out;
+ }
+
+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, later_rq->cpu);
+ activate_task(later_rq, p, 0);
+
+ resched_curr(later_rq);
+
+ double_unlock_balance(rq, later_rq);
+
+ put_task_struct(p);
+
+ goto out;
+ }
+
if (task_on_rq_queued(p)) {
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (dl_task(rq->curr))
@@ -555,7 +589,7 @@ again:
}
unlock:
raw_spin_unlock(&rq->lock);
-
+out:
return HRTIMER_NORESTART;
}
@@ -1182,8 +1216,7 @@ static int find_later_rq(struct task_struct *task)
* We have to consider system topology and task affinity
* first, then we can look for a suitable cpu.
*/
- cpumask_copy(later_mask, task_rq(task)->rd->span);
- cpumask_and(later_mask, later_mask, cpu_active_mask);
+ cpumask_copy(later_mask, cpu_active_mask);
cpumask_and(later_mask, later_mask, &task->cpus_allowed);
best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
task, later_mask);
--
1.9.1
Hi,
On 05/11/14 08:51, Wanpeng Li wrote:
> I observe that a dl task can't be migrated to other cpus during cpu hotplug; in
> addition, the task may or may not run again if the cpu is added back. The root
> cause I found is that the dl task is throttled and removed from the dl rq after
> consuming all of its budget, which means the stop task can't pick it up from the
> dl rq and migrate it to other cpus during hotplug.
>
> The method to reproduce:
> schedtool -E -t 50000:100000 -e ./test
> The test program is actually just a simple for loop. Then observe which cpu
> the test task is on.
> echo 0 > /sys/devices/system/cpu/cpuN/online
>
> This patch fixes it by pushing the task to another cpu in dl_task_timer() if
> the rq is offline.
>
> Note: with this patch the dl task can now be migrated successfully if its rq is
> offline; however, I'm still not sure why task_rq(task)->rd->span only includes
> the cpu the dl task was previously running on, so cpu_active_mask is used in
> the patch.
>
> Peterz, Juri?
>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> v1 -> v2:
> * push the task to another cpu in dl_task_timer() if rq is offline.
>
> kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 04c2cbb..233e482 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -487,6 +487,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
> return hrtimer_active(&dl_se->dl_timer);
> }
>
> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
> /*
> * This is the bandwidth enforcement timer callback. If here, we know
> * a task is not on its dl_rq, since the fact that the timer was running
> @@ -538,6 +539,39 @@ again:
> update_rq_clock(rq);
> dl_se->dl_throttled = 0;
> dl_se->dl_yielded = 0;
> +
> + /*
> + * So if we find that the rq the task was on is no longer
> + * available, we need to select a new rq.
> + */
> + if (!rq->online) {
> + struct rq *later_rq = NULL;
> +
> + /* We will release rq lock */
> + get_task_struct(p);
> +
> + raw_spin_unlock(&rq->lock);
> +
> + later_rq = find_lock_later_rq(p, rq);
> +
> + if (!later_rq) {
> + put_task_struct(p);
> + goto out;
> + }
> +
> + deactivate_task(rq, p, 0);
> + set_task_cpu(p, later_rq->cpu);
> + activate_task(later_rq, p, 0);
> +
> + resched_curr(later_rq);
> +
> + double_unlock_balance(rq, later_rq);
> +
> + put_task_struct(p);
> +
> + goto out;
> + }
> +
> if (task_on_rq_queued(p)) {
> enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> if (dl_task(rq->curr))
> @@ -555,7 +589,7 @@ again:
> }
> unlock:
> raw_spin_unlock(&rq->lock);
> -
> +out:
> return HRTIMER_NORESTART;
> }
>
> @@ -1182,8 +1216,7 @@ static int find_later_rq(struct task_struct *task)
> * We have to consider system topology and task affinity
> * first, then we can look for a suitable cpu.
> */
> - cpumask_copy(later_mask, task_rq(task)->rd->span);
> - cpumask_and(later_mask, later_mask, cpu_active_mask);
> + cpumask_copy(later_mask, cpu_active_mask);
I fear this breaks what I lately fixed in commit 91ec6778ec4f
("sched/deadline: Fix inter- exclusive cpusets migrations"), as
we first have to consider exclusive cpusets topology in looking
for a cpu. But, I'd have to test this to see if I'm right, and
I'll try to do it soon.
Thanks,
- Juri
> cpumask_and(later_mask, later_mask, &task->cpus_allowed);
> best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
> task, later_mask);
>
Hi Juri,
On 14/11/5 6:08 PM, Juri Lelli wrote:
> Hi,
>
> On 05/11/14 08:51, Wanpeng Li wrote:
>> I observe that a dl task can't be migrated to other cpus during cpu hotplug; in
>> addition, the task may or may not run again if the cpu is added back. The root
>> cause I found is that the dl task is throttled and removed from the dl rq after
>> consuming all of its budget, which means the stop task can't pick it up from the
>> dl rq and migrate it to other cpus during hotplug.
>>
>> The method to reproduce:
>> schedtool -E -t 50000:100000 -e ./test
>> The test program is actually just a simple for loop. Then observe which cpu
>> the test task is on.
>> echo 0 > /sys/devices/system/cpu/cpuN/online
>>
>> This patch fixes it by pushing the task to another cpu in dl_task_timer() if
>> the rq is offline.
>>
>> Note: with this patch the dl task can now be migrated successfully if its rq is
>> offline; however, I'm still not sure why task_rq(task)->rd->span only includes
>> the cpu the dl task was previously running on, so cpu_active_mask is used in
>> the patch.
>>
>> Peterz, Juri?
>>
>> Signed-off-by: Wanpeng Li <[email protected]>
>> ---
>> v1 -> v2:
>> * push the task to another cpu in dl_task_timer() if rq is offline.
>>
>> kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 36 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 04c2cbb..233e482 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -487,6 +487,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
>> return hrtimer_active(&dl_se->dl_timer);
>> }
>>
>> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
>> /*
>> * This is the bandwidth enforcement timer callback. If here, we know
>> * a task is not on its dl_rq, since the fact that the timer was running
>> @@ -538,6 +539,39 @@ again:
>> update_rq_clock(rq);
>> dl_se->dl_throttled = 0;
>> dl_se->dl_yielded = 0;
>> +
>> + /*
>> + * So if we find that the rq the task was on is no longer
>> + * available, we need to select a new rq.
>> + */
>> + if (!rq->online) {
>> + struct rq *later_rq = NULL;
>> +
>> + /* We will release rq lock */
>> + get_task_struct(p);
>> +
>> + raw_spin_unlock(&rq->lock);
>> +
>> + later_rq = find_lock_later_rq(p, rq);
>> +
>> + if (!later_rq) {
>> + put_task_struct(p);
>> + goto out;
>> + }
>> +
>> + deactivate_task(rq, p, 0);
>> + set_task_cpu(p, later_rq->cpu);
>> + activate_task(later_rq, p, 0);
>> +
>> + resched_curr(later_rq);
>> +
>> + double_unlock_balance(rq, later_rq);
>> +
>> + put_task_struct(p);
>> +
>> + goto out;
>> + }
>> +
>> if (task_on_rq_queued(p)) {
>> enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
>> if (dl_task(rq->curr))
>> @@ -555,7 +589,7 @@ again:
>> }
>> unlock:
>> raw_spin_unlock(&rq->lock);
>> -
>> +out:
>> return HRTIMER_NORESTART;
>> }
>>
>> @@ -1182,8 +1216,7 @@ static int find_later_rq(struct task_struct *task)
>> * We have to consider system topology and task affinity
>> * first, then we can look for a suitable cpu.
>> */
>> - cpumask_copy(later_mask, task_rq(task)->rd->span);
>> - cpumask_and(later_mask, later_mask, cpu_active_mask);
>> + cpumask_copy(later_mask, cpu_active_mask);
> I fear this breaks what I lately fixed in commit 91ec6778ec4f
> ("sched/deadline: Fix inter- exclusive cpusets migrations"), as
As I mentioned in the patch description:
Note: with this patch the dl task can now be migrated successfully if its rq is
offline; however, I'm still not sure why task_rq(task)->rd->span only includes
the cpu the dl task was previously running on, so cpu_active_mask is used in
the patch.
Any explanation after your test would be greatly appreciated. ;-)
> we first have to consider exclusive cpusets topology in looking
> for a cpu. But, I'd have to test this to see if I'm right, and
> I'll try to do it soon.
Thanks for your help. ;-)
Regards,
Wanpeng Li
>
> Thanks,
>
> - Juri
>
>> cpumask_and(later_mask, later_mask, &task->cpus_allowed);
>> best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
>> task, later_mask);
>>
On Wed, Nov 05, 2014 at 04:51:57PM +0800, Wanpeng Li wrote:
> Note: with this patch the dl task can now be migrated successfully if its rq is
> offline; however, I'm still not sure why task_rq(task)->rd->span only includes
> the cpu the dl task was previously running on, so cpu_active_mask is used in
> the patch.
>
> Peterz, Juri?
So the root domain span is for exclusive cpusets
(Documentation/cgroups/cpusets.txt) where we fully separate sets of CPUs
and have no load-balancing between the sets.
For 'normal' setups the rd->span is the entire machine, but using
cpusets you can create more (smaller) domains.
Is this what you were asking?
Hi Peter,
On 14/11/5 6:50 PM, Peter Zijlstra wrote:
> On Wed, Nov 05, 2014 at 04:51:57PM +0800, Wanpeng Li wrote:
>> Note: with this patch the dl task can now be migrated successfully if its rq is
>> offline; however, I'm still not sure why task_rq(task)->rd->span only includes
>> the cpu the dl task was previously running on, so cpu_active_mask is used in
>> the patch.
>>
>> Peterz, Juri?
> So the root domain span is for exclusive cpusets
> (Documentation/cgroups/cpusets.txt) where we fully separate sets of CPUs
> and have no load-balancing between the sets.
>
> For 'normal' setups the rd->span is the entire machine, but using
> cpusets you can create more (smaller) domains.
I don't set up any cpusets-related stuff; however, task_rq(task)->rd->span
only includes the cpu the dl task was previously running on, instead of the
entire machine, in find_later_rq().
Regards,
Wanpeng Li
>
> Is this what you were asking?
On Wed, Nov 05, 2014 at 06:59:35PM +0800, Wanpeng Li wrote:
> Hi Peter,
> >On 14/11/5 6:50 PM, Peter Zijlstra wrote:
> >On Wed, Nov 05, 2014 at 04:51:57PM +0800, Wanpeng Li wrote:
> >>Note: with this patch the dl task can now be migrated successfully if its rq
> >>is offline; however, I'm still not sure why task_rq(task)->rd->span only
> >>includes the cpu the dl task was previously running on, so cpu_active_mask
> >>is used in the patch.
> >>
> >>Peterz, Juri?
> >So the root domain span is for exclusive cpusets
> >(Documentation/cgroups/cpusets.txt) where we fully separate sets of CPUs
> >and have no load-balancing between the sets.
> >
> >For 'normal' setups the rd->span is the entire machine, but using
> >cpusets you can create more (smaller) domains.
>
> I don't set up any cpusets-related stuff; however, task_rq(task)->rd->span
> only includes the cpu the dl task was previously running on, instead of the
> entire machine, in find_later_rq().
Ah, it could be that for offline cpus we have a singleton rd. Lemme try
and remember wth we did there.
Hi, all,
On Wed, 05/11/2014 at 16:51 +0800, Wanpeng Li wrote:
> I observe that a dl task can't be migrated to other cpus during cpu hotplug; in
> addition, the task may or may not run again if the cpu is added back. The root
> cause I found is that the dl task is throttled and removed from the dl rq after
> consuming all of its budget, which means the stop task can't pick it up from the
> dl rq and migrate it to other cpus during hotplug.
>
> The method to reproduce:
> schedtool -E -t 50000:100000 -e ./test
> The test program is actually just a simple for loop. Then observe which cpu
> the test task is on.
> echo 0 > /sys/devices/system/cpu/cpuN/online
>
> This patch fixes it by pushing the task to another cpu in dl_task_timer() if
> the rq is offline.
>
> Note: with this patch the dl task can now be migrated successfully if its rq is
> offline; however, I'm still not sure why task_rq(task)->rd->span only includes
> the cpu the dl task was previously running on, so cpu_active_mask is used in
> the patch.
>
> Peterz, Juri?
>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> v1 -> v2:
> * push the task to another cpu in dl_task_timer() if rq is offline.
>
> kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 04c2cbb..233e482 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -487,6 +487,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
> return hrtimer_active(&dl_se->dl_timer);
> }
>
> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
> /*
> * This is the bandwidth enforcement timer callback. If here, we know
> * a task is not on its dl_rq, since the fact that the timer was running
> @@ -538,6 +539,39 @@ again:
> update_rq_clock(rq);
> dl_se->dl_throttled = 0;
> dl_se->dl_yielded = 0;
> +
> + /*
> + * So if we find that the rq the task was on is no longer
> + * available, we need to select a new rq.
> + */
> + if (!rq->online) {
> + struct rq *later_rq = NULL;
> +
> + /* We will release rq lock */
> + get_task_struct(p);
> +
> + raw_spin_unlock(&rq->lock);
> +
> + later_rq = find_lock_later_rq(p, rq);
> +
> + if (!later_rq) {
> + put_task_struct(p);
> + goto out;
> + }
> +
> + deactivate_task(rq, p, 0);
> + set_task_cpu(p, later_rq->cpu);
> + activate_task(later_rq, p, 0);
> +
> + resched_curr(later_rq);
> +
> + double_unlock_balance(rq, later_rq);
> +
> + put_task_struct(p);
> +
> + goto out;
> + }
> +
> if (task_on_rq_queued(p)) {
> enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> if (dl_task(rq->curr))
> @@ -555,7 +589,7 @@ again:
> }
> unlock:
> raw_spin_unlock(&rq->lock);
> -
> +out:
> return HRTIMER_NORESTART;
> }
>
> @@ -1182,8 +1216,7 @@ static int find_later_rq(struct task_struct *task)
> * We have to consider system topology and task affinity
> * first, then we can look for a suitable cpu.
> */
> - cpumask_copy(later_mask, task_rq(task)->rd->span);
> - cpumask_and(later_mask, later_mask, cpu_active_mask);
> + cpumask_copy(later_mask, cpu_active_mask);
> cpumask_and(later_mask, later_mask, &task->cpus_allowed);
> best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
> task, later_mask);
Isn't this too complicated?
Can't we simply queue throttled tasks in rq_offline_dl() (without clearing
their dl_throttled status)? migrate_tasks() will then do the migration
correctly.
In that case, the patch from this thread could simply be transformed into:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f3d7776..5059aa8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -531,7 +531,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* in the runqueue or is going to be enqueued back anyway.
*/
if (!dl_task(p) || dl_se->dl_new ||
- dl_se->dl_boosted || !dl_se->dl_throttled)
+ dl_se->dl_boosted || !dl_se->dl_throttled || on_dl_rq(dl_se))
goto unlock;
sched_clock_tick();
On Wed, Nov 05, 2014 at 04:14:00PM +0300, Kirill Tkhai wrote:
> > @@ -538,6 +539,39 @@ again:
> > update_rq_clock(rq);
> > dl_se->dl_throttled = 0;
> > dl_se->dl_yielded = 0;
> > +
> > + /*
> > + * So if we find that the rq the task was on is no longer
> > + * available, we need to select a new rq.
> > + */
> > + if (!rq->online) {
> > + struct rq *later_rq = NULL;
> > +
> > + /* We will release rq lock */
> > + get_task_struct(p);
No need for this; due to task_dead_dl() -> hrtimer_cancel(), this task
cannot go away while the timer callback is running.
> > + raw_spin_unlock(&rq->lock);
> > +
> > + later_rq = find_lock_later_rq(p, rq);
> > +
> > + if (!later_rq) {
> > + put_task_struct(p);
> > + goto out;
> > + }
This is wrong, I think: we _must_ migrate the task; if we let it reside
on this offline rq it will never come back to us.
find_lock_later_rq() will fail for tasks that aren't currently eligible
to run. You could either try and change/parameterize it to return the
latest rq in that case, or just punt and pick any online cpu.
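For illustration, a minimal sketch of the second option (the helper name and
its placement in kernel/sched/deadline.c are hypothetical, not part of the
posted patch):

/*
 * Hypothetical fallback sketch: when find_lock_later_rq() fails, pick
 * any CPU that is both active and allowed by the task's affinity mask,
 * instead of leaving the task on the offline rq.
 */
static int dl_fallback_cpu(struct task_struct *p)
{
	int cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);

	/* Returns -1 if no active CPU intersects the affinity mask. */
	return cpu < nr_cpu_ids ? cpu : -1;
}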
> Isn't this too complicated?
>
> Can't we simply queue throttled tasks in rq_offline_dl() (without clearing
> their dl_throttled status)? migrate_tasks() will then do the migration correctly.
We can't find these tasks; we'd have to add extra lists etc. And it
seems consistent with the normal ttwu thing, which migrates tasks when
they wake up.
Hi Peter,
On Wed, Nov 05, 2014 at 01:52:54PM +0100, Peter Zijlstra wrote:
>On Wed, Nov 05, 2014 at 06:59:35PM +0800, Wanpeng Li wrote:
>> Hi Peter,
>> On 14/11/5 6:50 PM, Peter Zijlstra wrote:
>> >On Wed, Nov 05, 2014 at 04:51:57PM +0800, Wanpeng Li wrote:
>> >>Note: with this patch the dl task can now be migrated successfully if its
>> >>rq is offline; however, I'm still not sure why task_rq(task)->rd->span only
>> >>includes the cpu the dl task was previously running on, so cpu_active_mask
>> >>is used in the patch.
>> >>
>> >>Peterz, Juri?
>> >So the root domain span is for exclusive cpusets
>> >(Documentation/cgroups/cpusets.txt) where we fully separate sets of CPUs
>> >and have no load-balancing between the sets.
>> >
>> >For 'normal' setups the rd->span is the entire machine, but using
>> >cpusets you can create more (smaller) domains.
>>
>> I don't set up any cpusets-related stuff; however, task_rq(task)->rd->span
>> only includes the cpu the dl task was previously running on, instead of the
>> entire machine, in find_later_rq().
>
>Ah, it could be that for offline cpus we have a singleton rd. Lemme try
I still cannot find where the singleton rd is built in the code; could you
point it out?
Regards,
Wanpeng Li
>and remember wth we did there.
On Thu, Nov 06, 2014 at 09:46:34AM +0800, Wanpeng Li wrote:
> >Ah, it could be that for offline cpus we have a singleton rd. Lemme try
>
> I still cannot find where the singleton rd is built in the code; could you
> point it out?
So this is all quite horrible code, but what I think happens is that:
sched_cpu_inactive() -> set_cpu_active(cpu, false);
cpuset_cpu_inactive() -> cpuset_update_active_cpus(false)
-> partition_sched_domains(1, NULL, NULL)
-> build_sched_domains(cpu_active_mask)
-> for_each_cpu()
cpu_attach_domain() -> rq_attach_root()
Now, that will detach all active cpus from the current root domain and
attach them to the new root domain. Which leaves behind the old root
domain attached to only the one 'dead' cpu.
On 14/11/6 6:08 PM, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 09:46:34AM +0800, Wanpeng Li wrote:
>
>>> Ah, it could be that for offline cpus we have a singleton rd. Lemme try
>> I still cannot find where the singleton rd is built in the code; could you
>> point it out?
> So this is all quite horrible code, but what I think happens is that:
>
> sched_cpu_inactive() -> set_cpu_active(cpu, false);
> cpuset_cpu_inactive() -> cpuset_update_active_cpus(false)
> -> partition_sched_domains(1, NULL, NULL)
> -> build_sched_domains(cpu_active_mask)
> -> for_each_cpu()
> cpu_attach_domain() -> rq_attach_root()
>
> Now, that will detach all active cpus from the current root domain and
> attach them to the new root domain. Which leaves behind the old root
> domain attached to only the one 'dead' cpu.
Got it, thanks for your great explanation. ;-)
Regards,
Wanpeng Li