Date: Mon, 3 Nov 2014 10:03:27 +0800
From: Wanpeng Li <wanpeng.li@linux.intel.com>
To: Kirill Tkhai <ktkhai@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>,
        Juri Lelli <juri.lelli@arm.com>, linux-kernel@vger.kernel.org,
        Wanpeng Li <wanpeng.li@linux.intel.com>
Subject: Re: [PATCH RFC] sched/deadline: support dl task migrate during cpu
 hotplug
Message-ID: <20141103020327.GA2849@kernel>
Reply-To: Wanpeng Li <wanpeng.li@linux.intel.com>
References: <1414740497-7232-1-git-send-email-wanpeng.li@linux.intel.com>
 <1414747229.8574.99.camel@tkhai>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1414747229.8574.99.camel@tkhai>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org

Hi Kirill,
On Fri, Oct 31, 2014 at 12:20:29PM +0300, Kirill Tkhai wrote:
>В Пт, 31/10/2014 в 15:28 +0800, Wanpeng Li пишет:
>> Hi all,
>> 
>> I observe that dl task can't be migrated to other cpus during cpu hotplug, in 
>> addition, task may/may not be running again if cpu is added back. The root cause 
>> which I found is that dl task will be throtted and removed from dl rq after 
>> comsuming all budget, which leads to stop task can't pick it up from dl rq and 
>> migrate to other cpus during hotplug. 
>> 
>> So I try two methods.
>> 
>> - add throttled dl sched_entity to a throttled_list, the list will be traversed
>>   during cpu hotplug, and the dl sched_entity will be picked and enqueue, then 
>>   stop task will pick and migrate it. However, dl sched_entity is throttled again 
>>   before stop task running since the below path. This path will set rq->online 0 
>>   which lead to set_rq_offline() won't be called in function migration_call().
>> 
>> Call Trace:
>>  [...] rq_offline_dl+0x44/0x66
>>  [...] set_rq_offline+0x29/0x54
>>  [...] rq_attach_root+0x3f/0xb7
>>  [...] cpu_attach_domain+0x1c7/0x354
>>  [...] build_sched_domains+0x295/0x304
>>  [...] partition_sched_domains+0x26a/0x2e6
>>  [...] ? emulator_write_gpr+0x27/0x27 [kvm]
>>  [...] cpuset_update_active_cpus+0x12/0x2c
>>  [...] cpuset_cpu_inactive+0x1b/0x38
>>  [...] notifier_call_chain+0x32/0x5e
>>  [...] __raw_notifier_call_chain+0x9/0xb
>>  [...] __cpu_notify+0x1b/0x2d
>>  [...] _cpu_down+0x81/0x22a
>>  [...] cpu_down+0x28/0x35
>>  [...] cpu_subsys_offline+0xf/0x11
>>  [...] device_offline+0x78/0xa8
>>  [...] online_store+0x48/0x69
>>  [...] ? kernfs_fop_write+0x61/0x129
>>  [...] dev_attr_store+0x1b/0x1d
>>  [...] sysfs_kf_write+0x37/0x39
>>  [...] kernfs_fop_write+0xe9/0x129
>>  [...] vfs_write+0xc6/0x19e
>>  [...] SyS_write+0x4b/0x8f
>>  [...] system_call_fastpath+0x16/0x1b
>> 
>> 
>> - The difference of the method two is that dl sched_entity won't be throtted
>>   if rq is offline, the dl sched_entity will be replenished in update_curr_dl().
>>   However, the echo 0 > /sys/devices/system/cpu/cpuN/online hung.
>> 
>> Juri, your proposal is a great welcome. ;-)
>> 
>> Note: This patch is just a proposal and still can't successfully migrate 
>> dl task during cpu hotplug.
>> 
>> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>> ---
>>  include/linux/sched.h   |  2 ++
>>  kernel/sched/deadline.c | 22 +++++++++++++++++++++-
>>  kernel/sched/sched.h    |  3 +++
>>  3 files changed, 26 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 4400ddc..bd71f19 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1253,6 +1253,8 @@ struct sched_dl_entity {
>>  	 * own bandwidth to be enforced, thus we need one timer per task.
>>  	 */
>>  	struct hrtimer dl_timer;
>> +	struct list_head throttled_node;
>> +	int on_list;
>
>Get rig of on_list. It's better to check for list_empty(&dl->throttled_node)
>instead. Of course, you should change list_del() on list_del_init() for this.

Agreed.

>
>>  };
>>  
>>  union rcu_special {
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 2e31a30..d6d6b71 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -80,6 +80,7 @@ void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
>>  	dl_rq->dl_nr_migratory = 0;
>>  	dl_rq->overloaded = 0;
>>  	dl_rq->pushable_dl_tasks_root = RB_ROOT;
>> +	INIT_LIST_HEAD(&dl_rq->throttled_list);
>>  #else
>>  	init_dl_bw(&dl_rq->dl_bw);
>>  #endif
>> @@ -538,6 +539,10 @@ again:
>>  	update_rq_clock(rq);
>>  	dl_se->dl_throttled = 0;
>>  	dl_se->dl_yielded = 0;
>> +	if (dl_se->on_list) {
>> +		list_del(&dl_se->throttled_node);
>> +		dl_se->on_list = 0;
>> +	}
>>  	if (task_on_rq_queued(p)) {
>>  		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
>>  		if (dl_task(rq->curr))
>> @@ -636,8 +641,12 @@ static void update_curr_dl(struct rq *rq)
>>  	dl_se->runtime -= delta_exec;
>>  	if (dl_runtime_exceeded(rq, dl_se)) {
>>  		__dequeue_task_dl(rq, curr, 0);
>> -		if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
>> +		if (rq->online && likely(start_dl_timer(dl_se, curr->dl.dl_boosted))) {
>
>Why is this check for rq->online necessary?

I will remove it.

>
>>  			dl_se->dl_throttled = 1;
>> +			dl_se->on_list = 1;
>> +			list_add(&dl_se->throttled_node,
>> +					&rq->dl.throttled_list);
>
>Alignment is wrong.

Agreed.

>
>> +		}
>>  		else
>>  			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
>>  
>> @@ -1593,9 +1602,20 @@ static void rq_online_dl(struct rq *rq)
>>  /* Assumes rq->lock is held */
>>  static void rq_offline_dl(struct rq *rq)
>>  {
>> +	struct task_struct *p, *n;
>> +
>>  	if (rq->dl.overloaded)
>>  		dl_clear_overload(rq);
>>  
>> +	/* Make sched_dl_entity available for pick_next_task() */
>> +	list_for_each_entry_safe(p, n, &rq->dl.throttled_list, dl.throttled_node) {
>> +		p->dl.dl_throttled = 0;
>> +		hrtimer_cancel(&p->dl.dl_timer);
>
>Deadlock is possible here. You're holding rq->lock and want to cancel timer handler,
>which is waiting for your rq->lock.

So what's your idea to handle this?

>
>> +		p->dl.dl_runtime = p->dl.dl_runtime;
>> +		if (task_on_rq_queued(p))
>> +			enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
>> +	}
>> +
>>  	cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
>>  }
>>  
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index ec3917c..8f95036 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -482,6 +482,9 @@ struct dl_rq {
>>  	 */
>>  	struct rb_root pushable_dl_tasks_root;
>>  	struct rb_node *pushable_dl_tasks_leftmost;
>> +
>> +	struct list_head throttled_list;
>> +
>>  #else
>>  	struct dl_bw dl_bw;
>>  #endif
>
>What about the situations when task changes its sched_class?

I still not consider this currently, your proposal is a great welcome. ;-)

Regards,
Wanpeng Li

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/