Date: Thu, 10 Nov 2016 12:34:15 +0000
From: Juri Lelli
To: luca abeni
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Ingo Molnar, Claudio Scordino, Steven Rostedt
Subject: Re: [RFC v3 2/6] Improve the tracking of active utilisation
Message-ID: <20161110123415.GJ16920@e106622-lin>
In-Reply-To: <20161110131558.7a9480b2@sweethome>

On 10/11/16 13:15, Luca Abeni wrote:
> On Thu, 10 Nov 2016 11:56:10 +0000
> Juri Lelli wrote:
>
> > On 10/11/16 10:04, Juri Lelli wrote:
> > > On 02/11/16 03:35, Luca Abeni wrote:
> > > > On Tue, 1 Nov 2016 22:46:33 +0100
> > > > luca abeni wrote:
> > > > [...]
> > > > > > > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > > > > > > 	}
> > > > > > > 	rcu_read_unlock();
> > > > > > >
> > > > > > > +	rq = task_rq(p);
> > > > > > > +	raw_spin_lock(&rq->lock);
> > > > > > > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > > > > > > +		sub_running_bw(&p->dl, &rq->dl);
> > > > > > > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);
> > > > > >
> > > > > > Can't we subtract twice if it happens that, after we grabbed
> > > > > > rq_lock, the timer fired, so it's now waiting for that lock,
> > > > > > and it goes ahead and does sub_running_bw again after we
> > > > > > release the lock?
> > > > >
> > > > > Uhm... I somehow convinced myself that this could not happen,
> > > > > but I do not remember the details, sorry :(
> > > >
> > > > I think I remember the answer now: pi_lock is acquired before
> > > > invoking select_task_rq and is released after invoking
> > > > enqueue_task... So, if there is a pending inactive timer, its
> > > > handler will be executed after the task is enqueued... It will
> > > > see the task as RUNNING, and will not decrease the active
> > > > utilisation.
> > >
> > > Oh, because we do task_rq_lock() in inactive_task_timer(). So, that
> > > should save us from the double subtract. Would you mind adding
> > > something along the lines of what you said above as a comment for
> > > the next version?
> >
> > Mmm, wait again.
> >
> > Cannot the following happen?
> >
> >  - inactive_timer fires and does sub_running_bw (as the task is not
> >    RUNNING)
> >  - another cpu does try_to_wake_up and blocks on pi_lock
> >  - inactive timer releases both pi and rq locks (but is still
> >    executing, let's say it is doing put_task_struct())
> >  - try_to_wake_up goes ahead and calls select_task_rq_dl
> >    + it finds inactive_timer active
> >    + sub_running_bw again :(
>
> Uhm... Right; this can happen :(

:(

> Ok; I'll think about some possible solution for this race...
> If I do not find any simple way to solve it, I'll add a "contending"
> flag, which makes it possible to tell whether the inactive timer
> handler has already executed or not.

Right, this might help. Another thing that I was thinking of is whether
we can use the return value of hrtimer_try_to_cancel() to decide what
to do:

 - if it returns 0, the callback has already executed or the timer was
   never set, so there is nothing to do (as in nothing to
   sub_running_bw from)
 - if it returns 1, we succeeded in cancelling it, so we need to
   actively sub_running_bw ourselves
 - if it returns -1, we can assume that the handler will eventually do
   sub_running_bw(), so we don't need to care explicitly

Now, I guess the problem is that the task can be migrated while its
inactive_timer is set (by select_task_rq_dl, or by other classes' load
balancing if it is setscheduled to a different class). Can't we store a
back reference to the rq from which the inactive_timer was queued and
use that to sub_running_bw() from? It seems that we might end up with
some "shadow" bandwidth, say when we do a wakeup migration, but maybe
this is something we can tolerate? Just thinking aloud. :)

> BTW, talking about sched_dl_entity flags: I see there are three
> different int fields, "dl_throttled", "dl_boosted" and "dl_yielded";
> is there any reason for doing this instead of having a single
> "dl_flags" field and setting its different bits when the entity is
> throttled, boosted or yielded? In other words: if I need this
> "contending" flag, should I add a new "dl_contending" field?

I think you might want to add a clean-up patch to your series (or a
separate one) fixing the current situation, and then build on top of
it, adding the new flag if needed.

Thanks,

- Juri