Date: Wed, 1 Jul 2015 22:44:04 +0200
From: Peter Zijlstra
To: Rabin Vincent
Cc: Mike Galbraith, "mingo@redhat.com", "peterz@infradead.org", "linux-kernel@vger.kernel.org", Paul Turner, Ben Segall, yuyang.du@intel.com, Morten Rasmussen
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150701204404.GH25159@twins.programming.kicks-ass.net>
References: <20150630143057.GA31689@axis.com> <1435728995.9397.7.camel@gmail.com> <20150701145551.GA15690@axis.com>
In-Reply-To: <20150701145551.GA15690@axis.com>

On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> PID: 413  TASK: 8edda408  CPU: 1  COMMAND: "rngd"
> task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
>  SE: 8edda450 load_avg_contrib:   0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
>  SE: 8fffbd00 load_avg_contrib:   0 load.weight:    2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
>  SE: 8f531f80 load_avg_contrib:   0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
>  SE: 8f456e00 load_avg_contrib: 118 load.weight:  911 PARENT: 00000000 GROUPNAME: system.slice

So there are two problems here... the first we can (and should) fix; the
second I'm not sure there's anything we can do about.

Firstly, a group (parent) load_avg_contrib should never be less than that
of its constituent parts; therefore the top 3 SEs should have at least 118
too.

Now it's been a while since I looked at the per-entity load tracking
stuff, so some of the details have left me, but while it looks like we add
the se->avg.load_avg_contrib to its cfs_rq->runnable_load_avg, we do not
propagate that into the corresponding (group) se.

This means the se->avg.load_avg_contrib is accounted per cpu without
migration benefits. So if our task just got migrated onto a cpu that
hasn't run the group in a while, the group will not have accumulated
runtime.

A quick fix would be something like the below; although I think we want to
do something else, like maybe propagate the load_avg_contrib up the
hierarchy etc. But I need to think more about that.

The second problem is that your second SE has a weight of 2, which gives
task_h_load() a factor of 1/512 that will flatten pretty much anything
down to almost nothing. This is per configuration, so there's really not
something we can or should do about that.
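To make the flattening concrete, here is a stand-alone user-space sketch
(not the kernel code; the hierarchy depth and the numbers are made up,
loosely following the dump above) of how the per-level factors multiply
out: a stale contrib of 0 anywhere in the chain forces task_h_load() to 0,
and a weight-2 group se caps its level's factor at roughly 2/1025.

/*
 * Toy model of the h_load walk, plain user-space C -- not kernel code.
 * Each level scales the load by
 *
 *	contrib_of_group_se / (runnable_load_avg_of_parent + 1)
 *
 * leaf-most level last.  Numbers loosely follow the dump above.
 */
#include <stdio.h>

int main(void)
{
	unsigned long contrib[]  = { 118, 0, 0, 0 };	/* stale below the top */
	unsigned long runnable[] = { 118, 118, 1, 1 };
	unsigned long load = 118;			/* root h_load, made up */
	unsigned int i;

	for (i = 0; i < sizeof(contrib) / sizeof(contrib[0]); i++)
		load = load * contrib[i] / (runnable[i] + 1);

	printf("task_h_load() with stale group contribs: %lu\n", load);	/* 0 */

	/*
	 * Second problem: a group se of weight 2 contributes at most ~2
	 * against a parent carrying ~1024 of load, i.e. a ~1/512 factor.
	 */
	printf("weight-2 level factor: %lu/%lu\n", 2UL, 1024UL + 1);
	return 0;
}

Either way the result rounds down to 0 (or very nearly so), which matches
the task_h_load(): 0 in the dump.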
Untested, uncompiled hackery following, mostly for discussion.

---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0ca0a6..95d0ba249c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6082,7 +6082,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 	unsigned long now = jiffies;
-	unsigned long load;
+	unsigned long load, load_avg_contrib = 0;
 
 	if (cfs_rq->last_h_load_update == now)
 		return;
@@ -6090,6 +6090,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	cfs_rq->h_load_next = NULL;
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_load_avg_contrib = load_avg_contrib =
+			max(load_avg_contrib, se->avg.load_avg_contrib);
 		cfs_rq->h_load_next = se;
 		if (cfs_rq->last_h_load_update == now)
 			break;
@@ -6102,7 +6104,7 @@
 
 	while ((se = cfs_rq->h_load_next) != NULL) {
 		load = cfs_rq->h_load;
-		load = div64_ul(load * se->avg.load_avg_contrib,
+		load = div64_ul(load * cfs_rq->h_load_avg_contrib,
 				cfs_rq->runnable_load_avg + 1);
 		cfs_rq = group_cfs_rq(se);
 		cfs_rq->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 885889190a1f..7738e3b301b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ struct cfs_rq {
 	 * this group.
 	 */
 	unsigned long h_load;
+	unsigned long h_load_avg_contrib;
 	u64 last_h_load_update;
 	struct sched_entity *h_load_next;
 #endif /* CONFIG_FAIR_GROUP_SCHED */