Date: Wed, 1 Jul 2015 22:44:04 +0200
From: Peter Zijlstra
To: Rabin Vincent
Cc: Mike Galbraith, "mingo@redhat.com", "peterz@infradead.org", "linux-kernel@vger.kernel.org", Paul Turner, Ben Segall, yuyang.du@intel.com, Morten Rasmussen
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150701204404.GH25159@twins.programming.kicks-ass.net>
References: <20150630143057.GA31689@axis.com> <1435728995.9397.7.camel@gmail.com> <20150701145551.GA15690@axis.com>
In-Reply-To: <20150701145551.GA15690@axis.com>

On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> PID: 413  TASK: 8edda408  CPU: 1  COMMAND: "rngd"
> task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
>  SE: 8edda450 load_avg_contrib:   0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
>  SE: 8fffbd00 load_avg_contrib:   0 load.weight:    2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
>  SE: 8f531f80 load_avg_contrib:   0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
>  SE: 8f456e00 load_avg_contrib: 118 load.weight:  911 PARENT: 00000000 GROUPNAME: system.slice

So there are two problems here... the first we can (and should) fix; the
second I'm not sure there's anything we can do about.

Firstly, a group (parent) load_avg_contrib should never be less than that
of its constituent parts; therefore the top 3 SEs should have at least 118
too.

Now it's been a while since I looked at the per-entity load tracking
stuff, so some of the details have left me, but while it looks like we add
the se->avg.load_avg_contrib to its cfs_rq->runnable_load_avg, we do not
propagate that into the corresponding (group) se.

This means the se->avg.load_avg_contrib is accounted per cpu without
migration benefits. So if our task just got migrated onto a cpu that
hasn't run the group in a while, the group will not have accumulated
runtime.

A quick fix would be something like the below; although I think we want to
do something else, like maybe propagate the load_avg_contrib up the
hierarchy etc. But I need to think more about that.

The second problem is that your second SE has a weight of 2, which gives
task_h_load() a factor of 1/512 that will flatten pretty much anything
down to almost nothing. This is per configuration, so there's really not
something we can or should do about that.
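To make the flattening concrete, here is a stand-alone user-space sketch
(not the kernel code; the hierarchy depth and the numbers are made up,
loosely following the dump above) of how the per-level factors multiply
out: a stale contrib of 0 anywhere in the chain forces task_h_load() to 0,
and a weight-2 group se caps its level's factor at roughly 2/1025.

/*
 * Toy model of the h_load walk, plain user-space C -- not kernel code.
 * Each level scales the load by
 *
 *	contrib_of_group_se / (runnable_load_avg_of_parent + 1)
 *
 * leaf-most level last.  Numbers loosely follow the dump above.
 */
#include <stdio.h>

int main(void)
{
	unsigned long contrib[]  = { 118, 0, 0, 0 };	/* stale below the top */
	unsigned long runnable[] = { 118, 118, 1, 1 };
	unsigned long load = 118;			/* root h_load, made up */
	unsigned int i;

	for (i = 0; i < sizeof(contrib) / sizeof(contrib[0]); i++)
		load = load * contrib[i] / (runnable[i] + 1);

	printf("task_h_load() with stale group contribs: %lu\n", load);	/* 0 */

	/*
	 * Second problem: a group se of weight 2 contributes at most ~2
	 * against a parent carrying ~1024 of load, i.e. a ~1/512 factor.
	 */
	printf("weight-2 level factor: %lu/%lu\n", 2UL, 1024UL + 1);
	return 0;
}

Either way the result rounds down to 0 (or very nearly so), which matches
the task_h_load(): 0 in the dump.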
Untested, uncompiled hackery following, mostly for discussion.

---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0ca0a6..95d0ba249c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6082,7 +6082,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 	unsigned long now = jiffies;
-	unsigned long load;
+	unsigned long load, load_avg_contrib = 0;
 
 	if (cfs_rq->last_h_load_update == now)
 		return;
@@ -6090,6 +6090,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	cfs_rq->h_load_next = NULL;
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_load_avg_contrib = load_avg_contrib =
+			max(load_avg_contrib, se->avg.load_avg_contrib);
 		cfs_rq->h_load_next = se;
 		if (cfs_rq->last_h_load_update == now)
 			break;
@@ -6102,7 +6104,7 @@
 
 	while ((se = cfs_rq->h_load_next) != NULL) {
 		load = cfs_rq->h_load;
-		load = div64_ul(load * se->avg.load_avg_contrib,
+		load = div64_ul(load * cfs_rq->h_load_avg_contrib,
 				cfs_rq->runnable_load_avg + 1);
 		cfs_rq = group_cfs_rq(se);
 		cfs_rq->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 885889190a1f..7738e3b301b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ struct cfs_rq {
 	 * this group.
 	 */
 	unsigned long h_load;
+	unsigned long h_load_avg_contrib;
 	u64 last_h_load_update;
 	struct sched_entity *h_load_next;
 #endif /* CONFIG_FAIR_GROUP_SCHED */