Subject: [patch] sched: Fix smp nice induced group scheduling load distribution woes
From: Mike Galbraith
To: Peter Zijlstra
Cc: LKML, Brendan Gregg, Jeff Merkey
Date: Wed, 27 Apr 2016 09:09:51 +0200
Message-ID: <1461740991.3622.3.camel@gmail.com>
In-Reply-To: <1461575925.3670.25.camel@gmail.com>
References: <1461481517.3835.125.camel@gmail.com>
	 <1461575925.3670.25.camel@gmail.com>

On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> > 
> > > The bugs they found seem real, and their analysis is great (although
> > > using visualizations to find and fix scheduler bugs isn't new), and
> > > it would be good to see these fixed.  However, it would also be
> > > useful to double check how widespread these issues really are.  I
> > > suspect many on this list can test these patches in different
> > > environments.
> > 
> > Part of it sounded to me very much like they're meeting and "fixing"
> > SMP group fairness...
> 
> Ew, NUMA boxen look like they could use a hug or two.  Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.

sched: Fix smp nice induced group scheduling load distribution woes

On even a modest sized NUMA box, any load that wants to scale is
essentially reduced to SCHED_IDLE class by smp nice scaling.  Limit
niceness to prevent cramming a box wide load into a too small space.
Given that niceness also affects latency, give the user the option to
completely disable box wide group fairness as well.

time make -j192 modules on a 4 node NUMA box..

Before:
  root cgroup                    real  1m6.987s    1.00
  cgroup vs 1 group  of 1 hog    real  1m20.871s   1.20
  cgroup vs 2 groups of 1 hog    real  1m48.803s   1.62

Each single task group receives a ~full socket because the kbuild has
become an essentially massless object that fits in practically no
space at all.  Near perfect math led directly to far from good
scaling/performance, a "Perfect is the enemy of good" poster child.

After the "Let's just be nice enough instead" adjustment, single task
groups continued to sustain >99% utilization while competing with the
box sized kbuild.

  cgroup vs 2 groups of 1 hog    real  1m8.151s    1.01   192/190=1.01

Good enough works better.. nearly perfectly in this case.
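For those wondering where the kbuild's weight went, here is a
stand-alone toy model of the per-cpu group entity weight behind the
numbers above.  It is only a sketch, not the kernel code path: no PELT,
no load balancing, just tg->shares * local_load / group_load with and
without the tg_weight clamp the patch adds.  It assumes the -j192 box
has 192 CPUs, default cpu.shares (1024), and the stock nice-to-weight
table values (nice 0 = 1024, nice -10 = 9548).

/*
 * Toy model only.  Weights: nice 0 = 1024, nice -10 = 9548, as in
 * sched_prio_to_weight[].
 */
#include <stdio.h>

#define MIN_SHARES      2
#define NICE_0_WEIGHT   1024    /* sched_prio_to_weight[20] */
#define NICE_M10_WEIGHT 9548    /* sched_prio_to_weight[10] */

/* shares a group entity gets on one cpu: tg->shares * local / total */
static long group_cpu_shares(long tg_shares, long local, long total, int clamp)
{
        long shares;

        if (clamp && total > NICE_M10_WEIGHT)   /* the patch's tg_weight bound */
                total = NICE_M10_WEIGHT;

        shares = tg_shares * local;
        if (total)
                shares /= total;
        if (shares < MIN_SHARES)
                shares = MIN_SHARES;
        if (shares > tg_shares)
                shares = tg_shares;
        return shares;
}

int main(void)
{
        long cpus = 192;                        /* the 4 node box above */
        long kbuild_local = NICE_0_WEIGHT;      /* one nice-0 task per cpu */
        long kbuild_total = cpus * NICE_0_WEIGHT;
        long hog = NICE_0_WEIGHT;               /* 1 hog group, all on one cpu */
        long k;

        k = group_cpu_shares(NICE_0_WEIGHT, kbuild_local, kbuild_total, 0);
        printf("unclamped: kbuild %4ld vs hog %4ld -> %4.1f%% of the hog's cpu\n",
               k, hog, 100.0 * k / (k + hog));

        k = group_cpu_shares(NICE_0_WEIGHT, kbuild_local, kbuild_total, 1);
        printf("clamped:   kbuild %4ld vs hog %4ld -> %4.1f%% of the hog's cpu\n",
               k, hog, 100.0 * k / (k + hog));
        return 0;
}

The toy only shows instantaneous entity weights (~5 vs 1024 unclamped,
~109 vs 1024 clamped); what the group aware load balancer then does
with a ~5 weight entity is where the "lose a node" part comes from.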
Signed-off-by: Mike Galbraith
---
 kernel/sched/fair.c     |   22 ++++++++++++++++++----
 kernel/sched/features.h |    3 +++
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct
 
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-        long tg_weight, load, shares;
+        long tg_weight, load, shares, min_shares = MIN_SHARES;
 
-        tg_weight = calc_tg_weight(tg, cfs_rq);
+        if (!sched_feat(SMP_NICE_GROUPS))
+                return tg->shares;
+
+        /*
+         * Bound niceness to prevent everything that wants to scale from
+         * essentially becoming SCHED_IDLE on multi/large socket boxen,
+         * screwing up our ability to distribute load properly and/or
+         * deliver acceptable latencies.
+         */
+        tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
         load = cfs_rq->load.weight;
 
         shares = (tg->shares * load);
         if (tg_weight)
                 shares /= tg_weight;
 
-        if (shares < MIN_SHARES)
-                shares = MIN_SHARES;
+        if (tg->shares > sched_prio_to_weight[20])
+                min_shares = sched_prio_to_weight[20];
+        if (shares < min_shares)
+                shares = min_shares;
         if (shares > tg->shares)
                 shares = tg->shares;
 
@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
         if (likely(se->load.weight == tg->shares))
                 return;
+#else
+        if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+                return;
 #endif
         shares = calc_cfs_shares(cfs_rq, tg);
 
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif
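
(Aside: the new bit is a plain sched_feat() knob, so with
CONFIG_SCHED_DEBUG it can be flipped at runtime by writing
SMP_NICE_GROUPS or NO_SMP_NICE_GROUPS to
/sys/kernel/debug/sched_features.  The scrap of C below is only a
convenience sketch of that; an echo from the shell does exactly the
same thing.)

/*
 * Convenience sketch only: flip the (proposed) SMP_NICE_GROUPS bit at
 * runtime.  Needs CONFIG_SCHED_DEBUG, a mounted debugfs, and root.
 * Equivalent to: echo NO_SMP_NICE_GROUPS > /sys/kernel/debug/sched_features
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *knob = (argc > 1 && !strcmp(argv[1], "off")) ?
                                "NO_SMP_NICE_GROUPS" : "SMP_NICE_GROUPS";
        FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

        if (!f) {
                perror("sched_features");
                return 1;
        }
        fprintf(f, "%s\n", knob);
        return fclose(f) ? 1 : 0;
}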