Subject: [patch] sched: Fix smp nice induced group scheduling load distribution woes
From: Mike Galbraith
To: Peter Zijlstra
Cc: LKML, Brendan Gregg, Jeff Merkey
Date: Wed, 27 Apr 2016 09:09:51 +0200
Message-ID: <1461740991.3622.3.camel@gmail.com>
In-Reply-To: <1461575925.3670.25.camel@gmail.com>
References: <1461481517.3835.125.camel@gmail.com>
	 <1461575925.3670.25.camel@gmail.com>

On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> > 
> > > The bugs they found seem real, and their analysis is great (although
> > > using visualizations to find and fix scheduler bugs isn't new), and
> > > it would be good to see these fixed.  However, it would also be
> > > useful to double check how widespread these issues really are.  I
> > > suspect many on this list can test these patches in different
> > > environments.
> > 
> > Part of it sounded to me very much like they're meeting and "fixing"
> > SMP group fairness...
> 
> Ew, NUMA boxen look like they could use a hug or two.  Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.

sched: Fix smp nice induced group scheduling load distribution woes

On even a modest sized NUMA box, any load that wants to scale is
essentially reduced to SCHED_IDLE class by smp nice scaling.  Limit
niceness to prevent cramming a box wide load into a too small space.
Given that niceness also affects latency, give the user the option to
completely disable box wide group fairness as well.

time make -j192 modules on a 4 node NUMA box..

Before:
  root cgroup                    real  1m6.987s    1.00
  cgroup vs 1 group  of 1 hog    real  1m20.871s   1.20
  cgroup vs 2 groups of 1 hog    real  1m48.803s   1.62

Each single task group receives a ~full socket because the kbuild has
become an essentially massless object that fits in practically no
space at all.  Near perfect math led directly to far from good
scaling/performance, a "Perfect is the enemy of good" poster child.

After the "Let's just be nice enough instead" adjustment, single task
groups continued to sustain >99% utilization while competing with the
box sized kbuild.

  cgroup vs 2 groups of 1 hog    real  1m8.151s    1.01   192/190=1.01

Good enough works better.. nearly perfectly in this case.
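For those wondering where the kbuild's weight went, here is a
stand-alone toy model of the per-cpu group entity weight behind the
numbers above.  It is only a sketch, not the kernel code path: no PELT,
no load balancing, just tg->shares * local_load / group_load with and
without the tg_weight clamp the patch adds.  It assumes the -j192 box
has 192 CPUs, default cpu.shares (1024), and the stock nice-to-weight
table values (nice 0 = 1024, nice -10 = 9548).

/*
 * Toy model only.  Weights: nice 0 = 1024, nice -10 = 9548, as in
 * sched_prio_to_weight[].
 */
#include <stdio.h>

#define MIN_SHARES      2
#define NICE_0_WEIGHT   1024    /* sched_prio_to_weight[20] */
#define NICE_M10_WEIGHT 9548    /* sched_prio_to_weight[10] */

/* shares a group entity gets on one cpu: tg->shares * local / total */
static long group_cpu_shares(long tg_shares, long local, long total, int clamp)
{
        long shares;

        if (clamp && total > NICE_M10_WEIGHT)   /* the patch's tg_weight bound */
                total = NICE_M10_WEIGHT;

        shares = tg_shares * local;
        if (total)
                shares /= total;
        if (shares < MIN_SHARES)
                shares = MIN_SHARES;
        if (shares > tg_shares)
                shares = tg_shares;
        return shares;
}

int main(void)
{
        long cpus = 192;                        /* the 4 node box above */
        long kbuild_local = NICE_0_WEIGHT;      /* one nice-0 task per cpu */
        long kbuild_total = cpus * NICE_0_WEIGHT;
        long hog = NICE_0_WEIGHT;               /* 1 hog group, all on one cpu */
        long k;

        k = group_cpu_shares(NICE_0_WEIGHT, kbuild_local, kbuild_total, 0);
        printf("unclamped: kbuild %4ld vs hog %4ld -> %4.1f%% of the hog's cpu\n",
               k, hog, 100.0 * k / (k + hog));

        k = group_cpu_shares(NICE_0_WEIGHT, kbuild_local, kbuild_total, 1);
        printf("clamped:   kbuild %4ld vs hog %4ld -> %4.1f%% of the hog's cpu\n",
               k, hog, 100.0 * k / (k + hog));
        return 0;
}

The toy only shows instantaneous entity weights (~5 vs 1024 unclamped,
~109 vs 1024 clamped); what the group aware load balancer then does
with a ~5 weight entity is where the "lose a node" part comes from.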
Signed-off-by: Mike Galbraith
---
 kernel/sched/fair.c     |   22 ++++++++++++++++++----
 kernel/sched/features.h |    3 +++
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct
 
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-        long tg_weight, load, shares;
+        long tg_weight, load, shares, min_shares = MIN_SHARES;
 
-        tg_weight = calc_tg_weight(tg, cfs_rq);
+        if (!sched_feat(SMP_NICE_GROUPS))
+                return tg->shares;
+
+        /*
+         * Bound niceness to prevent everything that wants to scale from
+         * essentially becoming SCHED_IDLE on multi/large socket boxen,
+         * screwing up our ability to distribute load properly and/or
+         * deliver acceptable latencies.
+         */
+        tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
         load = cfs_rq->load.weight;
 
         shares = (tg->shares * load);
         if (tg_weight)
                 shares /= tg_weight;
 
-        if (shares < MIN_SHARES)
-                shares = MIN_SHARES;
+        if (tg->shares > sched_prio_to_weight[20])
+                min_shares = sched_prio_to_weight[20];
+        if (shares < min_shares)
+                shares = min_shares;
         if (shares > tg->shares)
                 shares = tg->shares;
 
@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
         if (likely(se->load.weight == tg->shares))
                 return;
+#else
+        if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+                return;
 #endif
         shares = calc_cfs_shares(cfs_rq, tg);
 
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif
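
(Aside: the new bit is a plain sched_feat() knob, so with
CONFIG_SCHED_DEBUG it can be flipped at runtime by writing
SMP_NICE_GROUPS or NO_SMP_NICE_GROUPS to
/sys/kernel/debug/sched_features.  The scrap of C below is only a
convenience sketch of that; an echo from the shell does exactly the
same thing.)

/*
 * Convenience sketch only: flip the (proposed) SMP_NICE_GROUPS bit at
 * runtime.  Needs CONFIG_SCHED_DEBUG, a mounted debugfs, and root.
 * Equivalent to: echo NO_SMP_NICE_GROUPS > /sys/kernel/debug/sched_features
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *knob = (argc > 1 && !strcmp(argv[1], "off")) ?
                                "NO_SMP_NICE_GROUPS" : "SMP_NICE_GROUPS";
        FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

        if (!f) {
                perror("sched_features");
                return 1;
        }
        fprintf(f, "%s\n", knob);
        return fclose(f) ? 1 : 0;
}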