From: Vincent Guittot
Date: Tue, 11 Oct 2016 15:14:47 +0200
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue
To: Matt Fleming
Cc: Wanpeng Li, Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
    Mike Galbraith, Yuyang Du, Dietmar Eggemann

On 11 October 2016 at 12:24, Matt Fleming wrote:
> On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:
>>
>> Subject: [PATCH] sched: use load_avg for selecting idlest group
>>
>> find_idlest_group() only compares the runnable_load_avg when looking for
>> the idlest group. But on fork-intensive use cases like hackbench, where
>> tasks block quickly after the fork, this can lead to selecting the same
>> CPU whereas other CPUs, which have a similar runnable load but a lower
>> load_avg, could be chosen instead.
>>
>> When the runnable_load_avg values of 2 CPUs are close, we now take into
>> account the amount of blocked load as a 2nd selection factor.
>>
>> For use cases like hackbench, this enables the scheduler to select
>> different CPUs during the fork sequence and to spread tasks across the
>> system.
>>
>> Tests have been done on a Hikey board (ARM-based octa-core) for several
>> kernels. The results below give min, max, avg and stdev values of 18
>> runs with each configuration.
>>
>> The v4.8+patches configuration also includes the change below, which is
>> part of the proposal made by Peter to ensure that the clock will be up
>> to date when the forked task is attached to the rq.
>>
>> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>>  	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>>  #endif
>>  	rq = __task_rq_lock(p, &rf);
>> +	update_rq_clock(rq);
>>  	post_init_entity_util_avg(&p->se);
>>
>>  	activate_task(rq, p, 0);
>>
>> hackbench -P -g 1
>>
>>        ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
>> min    0.049         0.050         0.051       0.048
>> avg    0.057         0.057 (0%)    0.057 (0%)  0.055 (+5%)
>> max    0.066         0.068         0.070       0.063
>> stdev  +/-9%         +/-9%         +/-8%       +/-9%
>>
>> Signed-off-by: Vincent Guittot
>> ---
>>  kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
>>  1 file changed, 32 insertions(+), 8 deletions(-)
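[The patch body itself is not quoted above, only the diffstat. As a rough
sketch of the tie-break the changelog describes: when two candidates have
runnable loads within a small margin, prefer the one with the lower
load_avg, since load_avg also accounts for blocked load. The ~2% margin
and all names below are illustrative assumptions, not the actual kernel
code.]

	#include <stdbool.h>

	/* Per-candidate loads, in the spirit of the PELT averages. */
	struct candidate {
		unsigned long runnable_load;	/* runnable_load_avg */
		unsigned long load_avg;		/* also counts blocked load */
	};

	/* Should @a be preferred over the current best @b? */
	static bool prefer_candidate(const struct candidate *a,
				     const struct candidate *b)
	{
		/* @a has clearly (>2%) less runnable load: take it. */
		if (a->runnable_load * 100 < b->runnable_load * 98)
			return true;

		/* @a has clearly more runnable load: keep @b. */
		if (a->runnable_load * 98 > b->runnable_load * 100)
			return false;

		/*
		 * Runnable loads are within ~2% of each other: use the
		 * blocked load (via load_avg) as the 2nd selection factor,
		 * so tasks that forked and blocked immediately still count
		 * against a candidate.
		 */
		return a->load_avg < b->load_avg;
	}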
> This patch looks pretty good to me and this 2-socket 48-cpu Xeon
> (domain0 SMT, domain1 MC, domain2 NUMA) shows a few nice performance
> improvements, and no regressions for various combinations of hackbench
> sockets/pipes and group numbers.

Good

> But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,
>
>   perf stat --null -r 25 -- hackbench -pipe 30 process 1000

I'm going to have a look at this use case on my Hikey board, which has
8 CPUs (2 clusters of quad cores, domain0 MC, domain1 DIE) and seems to
have a similar topology.

> I see a regression,
>
>   baseline: 2.41228
>   patched : 2.64528 (-9.7%)

Just to be sure: by baseline you mean v4.8?

> Even though the spread of tasks during fork[0] is improved,
>
>   baseline CV: 0.478%
>   patched CV : 0.042%
>
> Clearly the spread wasn't *that* bad to begin with on this machine for
> this workload. I consider the baseline spread to be pretty well
> distributed. Some other factor must be at play.
>
> Patched runqueue latencies are higher (max9* are percentiles),
>
>   baseline  mean: 615932.69  max90: 75272.00  max95: 175985.00  max99: 5884778.00  max: 1694084747.00
>   patched   mean: 882026.28  max90: 92015.00  max95: 291760.00  max99: 7590167.00  max: 1841154776.00
>
> And there are more migrations of hackbench tasks,
>
>   baseline: total: 5390  cross-MC: 3810  cross-DIE: 1580
>   patched : total: 7222  cross-MC: 4591  cross-DIE: 2631
>             (+34.0%)     (+20.5%)        (+66.5%)
>
> That's a lot more costly cross-DIE migrations. I think this patch is
> along the right lines, but there's something fishy happening on this
> box.
>
> [0] - Fork task placement spread measurement:
>
>   cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
>     sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'

Nice command to evaluate spread.
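[The CV (coefficient of variation) figures quoted above can be computed
from the per-CPU counts that this pipeline prints. A minimal standalone
sketch, assuming one count per line on stdin; this is not Matt's actual
tooling. Build with -lm.]

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double counts[1024], sum = 0.0, var = 0.0;
		int n = 0;

		/* One wakeup count per CPU, as printed by uniq -c | awk. */
		while (n < 1024 && scanf("%lf", &counts[n]) == 1)
			sum += counts[n++];

		if (n == 0)
			return 1;

		double mean = sum / n;

		for (int i = 0; i < n; i++)
			var += (counts[i] - mean) * (counts[i] - mean);
		var /= n;

		/* CV = stddev / mean, reported as a percentage. */
		printf("CV: %.3f%%\n", 100.0 * sqrt(var) / mean);
		return 0;
	}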