From: Vincent Guittot
Date: Tue, 11 Oct 2016 15:14:47 +0200
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue
To: Matt Fleming
Cc: Wanpeng Li, Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
    Mike Galbraith, Yuyang Du, Dietmar Eggemann

On 11 October 2016 at 12:24, Matt Fleming wrote:
> On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:
>>
>> Subject: [PATCH] sched: use load_avg for selecting idlest group
>>
>> find_idlest_group() only compares the runnable_load_avg when looking for
>> the idlest group. But on fork-intensive use cases like hackbench, where
>> tasks block quickly after the fork, this can lead to selecting the same
>> CPU whereas other CPUs, which have a similar runnable load but a lower
>> load_avg, could be chosen instead.
>>
>> When the runnable_load_avg values of 2 CPUs are close, we now take into
>> account the amount of blocked load as a 2nd selection factor.
>>
>> For use cases like hackbench, this enables the scheduler to select
>> different CPUs during the fork sequence and to spread tasks across the
>> system.
>>
>> Tests have been done on a Hikey board (ARM-based octa-core) for several
>> kernels. The results below give min, max, avg and stdev values of 18
>> runs with each configuration.
>>
>> The v4.8+patches configuration also includes the change below, which is
>> part of the proposal made by Peter to ensure that the clock will be up
>> to date when the forked task is attached to the rq.
>>
>> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>>  	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>>  #endif
>>  	rq = __task_rq_lock(p, &rf);
>> +	update_rq_clock(rq);
>>  	post_init_entity_util_avg(&p->se);
>>
>>  	activate_task(rq, p, 0);
>>
>> hackbench -P -g 1
>>
>>        ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
>> min    0.049         0.050         0.051       0.048
>> avg    0.057         0.057 (0%)    0.057 (0%)  0.055 (+5%)
>> max    0.066         0.068         0.070       0.063
>> stdev  +/-9%         +/-9%         +/-8%       +/-9%
>>
>> Signed-off-by: Vincent Guittot
>> ---
>>  kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
>>  1 file changed, 32 insertions(+), 8 deletions(-)
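[The patch body itself is not quoted above, only the diffstat. As a rough
sketch of the tie-break the changelog describes: when two candidates have
runnable loads within a small margin, prefer the one with the lower
load_avg, since load_avg also accounts for blocked load. The ~2% margin
and all names below are illustrative assumptions, not the actual kernel
code.]

	#include <stdbool.h>

	/* Per-candidate loads, in the spirit of the PELT averages. */
	struct candidate {
		unsigned long runnable_load;	/* runnable_load_avg */
		unsigned long load_avg;		/* also counts blocked load */
	};

	/* Should @a be preferred over the current best @b? */
	static bool prefer_candidate(const struct candidate *a,
				     const struct candidate *b)
	{
		/* @a has clearly (>2%) less runnable load: take it. */
		if (a->runnable_load * 100 < b->runnable_load * 98)
			return true;

		/* @a has clearly more runnable load: keep @b. */
		if (a->runnable_load * 98 > b->runnable_load * 100)
			return false;

		/*
		 * Runnable loads are within ~2% of each other: use the
		 * blocked load (via load_avg) as the 2nd selection factor,
		 * so tasks that forked and blocked immediately still count
		 * against a candidate.
		 */
		return a->load_avg < b->load_avg;
	}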
> This patch looks pretty good to me and this 2-socket 48-cpu Xeon
> (domain0 SMT, domain1 MC, domain2 NUMA) shows a few nice performance
> improvements, and no regressions for various combinations of hackbench
> sockets/pipes and group numbers.

Good

> But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,
>
>   perf stat --null -r 25 -- hackbench -pipe 30 process 1000

I'm going to have a look at this use case on my Hikey board, which has
8 CPUs (2 clusters of quad cores, domain0 MC, domain1 DIE) and seems to
have a similar topology.

> I see a regression,
>
>   baseline: 2.41228
>   patched : 2.64528 (-9.7%)

Just to be sure: by baseline you mean v4.8?

> Even though the spread of tasks during fork[0] is improved,
>
>   baseline CV: 0.478%
>   patched CV : 0.042%
>
> Clearly the spread wasn't *that* bad to begin with on this machine for
> this workload. I consider the baseline spread to be pretty well
> distributed. Some other factor must be at play.
>
> Patched runqueue latencies are higher (max9* are percentiles),
>
>   baseline  mean: 615932.69  max90: 75272.00  max95: 175985.00  max99: 5884778.00  max: 1694084747.00
>   patched   mean: 882026.28  max90: 92015.00  max95: 291760.00  max99: 7590167.00  max: 1841154776.00
>
> And there are more migrations of hackbench tasks,
>
>   baseline: total: 5390  cross-MC: 3810  cross-DIE: 1580
>   patched : total: 7222  cross-MC: 4591  cross-DIE: 2631
>             (+34.0%)     (+20.5%)        (+66.5%)
>
> That's a lot more costly cross-DIE migrations. I think this patch is
> along the right lines, but there's something fishy happening on this
> box.
>
> [0] - Fork task placement spread measurement:
>
>   cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
>     sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'

Nice command to evaluate spread.
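[The CV (coefficient of variation) figures quoted above can be computed
from the per-CPU counts that this pipeline prints. A minimal standalone
sketch, assuming one count per line on stdin; this is not Matt's actual
tooling. Build with -lm.]

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double counts[1024], sum = 0.0, var = 0.0;
		int n = 0;

		/* One wakeup count per CPU, as printed by uniq -c | awk. */
		while (n < 1024 && scanf("%lf", &counts[n]) == 1)
			sum += counts[n++];

		if (n == 0)
			return 1;

		double mean = sum / n;

		for (int i = 0; i < n; i++)
			var += (counts[i] - mean) * (counts[i] - mean);
		var /= n;

		/* CV = stddev / mean, reported as a percentage. */
		printf("CV: %.3f%%\n", 100.0 * sqrt(var) / mean);
		return 0;
	}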