From: Vincent Guittot
Date: Thu, 27 Apr 2017 10:29:10 +0200
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
In-Reply-To: <20170426225202.GC11348@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org>
            <20170424201444.GC14169@wtj.duckdns.org>
            <20170426225202.GC11348@wtj.duckdns.org>

On 27 April 2017 at 00:52, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> On 24 April 2017 at 22:14, Tejun Heo wrote:
>> Can the problem be on the load balance side instead? And more
>> precisely in the wakeup path?
>> After looking at the trace, it seems that task placement happens in
>> the wakeup path, and if it fails to select the right idle CPU at
>> wakeup, you will have to wait for a load balance, which is already
>> too late.
>
> Oh, I was tracing most of the scheduler activity and the ratios of
> wakeups picking idle CPUs were about the same regardless of cgroup
> membership.  I can confidently say that the latency issue that I'm
> seeing is from the load balancer picking the wrong busiest CPU, which
> is not to say that there can't be other problems.

OK. Is there any trace that you can share? Your behavior seems
different from mine.

>
>> > another queued wouldn't report the correspondingly higher
>>
>> It will, as load_avg includes runnable_load_avg, so whatever load is
>> in runnable_load_avg will be in load_avg too. But on the contrary,
>> runnable_load_avg will not include the blocked task that is going to
>> wake up soon in the case of schbench.
>
> The decaying contribution of blocked tasks doesn't affect the busiest
> CPU selection.  Without cgroups, runnable_load_avg is immediately
> increased and decreased as tasks enter and leave the queue; otherwise
> we end up with CPUs which are idle while threads queued on different
> CPUs accumulate scheduling latencies.
>
> The patch doesn't change how the busiest CPU is picked.  It already
> uses runnable_load_avg.  The change that cgroup causes is that it
> blocks updates to runnable_load_avg from newly scheduled or sleeping
> tasks.
>
> The issue isn't about whether runnable_load_avg or load_avg should be
> used but the unexpected differences in the metrics that the load

I think that's the root of the problem. I explain my view a bit more
in the other thread.

> balancer uses depending on whether cgroup is used or not.
>
>> One last thing: the load_avg of an idle CPU can stay blocked for a
>> while (until a load balance happens that will update the blocked
>> load) and can be seen as "busy" whereas it is not. Could it be a
>> reason for your problem?
>
> AFAICS, the load balancer doesn't use load_avg.
>
> Thanks.
>
> --
> tejun
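
To make the distinction we keep coming back to concrete, below is a
minimal user-space sketch of the two metrics. It is only an
illustration: the struct, helpers and numbers are invented for the
example and are not the actual code or fields in kernel/sched/fair.c.
The point it shows is the one above: runnable_load_avg moves
immediately on enqueue/dequeue, while load_avg also carries the
decaying contribution of blocked tasks, which is why an idle CPU can
still look "busy" by load_avg.

/*
 * Illustrative user-space sketch only, not kernel code.  All names
 * below (cpu_load, enqueue, dequeue_sleep, pick_busiest, ...) are
 * made up for this example.
 */
#include <stdio.h>

struct cpu_load {
	unsigned long runnable_load_avg;  /* load of queued tasks only */
	unsigned long blocked_load_avg;   /* decaying load of sleeping tasks */
};

/* load_avg as discussed above: runnable plus blocked contributions */
static unsigned long load_avg(const struct cpu_load *c)
{
	return c->runnable_load_avg + c->blocked_load_avg;
}

static void enqueue(struct cpu_load *c, unsigned long w)
{
	c->runnable_load_avg += w;        /* visible to the balancer at once */
}

static void dequeue_sleep(struct cpu_load *c, unsigned long w)
{
	c->runnable_load_avg -= w;
	c->blocked_load_avg += w;         /* still in load_avg, decays over time */
}

/* crude geometric decay of the blocked contribution, one step per period */
static void decay(struct cpu_load *c)
{
	c->blocked_load_avg -= c->blocked_load_avg / 2;
}

/* busiest-CPU choice by runnable_load_avg, as described above */
static int pick_busiest(const struct cpu_load *cpus, int nr)
{
	int i, busiest = 0;

	for (i = 1; i < nr; i++)
		if (cpus[i].runnable_load_avg > cpus[busiest].runnable_load_avg)
			busiest = i;
	return busiest;
}

int main(void)
{
	struct cpu_load cpus[2] = { { 0, 0 }, { 0, 0 } };

	enqueue(&cpus[0], 1024);          /* one task queued on CPU0 */
	enqueue(&cpus[1], 1024);          /* one task queued on CPU1 */
	dequeue_sleep(&cpus[1], 1024);    /* the CPU1 task blocks */

	printf("CPU1 runnable_load_avg=%lu load_avg=%lu\n",
	       cpus[1].runnable_load_avg, load_avg(&cpus[1]));
	printf("busiest by runnable_load_avg: CPU%d\n", pick_busiest(cpus, 2));

	decay(&cpus[1]);                  /* blocked load only fades away over time */
	printf("after decay: CPU1 load_avg=%lu\n", load_avg(&cpus[1]));

	return 0;
}

The cgroup case Tejun describes would correspond to enqueue() and
dequeue_sleep() happening on a child cfs_rq without being reflected
immediately in the root's runnable_load_avg, so the equivalent of
pick_busiest() above ends up working on stale numbers.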