Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
To: Vincent Guittot <vincent.guittot@linaro.org>,
        Peter Zijlstra <peterz@infradead.org>
References: <57FFADC8.2020602@canonical.com>
 <CAKfTPtDU+DmF4iLwHWF2jEnZCjPUXBOu9c6Zqh4+kQPMio39nQ@mail.gmail.com>
 <43c59cba-2044-1de2-0f78-8f346bd1e3cb@arm.com>
 <CAKfTPtAo=E62XK8kTrrCp1NbxQbhJeYFEoxfTEHvdKTyVZODbQ@mail.gmail.com>
 <CAKfTPtADw=JChCccEQ+HohaYwbOCh3QuWOFrGJdGudOHCD8-OA@mail.gmail.com>
 <d0b0e00d-2892-6386-3e09-8df568e161ef@arm.com>
 <20161014151827.GA10379@linaro.org>
 <2bb765e7-8a5f-c525-a6ae-fbec6fae6354@canonical.com>
 <20161017090903.GA11962@linaro.org>
 <4e15ad55-beeb-e860-0420-8f439d076758@arm.com>
 <20161017131952.GR3117@twins.programming.kicks-ass.net>
 <CAKfTPtAxw1b-vy285HKtPUBFYuJdv2CFZH_gP3CMtZHs1wLPXg@mail.gmail.com>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>,
        Ingo Molnar <mingo@kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        LKML <linux-kernel@vger.kernel.org>, Mike Galbraith <efault@gmx.de>,
        omer.akram@canonical.com
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Message-ID: <94cc6deb-f93e-60ec-5834-e84a8b98e73c@arm.com>
Date: Mon, 17 Oct 2016 23:52:39 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.3.0
MIME-Version: 1.0
In-Reply-To: <CAKfTPtAxw1b-vy285HKtPUBFYuJdv2CFZH_gP3CMtZHs1wLPXg@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2684
Lines: 56

On 10/17/2016 02:54 PM, Vincent Guittot wrote:
> On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

[...]

>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>>> initially because systemd will create all ~100 services (and therefore the
>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>>
>> Cute... on current kernels that translates to simply removing the call
>> to update_tg_load_avg(), lets see if we can figure out what goes
>> sideways first though, because it _should_ decay back out. And if that
> 
> yes, Reaching ~300000-400000 is not an issue in itself, the problem is
> that load_avg has decayed but it has not been reflected in
> tg->load_avg in the buggy case
> 
>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>
>> I'll see if I can reproduce this with a script creating heaps of cgroups
>> in a hurry, I have a total lack of system-disease on all my machines.


Something looks odd with the use of for_each_possible_cpu(i) in
online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

If I print the cpu id and the cpu masks inside the for_each_possible_cpu(i)
loop, I get:

[ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3

T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
80

/proc/sched_debug:

cfs_rq[0]:/system.slice
  ...
  .tg_load_avg                   : 323584
  ...

80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680. Allowing for a little bit of
decay, this could be the extra load on the system.slice tg.
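
As a sanity check on that arithmetic, here is a small user-space C sketch
(not kernel code). The helper names are made up for illustration, and the
1024 per-cfs_rq contribution is simply the value assumed in the estimate
above.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption taken from the estimate above: every freshly attached group
 * cfs_rq contributes 1024 to tg_load_avg. */
#define INIT_LOAD_PER_CFS_RQ 1024L

/* Hypothetical helper: load left behind by nr_groups task groups on
 * nr_stale_cpus CPUs whose contribution is never decayed. */
long stale_tg_load(long nr_groups, long nr_stale_cpus)
{
	assert(nr_groups >= 0 && nr_stale_cpus >= 0);
	return nr_groups * INIT_LOAD_PER_CFS_RQ * nr_stale_cpus;
}

/* Print the estimate next to an observed .tg_load_avg value. */
void report(long nr_groups, long nr_stale_cpus, long observed)
{
	long est = stale_tg_load(nr_groups, nr_stale_cpus);

	printf("estimated=%ld observed=%ld diff=%ld\n",
	       est, observed, est - observed);
}
```

Calling report(80, 4, 323584) prints
"estimated=327680 observed=323584 diff=4096". The gap of 4096 happens to
equal one group's share across four cpus (1 * 1024 * 4); the numbers alone
do not tell whether that is decay or one group fewer contributing.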

Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
of system.slice tg is 0 after startup.
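
The difference between the two iterators can be illustrated with a toy
user-space model in C. This is only a sketch of the hypothesis above, not
the kernel implementation: residual_tg_load() is a made-up name, the cpu
masks are plain bitmasks, and it assumes that an online cpu decays its
initial contribution completely while a possible-but-offline cpu never
does.

```c
#include <assert.h>
#include <stdio.h>

/* Per-cfs_rq initial contribution, as assumed in the estimate above. */
#define INIT_LOAD 1024L

/*
 * Toy model: walk the cpus set in iter_mask (standing in for the loop in
 * online_fair_sched_group()), and count INIT_LOAD for every cpu that is
 * visited but is not in online_mask. Model assumption: online cpus decay
 * their share back to 0, offline cpus keep it forever.
 */
long residual_tg_load(unsigned int iter_mask, unsigned int online_mask,
		      long nr_groups)
{
	long residual = 0;

	assert(nr_groups >= 0);

	for (int cpu = 0; cpu < 32; cpu++) {
		unsigned int bit = 1u << cpu;

		if (!(iter_mask & bit))
			continue;	/* cpu not visited by the loop */
		if (online_mask & bit)
			continue;	/* online: contribution decays away */
		residual += INIT_LOAD;	/* offline: contribution stays */
	}

	return residual * nr_groups;
}
```

With the masks from the log above, residual_tg_load(0xff, 0x0f, 80) models
for_each_possible_cpu(i) (possible 0-7, online 0-3) and returns 327680,
while residual_tg_load(0x0f, 0x0f, 80) models for_each_online_cpu(i) and
returns 0.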