Subject: Re: [PATCH V6] sched/fair: Remove group imbalance from
 calculate_imbalance()
To: Peter Zijlstra <peterz@infradead.org>
Cc: Jeffrey Hugo <jhugo@codeaurora.org>, Ingo Molnar <mingo@redhat.com>,
        linux-kernel@vger.kernel.org, Austin Christ <austinwc@codeaurora.org>,
        Tyler Baicar <tbaicar@codeaurora.org>,
        Timur Tabi <timur@codeaurora.org>
References: <1499975708-31090-1-git-send-email-jhugo@codeaurora.org>
 <0f91065c-94cb-3780-0d77-f2be682086bf@arm.com>
 <20170726145407.rfswqxoclvezukwq@hirez.programming.kicks-ass.net>
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Message-ID: <5ddf061e-26a2-7151-adff-7ae339c848ac@arm.com>
Date: Fri, 28 Jul 2017 13:16:24 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <20170726145407.rfswqxoclvezukwq@hirez.programming.kicks-ass.net>
Content-Type: text/plain; charset=utf-8
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2639
Lines: 79

On 26/07/17 15:54, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
>> Hi Jeffrey,
>>
>> On 13/07/17 20:55, Jeffrey Hugo wrote:

[...]

>>> Since the group imbalance path in calculate_imbalance() is at best a NOP
>>> but otherwise harmful, remove it.
> 
> Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
> to make sense a long time ago, but smp-nice and then cgroups made a
> complete joke of things.
> 
>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>> level sched groups.
> 
> That'd be a NUMA box?

I don't think it's NUMA. The SD levels are MC and DIE, with the number
of DIE sg's >> 2.

[...]

>> but here the prefer_sibling handling (group overloaded) eclipses 'group
>> imbalance' the moment one of the cfs tasks can go to cpu2 so the if
>> condition you got rid of is a nop.
>>
>> I wonder if it is fair to say that your fix helps multi-cluster
>> (especially with n > 2) systems without SMT and with your first patch
>> [1] for this specific, cpu affinity restricted test cases.
> 
> I tried on an IVB-EP with all the HT siblings unplugged, could not
> reproduce either. Still at n=2 though. Let me fire up an EX, that'll get
> me n=4.
> 
> So this is 4 * 18 * 2 = 144 cpus:

Impressive ;-)

> 
> # for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> # taskset -pc 0,18 $$
> # while :; do :; done & while :; do :; done &
> 
> So I'm taking SMT out, affine to first and second MC group, start 2
> loops.
> 
> Using another console I see them both using 100%.
> 
> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> Then instantly they balance and I get 2x100% back.

Yeah, could reproduce on IVB-EP (2x10x2).

> Anything else I need to reproduce? (other than maybe a slightly less
> insane machine :-)

I guess what Jeff is trying to avoid is that 'busiest->load_per_task'
lowered to 'sds->avg_load' in case of an imbalanced busiest sg:

  if (busiest->group_type == group_imbalanced)
    busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);

is so low that fix_small_imbalance() won't be called later and
'env->imbalance' stays so low that load-balance of one 50% task to the
now idle cpu won't happen.

  if (env->imbalance < busiest->load_per_task)
    fix_small_imbalance(env, sds);

Having a lot of otherwise idle DIE sg's helps to keep
'sds->avg_load' low in comparison to 'busiest->load_per_task'.
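
To put rough numbers on that, here is a small user-space sketch, not
kernel code. It only models the two checks quoted above plus the usual
"total load scaled by total capacity" form of the domain average. The
helper names, the per-group capacity of 1024, the busiest load of 2048
(two tasks, load_per_task = 1024) and the fixed imbalance of 600 are all
made-up assumptions; in the real calculate_imbalance() 'env->imbalance'
depends on more than this.

```c
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Domain-wide average load: total load scaled by total capacity. */
static unsigned long domain_avg_load(unsigned long total_load,
				     unsigned long total_capacity)
{
	return SCHED_CAPACITY_SCALE * total_load / total_capacity;
}

/*
 * Toy version of the two checks quoted above (hypothetical helper).
 * Returns 1 if fix_small_imbalance() would be reached, 0 otherwise.
 */
static int reaches_fix_small_imbalance(unsigned long imbalance,
				       unsigned long load_per_task,
				       unsigned long avg_load,
				       int group_imbalanced)
{
	/* The group_imbalanced clamp that the patch removes. */
	if (group_imbalanced)
		load_per_task = min_ul(load_per_task, avg_load);

	return imbalance < load_per_task;
}

/*
 * Hypothetical helper: one busy sg with load 2048, all other sgs idle,
 * nr_groups sgs of capacity 1024 each, assumed imbalance of 600.
 */
static void show_case(unsigned long nr_groups)
{
	unsigned long avg = domain_avg_load(2048,
					    nr_groups * SCHED_CAPACITY_SCALE);

	printf("n=%lu avg_load=%lu clamp=%d no_clamp=%d\n", nr_groups, avg,
	       reaches_fix_small_imbalance(600, 1024, avg, 1),
	       reaches_fix_small_imbalance(600, 1024, avg, 0));
}
```

With these assumed numbers, show_case(2) gives avg_load = 1024 and the
small-imbalance path is reached with or without the clamp. show_case(4)
gives avg_load = 512, so the clamp drops load_per_task below the assumed
imbalance and the path is skipped, while without the clamp it is still
taken.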

> Because I have the feeling that while this patch cures things for you,
> you're fighting symptoms.

Unfortunately, I don't have a machine available with n >> 2 (on DIE or
NUMA) ...