Date: Fri, 28 Jul 2017 14:59:24 +0200
From: Peter Zijlstra
To: Dietmar Eggemann
Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel@vger.kernel.org, Austin Christ, Tyler Baicar, Timur Tabi
Subject: Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
Message-ID: <20170728125924.eaw6unxude2qiyym@hirez.programming.kicks-ass.net>
References: <1499975708-31090-1-git-send-email-jhugo@codeaurora.org> <0f91065c-94cb-3780-0d77-f2be682086bf@arm.com> <20170726145407.rfswqxoclvezukwq@hirez.programming.kicks-ass.net> <5ddf061e-26a2-7151-adff-7ae339c848ac@arm.com>
In-Reply-To: <5ddf061e-26a2-7151-adff-7ae339c848ac@arm.com>

On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:

> >> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
> >> level sched groups.
> >
> > That'd be a NUMA box?
>
> I don't think it's NUMA. SD levels are MC and DIE, w/ # DIE sg's >> 2.

Ah, I can't read. I thought >2 DIEs.

> > So this is 4 * 18 * 2 = 144 cpus:
>
> Impressive ;-) Takes forever to boot though :/

> > If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> > Then instantly they balance and I get 2x100% back.
>
> Yeah, could reproduce on IVB-EP (2x10x2).

OK, I have one of those. What should I do? Because I didn't actually see
anything odd.

> > Anything else I need to reproduce? (other than maybe a slightly less
> > insane machine :-)
>
> I guess what Jeff is trying to avoid is that 'busiest->load_per_task',
> lowered to 'sds->avg_load' in case of an imbalanced busiest sg:
>
>     if (busiest->group_type == group_imbalanced)
>         busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);
>
> is so low that later fix_small_imbalance() won't be called, and
> 'env->imbalance' stays so low that load-balancing one 50% task to the
> now idle cpu won't happen:
>
>     if (env->imbalance < busiest->load_per_task)
>         fix_small_imbalance(env, sds);
>
> Having really a lot of otherwise idle DIE sg's helps to keep
> 'sds->avg_load' low in comparison to 'busiest->load_per_task'.

Right, but the whole load_per_task thing is a bit wonky, and since
that's the basis of fix_small_imbalance() I'm very suspicious of it.