Subject: Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
To: Peter Zijlstra
Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel@vger.kernel.org, Austin Christ, Tyler Baicar, Timur Tabi
References: <1499975708-31090-1-git-send-email-jhugo@codeaurora.org> <0f91065c-94cb-3780-0d77-f2be682086bf@arm.com> <20170726145407.rfswqxoclvezukwq@hirez.programming.kicks-ass.net>
From: Dietmar Eggemann
Message-ID: <5ddf061e-26a2-7151-adff-7ae339c848ac@arm.com>
Date: Fri, 28 Jul 2017 13:16:24 +0100
In-Reply-To: <20170726145407.rfswqxoclvezukwq@hirez.programming.kicks-ass.net>

On 26/07/17 15:54, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
>> Hi Jeffrey,
>>
>> On 13/07/17 20:55, Jeffrey Hugo wrote:

[...]

>>> Since the group imbalance path in calculate_imbalance() is at best a NOP
>>> but otherwise harmful, remove it.
>
> Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
> to make sense a long time ago, but smp-nice and then cgroups made a
> complete joke of things.
>
>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>> level sched groups.
>
> That'd be a NUMA box?

I don't think it's NUMA. SD level are MC, DIE w/ # DIE sg's >> 2.

[...]

>> but here the prefer_sibling handling (group overloaded) eclipses 'group
>> imbalance' the moment one of the cfs tasks can go to cpu2 so the if
>> condition you got rid of is a nop.
>>
>> I wonder if it is fair to say that your fix helps multi-cluster
>> (especially with n > 2) systems without SMT and with your first patch
>> [1] for this specific, cpu affinity restricted test cases.
>
> I tried on an IVB-EP with all the HT siblings unplugged, could not
> reproduce either. Still at n=2 though. Let me fire up an EX, that'll get
> me n=4.
>
> So this is 4 * 18 * 2 = 144 cpus:

Impressive ;-)

>
> # for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> # taskset -pc 0,18 $$
> # while :; do :; done & while :; do :; done &
>
> So I'm taking SMT out, affine to first and second MC group, start 2
> loops.
>
> Using another console I see them both using 100%.
>
> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> Then instantly they balance and I get 2x100% back.

Yeah, could reproduce on IVB-EP (2x10x2).

> Anything else I need to reproduce? (other than maybe a slightly less
> insane machine :-)

I guess what Jeff is trying to avoid is that 'busiest->load_per_task',
lowered to 'sds->avg_load' in case of an imbalanced busiest sg:

	if (busiest->group_type == group_imbalanced)
		busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);

is so low that later fix_small_imbalance() won't be called and
'env->imbalance' stays so low that load-balance of one 50% task to the now
idle cpu won't happen.

	if (env->imbalance < busiest->load_per_task)
		fix_small_imbalance(env, sds);

Having really a lot of otherwise idle DIE sg's helps to keep
'sds->avg_load' low in comparison to 'busiest->load_per_task'.

> Because I have the feeling that while this patch cures things for you,
> you're fighting symptoms.

Unfortunately, I don't have a machine available with n >> 2 (on DIE or NUMA) ...
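
To make the interaction between the two conditions quoted above easier to
follow, here is a minimal standalone sketch in plain C. It is not kernel
code: 'struct sketch_group', the 'sketch_*' helpers and the 'keep_clamp'
switch are hypothetical names invented for illustration, the average is a
plain mean over equal-capacity groups (an assumption; the real computation
scales by capacity), and any load or imbalance values fed into it are
made-up numbers rather than measurements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the busiest group's statistics. */
struct sketch_group {
	unsigned long load_per_task;
	bool imbalanced;	/* stands in for group_type == group_imbalanced */
};

static unsigned long sketch_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Assumption: the domain-wide average is modelled as the plain mean of the
 * per-group loads, i.e. all groups are treated as having equal capacity.
 */
static unsigned long sketch_avg_load(const unsigned long *group_load,
				     int nr_groups)
{
	unsigned long total = 0;
	int i;

	assert(nr_groups > 0);
	for (i = 0; i < nr_groups; i++)
		total += group_load[i];
	return total / (unsigned long)nr_groups;
}

/*
 * Models only the two quoted checks. With keep_clamp set, load_per_task of
 * an imbalanced busiest group is first lowered to avg_load; the return
 * value says whether the 'imbalance < load_per_task' test would then pass,
 * i.e. whether fix_small_imbalance() would be reached.
 */
static bool sketch_calls_fix_small_imbalance(struct sketch_group busiest,
					     unsigned long avg_load,
					     unsigned long imbalance,
					     bool keep_clamp)
{
	if (keep_clamp && busiest.imbalanced)
		busiest.load_per_task = sketch_min(busiest.load_per_task,
						   avg_load);
	return imbalance < busiest.load_per_task;
}
```

As an illustration with invented numbers: take a busiest group with
load_per_task = 512, three otherwise idle groups next to it (loads 1024, 0,
0, 0, so the modelled average is 256) and an imbalance of 300. With the
clamp, 300 < 256 fails and the small-imbalance path is skipped; without it,
300 < 512 passes. With only two groups (loads 1024, 0) the modelled average
is 512 and the clamp changes nothing, which would fit with n=2 not showing
the problem.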