Date: Wed, 26 Jul 2017 16:54:07 +0200
From: Peter Zijlstra
To: Dietmar Eggemann
Cc: Jeffrey Hugo, Ingo Molnar, linux-kernel@vger.kernel.org,
	Austin Christ, Tyler Baicar, Timur Tabi
Subject: Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
Message-ID: <20170726145407.rfswqxoclvezukwq@hirez.programming.kicks-ass.net>
References: <1499975708-31090-1-git-send-email-jhugo@codeaurora.org>
 <0f91065c-94cb-3780-0d77-f2be682086bf@arm.com>
In-Reply-To: <0f91065c-94cb-3780-0d77-f2be682086bf@arm.com>

On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
> Hi Jeffrey,
>
> On 13/07/17 20:55, Jeffrey Hugo wrote:
> > The group_imbalance path in calculate_imbalance() made sense when it was
> > added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
> > balance across sched domain") because busiest->load_per_task factored into
> > the amount of imbalance that was calculated. Beginning with commit
> > dd5feea14a7d ("sched: Fix SCHED_MC regression caused by change in sched
> > cpu_power"), busiest->load_per_task is not a factor in the imbalance
> > calculation, thus the group_imbalance path no longer makes sense.

Bit quick, that. If it's no longer used, then who cares what value it
is... /me reads on.

> You're referring here to the use of 'sds->max_load -
> sds->busiest_load_per_task' in the calculation of max_pull, which got
> replaced by load_above_capacity with dd5feea14a7d?
>
> I still wonder if the original code (908a7c1b9b80)
>
>   if (group_imb)
>     busiest_load_per_task = min(busiest_load_per_task, avg_load);
>
> had something to do with the following:
>
>   if (max_load <= busiest_load_per_task)
>     goto out_balanced;

Quite possibly, yes. By lowering busiest_load_per_task it skips that
test. But also, as noted, the lower busiest_load_per_task is then used
in the imbalance calculation to allow moving more load around, so it's
not only that.

> > The group_imbalance path can only affect the outcome of
> > calculate_imbalance() when the average load of the domain is less than the
> > original busiest->load_per_task. In this case, busiest->load_per_task is
> > overwritten with the scheduling domain load average. Thus
> > busiest->load_per_task no longer represents actual load that can be moved.
> >
> > At the final comparison between env->imbalance and busiest->load_per_task,
> > imbalance may be larger than the new busiest->load_per_task, causing the
> > check to fail under the assumption that there is a task that could be
> > migrated to satisfy the imbalance. However, env->imbalance may still be
> > smaller than the original busiest->load_per_task, thus it is unlikely that
> > there is a task that can be migrated to satisfy the imbalance.
> > calculate_imbalance() would not choose to run fix_small_imbalance() when we
> > expect it should. In the worst case, this can result in idle cpus.
> >
> > Since the group imbalance path in calculate_imbalance() is at best a NOP
> > but otherwise harmful, remove it.

Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
to make sense a long time ago, but smp-nice and then cgroups made a
complete joke of things.
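For reference, the interaction in question looks roughly like this in
calculate_imbalance() (paraphrased from memory, so treat it as a sketch
of the relevant hunks rather than the exact fair.c code):

	/* sketch of kernel/sched/fair.c:calculate_imbalance(), not verbatim */
	if (busiest->group_type == group_imbalanced) {
		/*
		 * In the group_imb case we cannot rely on group-wide
		 * averages, so clamp to the wider (domain) average.
		 */
		busiest->load_per_task =
			min(busiest->load_per_task, sds->avg_load);
	}

	/* ... env->imbalance gets computed ... */

	/*
	 * If the computed imbalance is smaller than what we think a single
	 * task weighs, there is no guarantee any task will actually move,
	 * so punt to fix_small_imbalance() to maybe bump it.
	 */
	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);

Lowering busiest->load_per_task can only make that last test less likely
to fire, which is exactly the "would not choose to run
fix_small_imbalance() when we expect it should" case the changelog above
complains about.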
> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
> level sched groups.

That'd be a NUMA box?

> Running the testcase 'taskset 0x05 '2 always running tasks'' (both tasks
> starting on cpu0) on your machine shows the issue, since with your
> previous patch [1] "sched/fair: Fix load_balance() affinity redo path"
> we now propagate 'group imbalance' from MC level to DIE level, and since
> you have n > 2 you lower busiest->load_per_task in this group_imbalanced
> related if condition all the time and env->imbalance stays too small to
> let one of these tasks migrate to cpu2.
>
> Tried to test it on an Intel i5-3320M (2 cores x 2 HT) with rt-app (2
> always running cfs tasks with affinity 0x05 for 2*x ms and one rt task
> affine to 0x04 for x ms):
>
> # cat /proc/schedstat | grep ^domain | awk '{ print $1" "$2}'
> domain0 03
> domain1 0f
> domain0 03
> domain1 0f
> domain0 0c
> domain1 0f
> domain0 0c
> domain1 0f
>
> but here the prefer_sibling handling (group overloaded) eclipses 'group
> imbalance' the moment one of the cfs tasks can go to cpu2, so the if
> condition you got rid of is a NOP.
>
> I wonder if it is fair to say that your fix (together with your first
> patch [1]) helps multi-cluster (especially n > 2) systems without SMT
> for this specific, cpu-affinity-restricted test case.

I tried on an IVB-EP with all the HT siblings unplugged, could not
reproduce either. Still at n=2 though.

Let me fire up an EX, that'll get me n=4.

So this is 4 * 18 * 2 = 144 cpus:

# for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
# taskset -pc 0,18 $$
# while :; do :; done & while :; do :; done &

So I'm taking SMT out, affining to the first and second MC group, and
starting 2 loops. Using another console I see them both using 100%.

If I then start a 3rd loop, I see 100%, 50%, 50%. I then kill the 100%
one. Then they instantly rebalance and I get 2x100% back.

Anything else I need to reproduce? (other than maybe a slightly less
insane machine :-)

Because I have the feeling that while this patch cures things for you,
you're fighting symptoms.