Date: Wed, 5 Jul 2017 13:22:05 +0200
From: Peter Zijlstra
To: Jeffrey Hugo
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Morten Rasmussen
Subject: Re: [PATCH V5 2/2] sched/fair: Remove group imbalance from calculate_imbalance()
Message-ID: <20170705112205.tyeprtki4vel5kpx@hirez.programming.kicks-ass.net>
References: <1496863138-11322-1-git-send-email-jhugo@codeaurora.org> <1496863138-11322-3-git-send-email-jhugo@codeaurora.org>
In-Reply-To: <1496863138-11322-3-git-send-email-jhugo@codeaurora.org>

On Wed, Jun 07, 2017 at 01:18:58PM -0600, Jeffrey Hugo wrote:
> The group_imbalance path in calculate_imbalance() made sense when it was
> added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
> balance across sched domain") because busiest->load_per_task factored into
> the amount of imbalance that was calculated. That is not the case today.

It would be nice to have some more information on which patch(es) changed
that.

> The group_imbalance path can only affect the outcome of
> calculate_imbalance() when the average load of the domain is less than the
> original busiest->load_per_task. In that case, busiest->load_per_task is
> overwritten with the scheduling domain load average, so
> busiest->load_per_task no longer represents actual load that can be moved.
>
> At the final comparison between env->imbalance and busiest->load_per_task,
> imbalance may be larger than the new busiest->load_per_task, causing the
> check to fail under the assumption that there is a task that could be
> migrated to satisfy the imbalance. However, env->imbalance may still be
> smaller than the original busiest->load_per_task, so it is unlikely that
> there is a task that can be migrated to satisfy the imbalance.
> calculate_imbalance() therefore does not fall back to fix_small_imbalance()
> when we would expect it to. In the worst case, this can result in idle CPUs.
>
> Since the group imbalance path in calculate_imbalance() is at best a NOP
> but otherwise harmful, remove it.

load_per_task is horrible and should die. Ever since we did cgroup support
the number has been complete crap, but even before that the concept was
dubious. Most of the logic that uses the number stems from the pre-smp-nice
era.

This of course also means that fix_small_imbalance() is probably a load of
crap.

Digging through all that has been on the todo list for a long while, but
somehow it's not something I've ever gotten to :/
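
[Editor's note: for reference, the path the patch removes sat in
calculate_imbalance() in kernel/sched/fair.c. The following is a trimmed
sketch of the ~v4.11-era code, paraphrased rather than quoted verbatim,
with the unrelated imbalance arithmetic elided:]

static inline void
calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
	struct sg_lb_stats *local = &sds->local_stat;
	struct sg_lb_stats *busiest = &sds->busiest_stat;

	if (busiest->group_type == group_imbalanced) {
		/*
		 * The path being removed: when the domain-wide average
		 * load (sds->avg_load) is below busiest->load_per_task,
		 * this clamp overwrites load_per_task with that average,
		 * so it no longer reflects the load of any movable task.
		 */
		busiest->load_per_task =
			min(busiest->load_per_task, sds->avg_load);
	}

	/* ... imbalance computation elided ... */

	/*
	 * The final comparison the commit message refers to: with the
	 * clamped (smaller) load_per_task, env->imbalance can exceed it
	 * even though no task that small may exist to migrate, so
	 * fix_small_imbalance() is skipped when it arguably should run.
	 */
	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);
}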