Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753155AbdH1I45 (ORCPT ); Mon, 28 Aug 2017 04:56:57 -0400 Received: from mail-it0-f45.google.com ([209.85.214.45]:35928 "EHLO mail-it0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752375AbdH1I4z (ORCPT ); Mon, 28 Aug 2017 04:56:55 -0400 MIME-Version: 1.0 In-Reply-To: <87378f1w87.fsf@arm.com> References: <20170825101632.28065-1-brendan.jackman@arm.com> <20170825101632.28065-6-brendan.jackman@arm.com> <87378f1w87.fsf@arm.com> From: Vincent Guittot Date: Mon, 28 Aug 2017 10:56:34 +0200 Message-ID: Subject: Re: [PATCH v2 5/5] sched/fair: Fix use of find_idlest_group when local group is idlest. To: Brendan Jackman Cc: Ingo Molnar , Peter Zijlstra , linux-kernel , Joel Fernandes , Andres Oportus , Dietmar Eggemann , Josef Bacik , Morten Rasmussen Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8798 Lines: 210 On 25 August 2017 at 17:51, Brendan Jackman wrote: > > On Fri, Aug 25 2017 at 13:38, Vincent Guittot wrote: >> On 25 August 2017 at 12:16, Brendan Jackman wrote: >>> find_idlest_group currently returns NULL when the local group is >>> idlest. The caller then continues the find_idlest_group search at a >>> lower level of the current CPU's sched_domain hierarchy. However >>> find_idlest_group_cpu is not consulted and, crucially, @new_cpu is >>> not updated. This means the search is pointless and we return >>> @prev_cpu from select_task_rq_fair. >>> >>> This patch makes find_idlest_group return the idlest group, and never >>> NULL, and then has the caller unconditionally continue to consult >>> find_idlest_group_cpu and update @new_cpu. >>> >>> * * * >>> >>> An alternative, simpler, fix for this would be to initialize @new_cpu >>> to @cpu instead of @prev_cpu, at the beginning of the new >>> find_idlest_cpu. However this would not fix the fact that >> >> Yes this is simpler >> >>> find_idlest_group_cpu goes unused, and we miss out on the bias toward >>> a shallower-idle CPU, unless find_idlest_group picks a non-local >>> group. >> >> I'm not sure to catch why it's not enough. >> If no cpu of sched_domain span is allowed, we continue to return prev_cpu >> But if at least 1 cpu is allowed, the default new_cpu is cpu in case >> it is really the idlest and no other group will be returned by >> find_idlest_group >> Otherwise, find_idlest_group will finally return another group than >> the local one when running with sd->child > > This patch isn't about affinity but just about which group is idlest. So this patch changes 2 things: - don't return @prev if @cpu is the idlest one - always call find_idlest_group_cpu even for local group The 1st point is a bug but the 2nd one is more a matter of policy. Current implementation favors locality and load/capacity over current idle state of CPU which is only used at a last resort to select a cpu. What you are proposing, it's to increase the weight of the current idle state of cpus in the selection of the cpu in order to become sometime higher than the locality of even the load/capacity > > Consider a 4 CPU system with 2 shared caches where CPU0's sched_domains > look like: > > DIE: [0 1][2 3] > MC: [0][1] > > Suppose CPUs 0 and 1 are equally idle wrt load and util, but CPU1 is in > a shallower idle state. Suppose all CPUs are allowed. > > Assuming @cpu is 0, on the first iteration, find_idlest_group (fig) will > currently return NULL meaning [0 1], then on the second iteration it > will return NULL meaning [0]. > > With this patch, on the first iteration fig will return [0 1], then > find_idlest_group_cpu (figc) will return 1. Then on the second > iteration, fig should return [1] (because it is biased towards the local > group i.e. [@cpu] i.e. [1]). This is better because CPU1 is in a > shallower idle state. > > So basically we're currently missing out on the biasing in figc, unless > we at some point return a non-local group. > > Does that make sense? With your previous example if the load/capacity of CPU0 is significantly lower than CPU1, we will also select CPU1 instead of CPU0: @cpu is 0. fig returns local group [0 1] figc returns 1 @new_cpu != @cpu so we don't go to child cpu = new cpu sd = NULL weight = 4 sd stays null because weight == tmp->span_weight (DIE) we don't loop at lower level fic returns CPU1 > >> >> Is there sched_domain topology where the sched_group of lowest level >> is not with only 1 cpu ? >> >>> >>> If that alternative solution was preferred, then some code could be >>> removed in recognition of the fact that when find_idlest_group_cpu >>> was called, @cpu would never be a member of @group (because @group >>> would be a non-local group, and since @sd is derived from @cpu, @cpu >>> would always be in @sd's local group). >>> >>> Signed-off-by: Brendan Jackman >>> Cc: Dietmar Eggemann >>> Cc: Vincent Guittot >>> Cc: Josef Bacik >>> Cc: Ingo Molnar >>> Cc: Morten Rasmussen >>> Cc: Peter Zijlstra >>> --- >>> kernel/sched/fair.c | 29 ++++++++++------------------- >>> 1 file changed, 10 insertions(+), 19 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 26080917ff8d..93e2746d3c30 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -5384,11 +5384,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) >>> * Assumes p is allowed on at least one CPU in sd. >>> */ >>> static struct sched_group * >>> -find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> - int this_cpu, int sd_flag) >>> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag) >>> { >>> - struct sched_group *idlest = NULL, *group = sd->groups; >>> - struct sched_group *most_spare_sg = NULL; >>> + struct sched_group *group = sd->groups, *local_group = sd->groups; >>> + struct sched_group *idlest = NULL, *most_spare_sg = NULL; >>> unsigned long min_runnable_load = ULONG_MAX; >>> unsigned long this_runnable_load = ULONG_MAX; >>> unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX; >>> @@ -5404,7 +5403,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> do { >>> unsigned long load, avg_load, runnable_load; >>> unsigned long spare_cap, max_spare_cap; >>> - int local_group; >>> int i; >>> >>> /* Skip over this group if it has no CPUs allowed */ >>> @@ -5412,9 +5410,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> &p->cpus_allowed)) >>> continue; >>> >>> - local_group = cpumask_test_cpu(this_cpu, >>> - sched_group_span(group)); >>> - >>> /* >>> * Tally up the load of all CPUs in the group and find >>> * the group containing the CPU with most spare capacity. >>> @@ -5425,7 +5420,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> >>> for_each_cpu(i, sched_group_span(group)) { >>> /* Bias balancing toward cpus of our domain */ >>> - if (local_group) >>> + if (group == local_group) >>> load = source_load(i, load_idx); >>> else >>> load = target_load(i, load_idx); >>> @@ -5446,7 +5441,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / >>> group->sgc->capacity; >>> >>> - if (local_group) { >>> + if (group == local_group) { >>> this_runnable_load = runnable_load; >>> this_avg_load = avg_load; >>> this_spare = max_spare_cap; >>> @@ -5492,21 +5487,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, >>> >>> if (this_spare > task_util(p) / 2 && >>> imbalance_scale*this_spare > 100*most_spare) >>> - return NULL; >>> + return local_group; >>> >>> if (most_spare > task_util(p) / 2) >>> return most_spare_sg; >>> >>> skip_spare: >>> if (!idlest) >>> - return NULL; >>> + return local_group; >>> >>> if (min_runnable_load > (this_runnable_load + imbalance)) >>> - return NULL; >>> + return local_group; >>> >>> if ((this_runnable_load < (min_runnable_load + imbalance)) && >>> (100*this_avg_load < imbalance_scale*min_avg_load)) >>> - return NULL; >>> + return local_group; >>> >>> return idlest; >>> } >>> @@ -5582,11 +5577,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p >>> continue; >>> } >>> >>> - group = find_idlest_group(sd, p, cpu, sd_flag); >>> - if (!group) { >>> - sd = sd->child; >>> - continue; >>> - } >>> + group = find_idlest_group(sd, p, sd_flag); >>> >>> new_cpu = find_idlest_group_cpu(group, p, cpu); >>> if (new_cpu == cpu) { >>> -- >>> 2.14.1 >>> >