2013-10-21 11:44:52

by Vaidyanathan Srinivasan

Subject: [PATCH 0/3] sched: Fixes for task placement in SMT threads

Hi,

The following series fixes scheduler load-balancing issues where we
miss opportunities to place tasks optimally on SMT threads, especially
on a POWER7 system.

PATCH 1/3 fixes the scenario where load balancing fails to move tasks
away from the cpus in a domain that has SD_SHARE_PKG_RESOURCES set, even
when there are idle cpus in other domains.

PATCH 2/3 ensures that lower-order SMT threads are used for the
SD_ASYM_PACKING load-balancing case.

PATCH 3/3 tries to fix the problem that is explained in its
changelog. The following experiment exposes the scenario:

A task placement test was conducted on a POWER7 system as well as a
multi-core x86 system. At the beginning, tasks are pinned to all the
threads within a core; later, only the tasks pinned to the secondary
threads in a core are unpinned. This leaves the primary thread with
tasks that are unmovable, while the rest are free to be moved.

Under the above setup, load balancing today does not move the unpinned
tasks away from the secondary threads of the core, although there are
other idle cpus. PATCH 3/3 fixes this situation and improves task
placement even if some of the tasks in a core are pinned.
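
For reference, a minimal user-space sketch of such a test could look like
the following (illustrative only: the cpu numbering assumes the SMT
siblings of one core are cpus 0-3, and the 10 second delay before
unpinning is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/wait.h>

static void burn_pinned(int cpu, int unpin_after)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);	/* pin to one thread */

	if (unpin_after) {
		sleep(unpin_after);
		/* unpin: allow all cpus again and let load balancing act */
		CPU_ZERO(&mask);
		for (int i = 0; i < CPU_SETSIZE; i++)
			CPU_SET(i, &mask);
		sched_setaffinity(0, sizeof(mask), &mask);
	}

	for (;;)		/* busy loop until the test is interrupted */
		;
}

int main(void)
{
	/* one task stays pinned to the primary thread (cpu 0) ... */
	if (fork() == 0)
		burn_pinned(0, 0);

	/* ... while the tasks on the secondary threads are unpinned later */
	for (int cpu = 1; cpu <= 3; cpu++)
		if (fork() == 0)
			burn_pinned(cpu, 10);

	wait(NULL);
	return 0;
}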

This series applies on v3.12-rc6 and has been tested on x86 and powerpc.

--Vaidy

---

Preeti U Murthy (2):
sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
sched: Aggressive balance in domains whose groups share package resources

Vaidyanathan Srinivasan (1):
sched: Fix asymmetric scheduling for POWER7


kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++--------
1 file changed, 33 insertions(+), 8 deletions(-)


2013-10-21 11:44:55

by Vaidyanathan Srinivasan

Subject: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

From: Preeti U Murthy <[email protected]>

In nohz_kick_needed() there are checks around the SD_SHARE_PKG_RESOURCES
flag which decide to initiate nohz_balance if the domains with this flag
set have more than one busy cpu. Therefore, at every such domain, a check
has to be made on the nr_busy of that domain. Since nr_busy is a
parameter associated with a group, this means checking the sum of the
nr_busy of each group in that domain. However, in the current
implementation of nohz_kick_needed(), nr_busy is checked for just the
group to which the cpu that initiated this check belongs. This gives us
the wrong count of the number of busy cpus in that domain.

The following commit, which fixed the sgp->nr_busy_cpus computation,
actually exposed the bug in nohz_kick_needed(), which had been working
only because nr_busy was incorrectly > 1:

25f55d9d01ad7a7ad248fd5af1d22675ffd202c5
sched: Fix init NOHZ_IDLE flag

To illustrate the scenario, consider a core whose domain has
SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when we
find that more than one thread in the core is busy. With the current
implementation of nohz_kick_needed(), at this domain (sd) the nr_busy
will always be 1, since it is read for sd->groups, which encompasses
only a single thread, while we want this parameter for
sd->parent->groups, which rightly reflects the number of busy threads
in the core.
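
A stand-alone toy model of the pointer relationship described above (not
kernel code; the numbers are made up for illustration):

#include <stdio.h>

struct group  { int nr_busy_cpus; };
struct domain { struct group *groups; struct domain *parent; };

int main(void)
{
	struct group  thread_grp = { .nr_busy_cpus = 1 };  /* one SMT thread      */
	struct group  core_grp   = { .nr_busy_cpus = 3 };  /* 3 of 4 threads busy */
	struct domain core_dom   = { .groups = &core_grp,   .parent = NULL };
	struct domain smt_dom    = { .groups = &thread_grp, .parent = &core_dom };

	/* what the current code reads -- never more than 1 at this level */
	printf("sd->groups->nr_busy_cpus         = %d\n",
	       smt_dom.groups->nr_busy_cpus);

	/* what nohz_kick_needed() actually wants to know */
	printf("sd->parent->groups->nr_busy_cpus = %d\n",
	       smt_dom.parent->groups->nr_busy_cpus);
	return 0;
}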

This patch also ensures that the check for SD_SHARE_PKG_RESOURCES comes
before the check for SD_ASYM_PACKING. Priority is given to avoiding more
than one busy thread in a core as much as possible before attempting
asymmetric packing.

Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..12f0eab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)

rcu_read_lock();
for_each_domain(cpu, sd) {
- struct sched_group *sg = sd->groups;
- struct sched_group_power *sgp = sg->sgp;
- int nr_busy = atomic_read(&sgp->nr_busy_cpus);
-
- if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
- goto need_kick_unlock;
+ struct sched_domain *sd_parent = sd->parent;
+ struct sched_group *sg;
+ struct sched_group_power *sgp;
+ int nr_busy;
+
+ if (sd_parent) {
+ sg = sd_parent->groups;
+ sgp = sg->sgp;
+ nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+ if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+ goto need_kick_unlock;
+ }

if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
&& (cpumask_first_and(nohz.idle_cpus_mask,

2013-10-21 11:45:06

by Vaidyanathan Srinivasan

Subject: [PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

Asymmetric scheduling within a core is a scheduler loadbalancing
feature that is triggered when SD_ASYM_PACKING flag is set. The goal
for the load balancer is to move tasks to lower order idle SMT threads
within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check whether our sched domain
(core) is completely busy or has an idle cpu.

The following check for SD_ASYM_PACKING:

(cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu.

Hence, nr_busy check against group weight can be removed.

Reported-by: Michael Neuling <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12f0eab..828ed97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
goto need_kick_unlock;
}

- if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
- && (cpumask_first_and(nohz.idle_cpus_mask,
+ if (sd->flags & SD_ASYM_PACKING &&
+ (cpumask_first_and(nohz.idle_cpus_mask,
sched_domain_span(sd)) < cpu))
goto need_kick_unlock;

2013-10-21 11:45:17

by Vaidyanathan Srinivasan

Subject: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

From: Preeti U Murthy <[email protected]>

The current logic in load balance is such that after picking the
busiest group, the load is attempted to be moved from the busiest cpu
in that group to the dst_cpu. If the load cannot be moved from the
busiest cpu to dst_cpu due to either tsk_cpus_allowed mask or cache
hot tasks, then the dst_cpu is changed to be another idle cpu within
the dst->grpmask. If even then, the load cannot be moved from the
busiest cpu, then the source group is changed. The next busiest group
is found and the above steps are repeated.

However if the cpus in the group share package resources, then when
a load movement from the busiest cpu in this group fails as above,
instead of finding the next busiest group to move load from, find the
next busiest cpu *within the same group* from which to move load away.
By doing so, a conscious effort is made during load balancing to keep
just one cpu busy as much as possible within domains that have the
SD_SHARE_PKG_RESOURCES flag set, unless under scenarios of high load.
Having multiple busy cpus within a domain whose cpus share package
resources could lead to a performance hit.

A similar scenario arises in active load balancing as well. When the
current task on the busiest cpu cannot be moved away due to task
pinning, currently no more attempts at load balancing are made. This
patch checks if the balancing is being done on a group whose cpus
share package resources. If so, it then checks if the load balancing
can be done for other cpus in the same group.

Signed-off-by: Preeti U Murthy <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 828ed97..bbcd96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
{
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_group *group;
+ struct sched_domain *child;
+ int share_pkg_res = 0;
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_mask);
@@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,

schedstat_inc(sd, lb_count[idle]);

+ child = sd->child;
+ if (child && child->flags & SD_SHARE_PKG_RESOURCES)
+ share_pkg_res = 1;
+
redo:
if (!should_we_balance(&env)) {
*continue_balancing = 0;
@@ -5202,6 +5208,7 @@ redo:
goto out_balanced;
}

+redo_grp:
busiest = find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
@@ -5292,6 +5299,11 @@ more_balance:
if (!cpumask_empty(cpus)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
+ if (share_pkg_res &&
+ cpumask_intersects(cpus,
+ to_cpumask(group->cpumask)))
+ goto redo_grp;
+
goto redo;
}
goto out_balanced;
@@ -5318,9 +5330,15 @@ more_balance:
*/
if (!cpumask_test_cpu(this_cpu,
tsk_cpus_allowed(busiest->curr))) {
+ cpumask_clear_cpu(cpu_of(busiest), cpus);
raw_spin_unlock_irqrestore(&busiest->lock,
flags);
env.flags |= LBF_ALL_PINNED;
+ if (share_pkg_res &&
+ cpumask_intersects(cpus,
+ to_cpumask(group->cpumask)))
+ goto redo_grp;
+
goto out_one_pinned;
}

2013-10-21 22:55:34

by Michael Neuling

Subject: Re: [PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

Vaidyanathan Srinivasan <[email protected]> wrote:

> Asymmetric scheduling within a core is a scheduler loadbalancing
> feature that is triggered when SD_ASYM_PACKING flag is set. The goal
> for the load balancer is to move tasks to lower order idle SMT threads
> within a core on a POWER7 system.
>
> In nohz_kick_needed(), we intend to check whether our sched domain
> (core) is completely busy or has an idle cpu.
>
> The following check for SD_ASYM_PACKING:
>
> (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)
>
> already covers the case of checking if the domain has an idle cpu,
> because cpumask_first_and() will not yield any set bits if this domain
> has no idle cpu.
>
> Hence, nr_busy check against group weight can be removed.
>
> Reported-by: Michael Neuling <[email protected]>

Tested-by: Michael Neuling <[email protected]>

Peter, I tested this only a brief while back but it turned out my test
wasn't stringent enough and it was actually broken (in v3.9). This
fixes it.

Mikey

> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> Signed-off-by: Preeti U Murthy <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12f0eab..828ed97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> goto need_kick_unlock;
> }
>
> - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
> - && (cpumask_first_and(nohz.idle_cpus_mask,
> + if (sd->flags & SD_ASYM_PACKING &&
> + (cpumask_first_and(nohz.idle_cpus_mask,
> sched_domain_span(sd)) < cpu))
> goto need_kick_unlock;
>
>

2013-10-22 14:36:27

by Kamalesh Babulal

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

* Vaidyanathan Srinivasan <[email protected]> [2013-10-21 17:14:42]:

> for_each_domain(cpu, sd) {
> - struct sched_group *sg = sd->groups;
> - struct sched_group_power *sgp = sg->sgp;
> - int nr_busy = atomic_read(&sgp->nr_busy_cpus);
> -
> - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> - goto need_kick_unlock;
> + struct sched_domain *sd_parent = sd->parent;
> + struct sched_group *sg;
> + struct sched_group_power *sgp;
> + int nr_busy;
> +
> + if (sd_parent) {
> + sg = sd_parent->groups;
> + sgp = sg->sgp;
> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
> +
> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> + goto need_kick_unlock;
> + }
>
> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
> && (cpumask_first_and(nohz.idle_cpus_mask,

CC'ing Suresh Siddha and Vincent Guittot

Please correct me if my understanding of idle balancing is wrong.
With the proposed approach, will the idle load balancer not kick in even
if there are busy cpus across groups, or if there are 2 busy cpus which
are spread across sockets?

Consider a 2-socket machine with 4 processors each (MC and NUMA domains).
If the machine is partially loaded such that cpus 0,4,5,6,7 are busy, nohz
balancing is still triggered, because with this approach
(NUMA)->groups->sgp->nr_busy_cpus is taken into account for the nohz kick
while iterating over the MC domain.

Isn't the idle load balancer supposed to kick in even in the case of two
busy cpus in a dual-core single-socket system?

Thanks,
Kamalesh.

2013-10-22 16:43:40

by Preeti U Murthy

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Hi Kamalesh,

On 10/22/2013 08:05 PM, Kamalesh Babulal wrote:
> * Vaidyanathan Srinivasan <[email protected]> [2013-10-21 17:14:42]:
>
>> for_each_domain(cpu, sd) {
>> - struct sched_group *sg = sd->groups;
>> - struct sched_group_power *sgp = sg->sgp;
>> - int nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> -
>> - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> - goto need_kick_unlock;
>> + struct sched_domain *sd_parent = sd->parent;
>> + struct sched_group *sg;
>> + struct sched_group_power *sgp;
>> + int nr_busy;
>> +
>> + if (sd_parent) {
>> + sg = sd_parent->groups;
>> + sgp = sg->sgp;
>> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> + goto need_kick_unlock;
>> + }
>>
>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>> && (cpumask_first_and(nohz.idle_cpus_mask,
>
> CC'ing Suresh Siddha and Vincent Guittot
>
> Please correct me if my understanding of idle balancing is wrong.
> With the proposed approach, will the idle load balancer not kick in even
> if there are busy cpus across groups, or if there are 2 busy cpus which
> are spread across sockets?

Yes load balancing will happen on busy cpus periodically.

Wrt idle balancing there are two points here. One, when a CPU is just
about to go idle, it will enter idle_balance(), and trigger load
balancing with itself being the destination CPU to begin with. It will
load balance at every level of the sched domain that it belongs to. If
it manages to pull tasks, good, else it will enter an idle state.

nohz_idle_balancing is triggered by a busy cpu at every tick if it has
more than one task in its runqueue or if it belongs to a group that
shares the package resources and has more than one busy cpu. By
"nohz_idle_balance triggered", I mean that the busy cpu will send an IPI
to the ilb_cpu to do load balancing on behalf of the idle cpus in the
nohz mask.
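
The two trigger conditions can be restated in a stand-alone form as
follows (a simplified sketch, not the kernel code; the function and
parameter names are illustrative only):

#include <stdbool.h>

/*
 * A busy cpu evaluates this at every tick; if it returns true, the busy
 * cpu sends an IPI to the ilb_cpu so that it balances on behalf of the
 * idle cpus in the nohz mask.
 */
bool should_kick_nohz_idle_balance(int nr_running_here,
				   bool group_shares_pkg_resources,
				   int nr_busy_cpus_in_group)
{
	/* condition 1: this cpu alone has more than one runnable task */
	if (nr_running_here > 1)
		return true;

	/* condition 2: the package-sharing group has more than one busy cpu */
	if (group_shares_pkg_resources && nr_busy_cpus_in_group > 1)
		return true;

	return false;
}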

So to answer your question wrt this patch, if there is one busy cpu with
say 2 tasks in one socket and another busy cpu with 1 task on another
socket, the former busy cpu can kick nohz_idle_balance since it has more
than one task in its runqueue. An idle cpu in either socket could be
woken up to balance tasks with it.

The usual idle load balancer that runs on a CPU about to become idle
could pull from either cpu depending on who is more busy as it begins to
load balance across all levels of sched domain that it belongs to.
>
> Consider a 2-socket machine with 4 processors each (MC and NUMA domains).
> If the machine is partially loaded such that cpus 0,4,5,6,7 are busy, nohz
> balancing is still triggered, because with this approach
> (NUMA)->groups->sgp->nr_busy_cpus is taken into account for the nohz kick
> while iterating over the MC domain.

For the example that you mention, you will have a CPU domain and a NUMA
domain. When the sockets are NUMA nodes, each socket will belong to a
CPU domain. If the sockets are non-numa nodes, then the domain
encompassing both the nodes will be a CPU domain, possibly with each
socket being an MC domain.
>
> Isn't the idle load balancer supposed to kick in even in the case of two
> busy cpus in a dual-core single-socket system?

nohz_idle_balancing is a special case. It is triggered when the
conditions mentioned in nohz_kick_needed() are true. A CPU just about to
go idle will trigger load balancing without any pre-conditions.

In a single socket machine, there will be a CPU domain encompassing the
socket and the MC domain will encompass a core. nohz_idle load balancer
will kick in if both the threads in the core have tasks running on them.
This is fair enough because the threads share the resources of the core.

Regards
Preeti U Murthy
>
> Thanks,
> Kamalesh.
>

2013-10-22 22:12:03

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
> kernel/sched/fair.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7c70201..12f0eab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>
> rcu_read_lock();
> for_each_domain(cpu, sd) {
> + struct sched_domain *sd_parent = sd->parent;
> + struct sched_group *sg;
> + struct sched_group_power *sgp;
> + int nr_busy;
> +
> + if (sd_parent) {
> + sg = sd_parent->groups;
> + sgp = sg->sgp;
> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
> +
> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> + goto need_kick_unlock;
> + }
>
> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
> && (cpumask_first_and(nohz.idle_cpus_mask,
>

Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?

Also, this made me look at the nr_busy stuff again, and somehow that
entire thing makes me a little sad.

Can't we do something like the below and cut that nr_busy sd iteration
short?

This nohz stuff really needs to be re-thought and made more scalable --
it's a royal pain :/


kernel/sched/core.c | 4 ++++
kernel/sched/fair.c | 21 +++++++++++++++------
kernel/sched/sched.h | 5 ++---
3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c06b8d3..89db8dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
+DEFINE_PER_CPU(struct sched_domain *, sd_busy);

static void update_top_cache_domain(int cpu)
{
@@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)

sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING);
+ rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 813dd61..3d5141e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6512,19 +6512,23 @@ static inline void nohz_balance_exit_idle(int cpu)
}
}

-static inline void set_cpu_sd_state_busy(void)
+static inline void set_cpu_sd_state_busy(int cpu)
{
struct sched_domain *sd;
+ struct rq *rq = cpu_rq(cpu);

rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+ sd = rcu_dereference_check_sched_domain(rq->sd);

if (!sd || !sd->nohz_idle)
goto unlock;
sd->nohz_idle = 0;

- for (; sd; sd = sd->parent)
+ for (; sd; sd = sd->parent) {
atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+ if (sd == per_cpu(sd_busy, cpu))
+ break;
+ }
unlock:
rcu_read_unlock();
}
@@ -6532,16 +6536,21 @@ static inline void set_cpu_sd_state_busy(void)
void set_cpu_sd_state_idle(void)
{
struct sched_domain *sd;
+ int cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);

rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+ sd = rcu_dereference_check_sched_domain(rq->sd);

if (!sd || sd->nohz_idle)
goto unlock;
sd->nohz_idle = 1;

- for (; sd; sd = sd->parent)
+ for (; sd; sd = sd->parent) {
atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+ if (sd == per_cpu(sd_busy, cpu))
+ break;
+ }
unlock:
rcu_read_unlock();
}
@@ -6756,7 +6765,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
* We may be recently in ticked or tickless idle mode. At the first
* busy tick after returning from idle, we will update the busy stats.
*/
- set_cpu_sd_state_busy();
+ set_cpu_sd_state_busy(cpu);
nohz_balance_exit_idle(cpu);

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffc7087..80c5fd2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -599,9 +599,8 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
struct sched_domain *sd, *hsd = NULL;

for_each_domain(cpu, sd) {
- if (!(sd->flags & flag))
- break;
- hsd = sd;
+ if (sd->flags & flag)
+ hsd = sd;
}

return hsd;

2013-10-22 22:18:21

by Peter Zijlstra

Subject: Re: [PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

On Mon, Oct 21, 2013 at 05:14:52PM +0530, Vaidyanathan Srinivasan wrote:
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12f0eab..828ed97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> goto need_kick_unlock;
> }
>
> - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
> - && (cpumask_first_and(nohz.idle_cpus_mask,
> + if (sd->flags & SD_ASYM_PACKING &&
> + (cpumask_first_and(nohz.idle_cpus_mask,
> sched_domain_span(sd)) < cpu))
> goto need_kick_unlock;
>
>

Ahh, so here you remove the nr_busy usage.. this patch should really go
before the first one that makes this all weird and funny.

2013-10-22 22:23:38

by Peter Zijlstra

Subject: Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote:
> kernel/sched/fair.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 828ed97..bbcd96b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> {
> int ld_moved, cur_ld_moved, active_balance = 0;
> struct sched_group *group;
> + struct sched_domain *child;
> + int share_pkg_res = 0;
> struct rq *busiest;
> unsigned long flags;
> struct cpumask *cpus = __get_cpu_var(load_balance_mask);
> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
> schedstat_inc(sd, lb_count[idle]);
>
> + child = sd->child;
> + if (child && child->flags & SD_SHARE_PKG_RESOURCES)
> + share_pkg_res = 1;
> +
> redo:
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> @@ -5202,6 +5208,7 @@ redo:
> goto out_balanced;
> }
>
> +redo_grp:
> busiest = find_busiest_queue(&env, group);
> if (!busiest) {
> schedstat_inc(sd, lb_nobusyq[idle]);
> @@ -5292,6 +5299,11 @@ more_balance:
> if (!cpumask_empty(cpus)) {
> env.loop = 0;
> env.loop_break = sched_nr_migrate_break;
> + if (share_pkg_res &&
> + cpumask_intersects(cpus,
> + to_cpumask(group->cpumask)))

sched_group_cpus()

> + goto redo_grp;
> +
> goto redo;
> }
> goto out_balanced;
> @@ -5318,9 +5330,15 @@ more_balance:
> */
> if (!cpumask_test_cpu(this_cpu,
> tsk_cpus_allowed(busiest->curr))) {
> + cpumask_clear_cpu(cpu_of(busiest), cpus);
> raw_spin_unlock_irqrestore(&busiest->lock,
> flags);
> env.flags |= LBF_ALL_PINNED;
> + if (share_pkg_res &&
> + cpumask_intersects(cpus,
> + to_cpumask(group->cpumask)))
> + goto redo_grp;
> +
> goto out_one_pinned;
> }

Man this retry logic is getting annoying.. isn't there anything saner we
can do?

2013-10-23 04:03:58

by Preeti U Murthy

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Hi Peter,

On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>> kernel/sched/fair.c | 19 +++++++++++++------
>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7c70201..12f0eab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>>
>> rcu_read_lock();
>> for_each_domain(cpu, sd) {
>> + struct sched_domain *sd_parent = sd->parent;
>> + struct sched_group *sg;
>> + struct sched_group_power *sgp;
>> + int nr_busy;
>> +
>> + if (sd_parent) {
>> + sg = sd_parent->groups;
>> + sgp = sg->sgp;
>> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> + goto need_kick_unlock;
>> + }
>>
>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>> && (cpumask_first_and(nohz.idle_cpus_mask,
>>
>
> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?

You are right, sorry about this. The idea was to correct the nr_busy
computation before the second patch removed its usage. But that would
mean the condition nr_busy != sg->group_weight would be invalid with
this patch. The second patch needs to go first to avoid this confusion.

>
> Also, this made me look at the nr_busy stuff again, and somehow that
> entire thing makes me a little sad.
>
> Can't we do something like the below and cut that nr_busy sd iteration
> short?

We can surely cut the nr_busy sd iteration short, but not the way this
patch does it. You stop the nr_busy computation at the sched domain
that has the SD_SHARE_PKG_RESOURCES flag set. But nohz_kick_needed()
would want to know the nr_busy for one level above this.
Consider a core. Assume it is the highest domain with this flag set.
The nr_busy of its groups, which are the individual logical threads, is
set to 1 or 0 each. But nohz_kick_needed() would like to know the sum
of the nr_busy parameter of all the groups, i.e. the threads in a core,
before it decides if it can kick nohz_idle balancing. The information
about an individual group's nr_busy is of no relevance here.

That's why the above patch tries to get
sd->parent->groups->sgp->nr_busy_cpus. This will translate rightly to
the core's busy cpus in this example. But the below patch stops before
updating this parameter at the sd->parent level, where sd is the highest
level sched domain with the SD_SHARE_PKG_RESOURCES flag set.

But we can get around all this confusion if we can move the nr_busy
parameter into the sched_domain structure rather than the
sched_group_power structure. Anyway, the only place where nr_busy is
used, nohz_kick_needed(), needs to know the total number of busy cpus
at a sched domain level which has SD_SHARE_PKG_RESOURCES set, and not
at a sched group level.

So why not move nr_busy to struct sched_domain and have the below patch
just update this parameter for the sched domain sd_busy? This will
avoid iterating through all the levels of sched domains and should
resolve the scalability issue. We also don't need to go to sd->parent
to get the nr_busy parameter for the sake of nohz_kick_needed().

What do you think?

Regards
Preeti U Murthy
>
> This nohz stuff really needs to be re-thought and made more scalable --
> it's a royal pain :/
>
>
> kernel/sched/core.c | 4 ++++
> kernel/sched/fair.c | 21 +++++++++++++++------
> kernel/sched/sched.h | 5 ++---
> 3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..89db8dc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>
> static void update_top_cache_domain(int cpu)
> {
> @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING);
> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
> }
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 813dd61..3d5141e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6512,19 +6512,23 @@ static inline void nohz_balance_exit_idle(int cpu)
> }
> }
>
> -static inline void set_cpu_sd_state_busy(void)
> +static inline void set_cpu_sd_state_busy(int cpu)
> {
> struct sched_domain *sd;
> + struct rq *rq = cpu_rq(cpu);
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = rcu_dereference_check_sched_domain(rq->sd);
>
> if (!sd || !sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 0;
>
> - for (; sd; sd = sd->parent)
> + for (; sd; sd = sd->parent) {
> atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> + if (sd == per_cpu(sd_busy, cpu))
> + break;
> + }
> unlock:
> rcu_read_unlock();
> }
> @@ -6532,16 +6536,21 @@ static inline void set_cpu_sd_state_busy(void)
> void set_cpu_sd_state_idle(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
> + struct rq *rq = cpu_rq(cpu);
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = rcu_dereference_check_sched_domain(rq->sd);
>
> if (!sd || sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 1;
>
> - for (; sd; sd = sd->parent)
> + for (; sd; sd = sd->parent) {
> atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> + if (sd == per_cpu(sd_busy, cpu))
> + break;
> + }
> unlock:
> rcu_read_unlock();
> }
> @@ -6756,7 +6765,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> * We may be recently in ticked or tickless idle mode. At the first
> * busy tick after returning from idle, we will update the busy stats.
> */
> - set_cpu_sd_state_busy();
> + set_cpu_sd_state_busy(cpu);
> nohz_balance_exit_idle(cpu);
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ffc7087..80c5fd2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -599,9 +599,8 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
> struct sched_domain *sd, *hsd = NULL;
>
> for_each_domain(cpu, sd) {
> - if (!(sd->flags & flag))
> - break;
> - hsd = sd;
> + if (sd->flags & flag)
> + hsd = sd;
> }
>
> return hsd;
>

2013-10-23 04:24:50

by Preeti U Murthy

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

On 10/23/2013 09:30 AM, Preeti U Murthy wrote:
> Hi Peter,
>
> On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
>> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>>> kernel/sched/fair.c | 19 +++++++++++++------
>>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7c70201..12f0eab 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>>>
>>> rcu_read_lock();
>>> for_each_domain(cpu, sd) {
>>> + struct sched_domain *sd_parent = sd->parent;
>>> + struct sched_group *sg;
>>> + struct sched_group_power *sgp;
>>> + int nr_busy;
>>> +
>>> + if (sd_parent) {
>>> + sg = sd_parent->groups;
>>> + sgp = sg->sgp;
>>> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>>> +
>>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>>> + goto need_kick_unlock;
>>> + }
>>>
>>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>>> && (cpumask_first_and(nohz.idle_cpus_mask,
>>>
>>
>> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?
>
> You are right, sorry about this. The idea was to correct the nr_busy
> computation before the second patch removed its usage. But that would
> mean the condition nr_busy != sg->group_weight would be invalid with
> this patch. The second patch needs to go first to avoid this confusion.
>
>>
>> Also, this made me look at the nr_busy stuff again, and somehow that
>> entire thing makes me a little sad.
>>
>> Can't we do something like the below and cut that nr_busy sd iteration
>> short?
>
> We can surely cut the nr_busy sd iteration short, but not the way this
> patch does it. You stop the nr_busy computation at the sched domain
> that has the SD_SHARE_PKG_RESOURCES flag set. But nohz_kick_needed()
> would want to know the nr_busy for one level above this.
> Consider a core. Assume it is the highest domain with this flag set.
> The nr_busy of its groups, which are the individual logical threads, is
> set to 1 or 0 each. But nohz_kick_needed() would like to know the sum
> of the nr_busy parameter of all the groups, i.e. the threads in a core,
> before it decides if it can kick nohz_idle balancing. The information
> about an individual group's nr_busy is of no relevance here.
>
> That's why the above patch tries to get
> sd->parent->groups->sgp->nr_busy_cpus. This will translate rightly to
> the core's busy cpus in this example. But the below patch stops before
> updating this parameter at the sd->parent level, where sd is the highest
> level sched domain with the SD_SHARE_PKG_RESOURCES flag set.
>
> But we can get around all this confusion if we can move the nr_busy
> parameter into the sched_domain structure rather than the
> sched_group_power structure. Anyway, the only place where nr_busy is
> used, nohz_kick_needed(), needs to know the total number of busy cpus
> at a sched domain level which has SD_SHARE_PKG_RESOURCES set, and not
> at a sched group level.
>
> So why not move nr_busy to struct sched_domain and have the below patch
> just update this parameter for the sched domain sd_busy?

Oh this can't be done :( Domain structures are per cpu!

Regards
Preeti U Murthy

2013-10-23 09:53:22

by Preeti U Murthy

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Hi Peter

On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> This nohz stuff really needs to be re-thought and made more scalable --
> it's a royal pain :/

Why not do something like the below instead? It does the following.

This patch introduces sd_busy just like your suggested patch, except that
it points to the parent of the highest level sched domain which has the
SD_SHARE_PKG_RESOURCES set and initializes it in update_top_cache_domain().
This is the sched domain that is relevant in nohz_kick_needed().

set_cpu_sd_state_busy(), set_cpu_sd_state_idle() and nohz_kick_needed() query
and update *only* this sched domain (sd_busy) for nr_busy_cpus. They are the
only users of this parameter. While we are at it, we might as well change
the nohz_idle parameter to be updated at the sd_busy domain level alone and
not the base domain level of a CPU. This will unify the concept of busy cpus
at just one level of sched domain.

There is no need to iterate through all levels of sched domains of a cpu to
update nr_busy_cpus since it is irrelevant at all other sched domains except
at sd_busy level.

De-couple asymmetric load balancing from the nr_busy parameter which the
PATCH 2/3 anyway does. sd_busy therefore is irrelevant for asymmetric load
balancing.

Regards
Preeti U Murthy
--------------------START_PATCH-------------------------------

sched: Fix nohz_kick_needed()

---
kernel/sched/core.c | 4 ++++
kernel/sched/fair.c | 40 ++++++++++++++++++++++------------------
kernel/sched/sched.h | 1 +
3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c06b8d3..c1dd11c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
+DEFINE_PER_CPU(struct sched_domain *, sd_busy);

static void update_top_cache_domain(int cpu)
{
@@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)

sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)->parent;
+ rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 813dd61..71e6f14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu)
static inline void set_cpu_sd_state_busy(void)
{
struct sched_domain *sd;
+ int cpu = smp_processor_id();

rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+ sd = per_cpu(sd_busy, cpu);

if (!sd || !sd->nohz_idle)
goto unlock;
sd->nohz_idle = 0;

- for (; sd; sd = sd->parent)
- atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+ atomic_inc(&sd->groups->sgp->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
@@ -6532,16 +6532,16 @@ unlock:
void set_cpu_sd_state_idle(void)
{
struct sched_domain *sd;
+ int cpu = smp_processor_id();

rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+ sd = per_cpu(sd_busy, cpu);

if (!sd || sd->nohz_idle)
goto unlock;
sd->nohz_idle = 1;

- for (; sd; sd = sd->parent)
- atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+ atomic_dec(&sd->groups->sgp->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
@@ -6748,6 +6748,9 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
{
unsigned long now = jiffies;
struct sched_domain *sd;
+ struct sched_group *sg;
+ struct sched_group_power *sgp;
+ int nr_busy;

if (unlikely(idle_cpu(cpu)))
return 0;
@@ -6773,22 +6776,23 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
goto need_kick;

rcu_read_lock();
- for_each_domain(cpu, sd) {
- struct sched_group *sg = sd->groups;
- struct sched_group_power *sgp = sg->sgp;
- int nr_busy = atomic_read(&sgp->nr_busy_cpus);
+ sd = per_cpu(sd_busy, cpu);

- if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
- goto need_kick_unlock;
+ if (sd) {
+ sg = sd->groups;
+ sgp = sg->sgp;
+ nr_busy = atomic_read(&sgp->nr_busy_cpus);

- if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
- && (cpumask_first_and(nohz.idle_cpus_mask,
- sched_domain_span(sd)) < cpu))
+ if (nr_busy > 1)
goto need_kick_unlock;
-
- if (!(sd->flags & (SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)))
- break;
}
+
+ sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
+
+ if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
+ sched_domain_span(sd)) < cpu))
+ goto need_kick_unlock;
+
rcu_read_unlock();
return 0;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffc7087..0f1253f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -623,6 +623,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
+DECLARE_PER_CPU(struct sched_domain *, sd_busy);

struct sched_group_power {
atomic_t ref;

2013-10-23 15:28:29

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Hi Preeti

On 23 October 2013 11:50, Preeti U Murthy <[email protected]> wrote:
> Hi Peter
>
> On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
>> This nohz stuff really needs to be re-thought and made more scalable --
>> it's a royal pain :/
>
> Why not do something like the below instead? It does the following.
>
> This patch introduces sd_busy just like your suggested patch, except that
> it points to the parent of the highest level sched domain which has the
> SD_SHARE_PKG_RESOURCES set and initializes it in update_top_cache_domain().
> This is the sched domain that is relevant in nohz_kick_needed().
>
> set_cpu_sd_state_busy(), set_cpu_sd_state_idle() and nohz_kick_needed() query
> and update *only* this sched domain (sd_busy) for nr_busy_cpus. They are the
> only users of this parameter. While we are at it, we might as well change
> the nohz_idle parameter to be updated at the sd_busy domain level alone and
> not the base domain level of a CPU. This will unify the concept of busy cpus
> at just one level of sched domain.
>
> There is no need to iterate through all levels of sched domains of a cpu to
> update nr_busy_cpus since it is irrelevant at all other sched domains except
> at sd_busy level.
>
> De-couple asymmetric load balancing from the nr_busy parameter which the
> PATCH 2/3 anyway does. sd_busy therefore is irrelevant for asymmetric load
> balancing.
>
> Regards
> Preeti U Murthy
> --------------------START_PATCH-------------------------------
>
> sched: Fix nohz_kick_needed()
>
> ---
> kernel/sched/core.c | 4 ++++
> kernel/sched/fair.c | 40 ++++++++++++++++++++++------------------
> kernel/sched/sched.h | 1 +
> 3 files changed, 27 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..c1dd11c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>
> static void update_top_cache_domain(int cpu)
> {
> @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)->parent;

highest_flag_domain() can return a NULL pointer

> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
> }
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 813dd61..71e6f14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu)
> static inline void set_cpu_sd_state_busy(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = per_cpu(sd_busy, cpu);

Don't you need to use rcu_dereference when using sd_busy ?

>
> if (!sd || !sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 0;
>
> - for (; sd; sd = sd->parent)
> - atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> + atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> @@ -6532,16 +6532,16 @@ unlock:
> void set_cpu_sd_state_idle(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
>
> rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> + sd = per_cpu(sd_busy, cpu);
>
> if (!sd || sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 1;
>
> - for (; sd; sd = sd->parent)
> - atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> + atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> @@ -6748,6 +6748,9 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> {
> unsigned long now = jiffies;
> struct sched_domain *sd;
> + struct sched_group *sg;
> + struct sched_group_power *sgp;
> + int nr_busy;
>
> if (unlikely(idle_cpu(cpu)))
> return 0;
> @@ -6773,22 +6776,23 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> goto need_kick;
>
> rcu_read_lock();
> - for_each_domain(cpu, sd) {
> - struct sched_group *sg = sd->groups;
> - struct sched_group_power *sgp = sg->sgp;
> - int nr_busy = atomic_read(&sgp->nr_busy_cpus);
> + sd = per_cpu(sd_busy, cpu);
>
> - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> - goto need_kick_unlock;
> + if (sd) {
> + sg = sd->groups;

sg is not needed anymore

> + sgp = sg->sgp;
> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>
> - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
> - && (cpumask_first_and(nohz.idle_cpus_mask,
> - sched_domain_span(sd)) < cpu))
> + if (nr_busy > 1)
> goto need_kick_unlock;
> -
> - if (!(sd->flags & (SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)))
> - break;
> }
> +
> + sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> +
> + if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
> + sched_domain_span(sd)) < cpu))
> + goto need_kick_unlock;
> +
> rcu_read_unlock();
> return 0;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ffc7087..0f1253f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -623,6 +623,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> DECLARE_PER_CPU(struct sched_domain *, sd_numa);
> +DECLARE_PER_CPU(struct sched_domain *, sd_busy);
>
> struct sched_group_power {
> atomic_t ref;
>

2013-10-24 04:07:18

by Preeti U Murthy

Subject: Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

Hi Peter,

On 10/23/2013 03:53 AM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote:
>> kernel/sched/fair.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 828ed97..bbcd96b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>> {
>> int ld_moved, cur_ld_moved, active_balance = 0;
>> struct sched_group *group;
>> + struct sched_domain *child;
>> + int share_pkg_res = 0;
>> struct rq *busiest;
>> unsigned long flags;
>> struct cpumask *cpus = __get_cpu_var(load_balance_mask);
>> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>
>> schedstat_inc(sd, lb_count[idle]);
>>
>> + child = sd->child;
>> + if (child && child->flags & SD_SHARE_PKG_RESOURCES)
>> + share_pkg_res = 1;
>> +
>> redo:
>> if (!should_we_balance(&env)) {
>> *continue_balancing = 0;
>> @@ -5202,6 +5208,7 @@ redo:
>> goto out_balanced;
>> }
>>
>> +redo_grp:
>> busiest = find_busiest_queue(&env, group);
>> if (!busiest) {
>> schedstat_inc(sd, lb_nobusyq[idle]);
>> @@ -5292,6 +5299,11 @@ more_balance:
>> if (!cpumask_empty(cpus)) {
>> env.loop = 0;
>> env.loop_break = sched_nr_migrate_break;
>> + if (share_pkg_res &&
>> + cpumask_intersects(cpus,
>> + to_cpumask(group->cpumask)))
>
> sched_group_cpus()
>
>> + goto redo_grp;
>> +
>> goto redo;
>> }
>> goto out_balanced;
>> @@ -5318,9 +5330,15 @@ more_balance:
>> */
>> if (!cpumask_test_cpu(this_cpu,
>> tsk_cpus_allowed(busiest->curr))) {
>> + cpumask_clear_cpu(cpu_of(busiest), cpus);
>> raw_spin_unlock_irqrestore(&busiest->lock,
>> flags);
>> env.flags |= LBF_ALL_PINNED;
>> + if (share_pkg_res &&
>> + cpumask_intersects(cpus,
>> + to_cpumask(group->cpumask)))
>> + goto redo_grp;
>> +
>> goto out_one_pinned;
>> }
>
> Man this retry logic is getting annoying.. isn't there anything saner we
> can do?

Let me give this a thought and get back.

Regards
Preeti U Murthy
>

2013-10-24 08:10:48

by Preeti U Murthy

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Hi Vincent,

I have addressed your comments and below is the fresh patch. This patch
applies on top of PATCH 2/3 posted in this thread.

Regards
Preeti U Murthy


sched: Remove unnecessary iterations over sched domains to update/query nr_busy_cpus

From: Preeti U Murthy <[email protected]>

The nr_busy_cpus parameter is used by nohz_kick_needed() to find out the
number of busy cpus in a sched domain which has the SD_SHARE_PKG_RESOURCES
flag set. Therefore, instead of updating nr_busy_cpus at every level of
sched domain, where it is irrelevant, we can update this parameter only at
the parent domain of the highest sd which has this flag set. Introduce a
per-cpu pointer sd_busy which represents this parent domain.

In nohz_kick_needed() we directly query the nr_busy_cpus parameter
associated with the groups of sd_busy.

By associating sd_busy with the highest domain which has
SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could
have this flag set and trigger nohz_idle_balancing if any of the levels have
more than one busy cpu.

sd_busy is irrelevant for asymmetric load balancing.

While we are at it, we might as well change the nohz_idle parameter to be
updated at the sd_busy domain level alone and not the base domain level of a CPU.
This will unify the concept of busy cpus at just one level of sched domain
where it is currently used.

Signed-off-by: Preeti U Murthy <[email protected]>
---
kernel/sched/core.c | 5 +++++
kernel/sched/fair.c | 38 ++++++++++++++++++++------------------
kernel/sched/sched.h | 1 +
3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c06b8d3..c540392 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
+DEFINE_PER_CPU(struct sched_domain *, sd_busy);

static void update_top_cache_domain(int cpu)
{
@@ -5290,6 +5291,10 @@ static void update_top_cache_domain(int cpu)

sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
+
+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ if (sd)
+ rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9c9549..f66cfd9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu)
static inline void set_cpu_sd_state_busy(void)
{
struct sched_domain *sd;
+ int cpu = smp_processor_id();

rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+ sd = rcu_dereference(per_cpu(sd_busy, cpu));

if (!sd || !sd->nohz_idle)
goto unlock;
sd->nohz_idle = 0;

- for (; sd; sd = sd->parent)
- atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+ atomic_inc(&sd->groups->sgp->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
@@ -6532,16 +6532,16 @@ unlock:
void set_cpu_sd_state_idle(void)
{
struct sched_domain *sd;
+ int cpu = smp_processor_id();

rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+ sd = rcu_dereference(per_cpu(sd_busy, cpu));

if (!sd || sd->nohz_idle)
goto unlock;
sd->nohz_idle = 1;

- for (; sd; sd = sd->parent)
- atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+ atomic_dec(&sd->groups->sgp->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
@@ -6748,6 +6748,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
{
unsigned long now = jiffies;
struct sched_domain *sd;
+ struct sched_group_power *sgp;
+ int nr_busy;

if (unlikely(idle_cpu(cpu)))
return 0;
@@ -6773,22 +6775,22 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
goto need_kick;

rcu_read_lock();
- for_each_domain(cpu, sd) {
- struct sched_group *sg = sd->groups;
- struct sched_group_power *sgp = sg->sgp;
- int nr_busy = atomic_read(&sgp->nr_busy_cpus);
+ sd = rcu_dereference(per_cpu(sd_busy, cpu));

- if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
- goto need_kick_unlock;
+ if (sd) {
+ sgp = sd->groups->sgp;
+ nr_busy = atomic_read(&sgp->nr_busy_cpus);

- if (sd->flags & SD_ASYM_PACKING
- && (cpumask_first_and(nohz.idle_cpus_mask,
- sched_domain_span(sd)) < cpu))
+ if (nr_busy > 1)
goto need_kick_unlock;
-
- if (!(sd->flags & (SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)))
- break;
}
+
+ sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
+
+ if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
+ sched_domain_span(sd)) < cpu))
+ goto need_kick_unlock;
+
rcu_read_unlock();
return 0;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffc7087..0f1253f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -623,6 +623,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
+DECLARE_PER_CPU(struct sched_domain *, sd_busy);

struct sched_group_power {
atomic_t ref;

2013-10-25 13:22:43

by Preeti U Murthy

Subject: Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

Hi Peter,

On 10/23/2013 03:53 AM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote:
>> kernel/sched/fair.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 828ed97..bbcd96b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>> {
>> int ld_moved, cur_ld_moved, active_balance = 0;
>> struct sched_group *group;
>> + struct sched_domain *child;
>> + int share_pkg_res = 0;
>> struct rq *busiest;
>> unsigned long flags;
>> struct cpumask *cpus = __get_cpu_var(load_balance_mask);
>> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>
>> schedstat_inc(sd, lb_count[idle]);
>>
>> + child = sd->child;
>> + if (child && child->flags & SD_SHARE_PKG_RESOURCES)
>> + share_pkg_res = 1;
>> +
>> redo:
>> if (!should_we_balance(&env)) {
>> *continue_balancing = 0;
>> @@ -5202,6 +5208,7 @@ redo:
>> goto out_balanced;
>> }
>>
>> +redo_grp:
>> busiest = find_busiest_queue(&env, group);
>> if (!busiest) {
>> schedstat_inc(sd, lb_nobusyq[idle]);
>> @@ -5292,6 +5299,11 @@ more_balance:
>> if (!cpumask_empty(cpus)) {
>> env.loop = 0;
>> env.loop_break = sched_nr_migrate_break;
>> + if (share_pkg_res &&
>> + cpumask_intersects(cpus,
>> + to_cpumask(group->cpumask)))
>
> sched_group_cpus()
>
>> + goto redo_grp;
>> +
>> goto redo;
>> }
>> goto out_balanced;
>> @@ -5318,9 +5330,15 @@ more_balance:
>> */
>> if (!cpumask_test_cpu(this_cpu,
>> tsk_cpus_allowed(busiest->curr))) {
>> + cpumask_clear_cpu(cpu_of(busiest), cpus);
>> raw_spin_unlock_irqrestore(&busiest->lock,
>> flags);
>> env.flags |= LBF_ALL_PINNED;
>> + if (share_pkg_res &&
>> + cpumask_intersects(cpus,
>> + to_cpumask(group->cpumask)))
>> + goto redo_grp;
>> +
>> goto out_one_pinned;
>> }
>
> Man this retry logic is getting annoying.. isn't there anything saner we
> can do?

Maybe we can do this just at the SIBLINGS level? Having the hyper
threads busy due to the scenario described in the changelog is bad for
performance.

Regards
Preeti U Murthy

2013-10-28 13:50:59

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

On Thu, Oct 24, 2013 at 01:37:38PM +0530, Preeti U Murthy wrote:
> kernel/sched/core.c | 5 +++++
> kernel/sched/fair.c | 38 ++++++++++++++++++++------------------
> kernel/sched/sched.h | 1 +
> 3 files changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..c540392 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>
> static void update_top_cache_domain(int cpu)
> {
> @@ -5290,6 +5291,10 @@ static void update_top_cache_domain(int cpu)
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> + if (sd)
> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
> }
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e9c9549..f66cfd9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu)
> static inline void set_cpu_sd_state_busy(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
>
> rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_busy, cpu));
>
> if (!sd || !sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 0;
>
> + atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> @@ -6532,16 +6532,16 @@ unlock:
> void set_cpu_sd_state_idle(void)
> {
> struct sched_domain *sd;
> + int cpu = smp_processor_id();
>
> rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_busy, cpu));
>
> if (!sd || sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 1;
>
> + atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }

Oh nice, that gets rid of the multiple atomics, and it nicely splits
this nohz logic into per topology groups -- now if only we could split
the rest too :-)

> @@ -6748,6 +6748,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> {
> unsigned long now = jiffies;
> struct sched_domain *sd;
> + struct sched_group_power *sgp;
> + int nr_busy;
>
> if (unlikely(idle_cpu(cpu)))
> return 0;
> @@ -6773,22 +6775,22 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> goto need_kick;
>
> rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_busy, cpu));
>
> + if (sd) {
> + sgp = sd->groups->sgp;
> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>
> + if (nr_busy > 1)
> goto need_kick_unlock;
> }

OK, so far so good.

> +
> + sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> +
> + if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
> + sched_domain_span(sd)) < cpu))
> + goto need_kick_unlock;
> +
> rcu_read_unlock();
> return 0;

This again is a bit sad; most archs will not have SD_ASYM_PACKING set at
all; this means that they all will do a complete (and pointless) sched
domain tree walk here.

It would be much better to also introduce sd_asym and do the analogous
thing to the new sd_busy.
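
Something along these lines, presumably (a rough sketch of the
suggestion, not the actual follow-up patch):

	/* kernel/sched/core.c: cache the SD_ASYM_PACKING domain per cpu */
	DEFINE_PER_CPU(struct sched_domain *, sd_asym);

	/* in update_top_cache_domain(), next to the sd_busy assignment */
	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);

	/*
	 * nohz_kick_needed() then dereferences the cached pointer under
	 * RCU instead of walking the domain tree.
	 */
	sd = rcu_dereference(per_cpu(sd_asym, cpu));
	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
				     sched_domain_span(sd)) < cpu))
		goto need_kick_unlock;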

2013-10-28 15:53:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote:
> From: Preeti U Murthy <[email protected]>
>
> The current logic in load balance is such that after picking the
> busiest group, the load is attempted to be moved from the busiest cpu
> in that group to the dst_cpu. If the load cannot be moved from the
> busiest cpu to dst_cpu due to either tsk_cpus_allowed mask or cache
> hot tasks, then the dst_cpu is changed to be another idle cpu within
> the dst->grpmask. If even then, the load cannot be moved from the
> busiest cpu, then the source group is changed. The next busiest group
> is found and the above steps are repeated.
>
> However if the cpus in the group share package resources, then when
> a load movement from the busiest cpu in this group fails as above,
> instead of finding the next busiest group to move load from, find the
> next busiest cpu *within the same group* from which to move load away.
> By doing so, a conscious effort is made during load balancing to keep
> just one cpu busy as much as possible within domains that have
> SHARED_PKG_RESOURCES flag set unless under scenarios of high load.
> Having multiple cpus busy within a domain which share package resource
> could lead to a performance hit.
>
> A similar scenario arises in active load balancing as well. When the
> current task on the busiest cpu cannot be moved away due to task
> pinning, currently no further attempts at load balancing are made.

> This
> patch checks if the balancing is being done on a group whose cpus
> share package resources. If so, then check if the load balancing can
> be done for other cpus in the same group.

So I absolutely hate this patch... Also I'm not convinced I actually
understand the explanation above.

Furthermore; there is nothing special about spreading tasks for
SHARED_PKG_RESOURCES and special casing that one case is just wrong.

If anything it should be keyed off of SD_PREFER_SIBLING and or
cpu_power.

2013-10-29 03:34:14

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Hi Peter,

On 10/28/2013 07:20 PM, Peter Zijlstra wrote:
> On Thu, Oct 24, 2013 at 01:37:38PM +0530, Preeti U Murthy wrote:
>> kernel/sched/core.c | 5 +++++
>> kernel/sched/fair.c | 38 ++++++++++++++++++++------------------
>> kernel/sched/sched.h | 1 +
>> 3 files changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c06b8d3..c540392 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
>> DEFINE_PER_CPU(int, sd_llc_size);
>> DEFINE_PER_CPU(int, sd_llc_id);
>> DEFINE_PER_CPU(struct sched_domain *, sd_numa);
>> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>>
>> static void update_top_cache_domain(int cpu)
>> {
>> @@ -5290,6 +5291,10 @@ static void update_top_cache_domain(int cpu)
>>
>> sd = lowest_flag_domain(cpu, SD_NUMA);
>> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>> +
>> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>> + if (sd)
>> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
>> }
>>
>> /*
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e9c9549..f66cfd9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu)
>> static inline void set_cpu_sd_state_busy(void)
>> {
>> struct sched_domain *sd;
>> + int cpu = smp_processor_id();
>>
>> rcu_read_lock();
>> + sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>
>> if (!sd || !sd->nohz_idle)
>> goto unlock;
>> sd->nohz_idle = 0;
>>
>> + atomic_inc(&sd->groups->sgp->nr_busy_cpus);
>> unlock:
>> rcu_read_unlock();
>> }
>> @@ -6532,16 +6532,16 @@ unlock:
>> void set_cpu_sd_state_idle(void)
>> {
>> struct sched_domain *sd;
>> + int cpu = smp_processor_id();
>>
>> rcu_read_lock();
>> + sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>
>> if (!sd || sd->nohz_idle)
>> goto unlock;
>> sd->nohz_idle = 1;
>>
>> + atomic_dec(&sd->groups->sgp->nr_busy_cpus);
>> unlock:
>> rcu_read_unlock();
>> }
>
> Oh nice, that gets rid of the multiple atomics, and it nicely splits
> this nohz logic into per topology groups -- now if only we could split
> the rest too :-)

I am sorry, I don't get you here. By the 'rest', do you refer to
nohz_kick_needed() as below? Or am I missing something?

>
>> @@ -6748,6 +6748,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>> {
>> unsigned long now = jiffies;
>> struct sched_domain *sd;
>> + struct sched_group_power *sgp;
>> + int nr_busy;
>>
>> if (unlikely(idle_cpu(cpu)))
>> return 0;
>> @@ -6773,22 +6775,22 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>> goto need_kick;
>>
>> rcu_read_lock();
>> + sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>
>> + if (sd) {
>> + sgp = sd->groups->sgp;
>> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
>>
>> + if (nr_busy > 1)
>> goto need_kick_unlock;
>> }
>
> OK, so far so good.
>
>> +
>> + sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>> +
>> + if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
>> + sched_domain_span(sd)) < cpu))
>> + goto need_kick_unlock;
>> +
>> rcu_read_unlock();
>> return 0;
>
> This again is a bit sad; most archs will not have SD_ASYM_PACKING set at
> all; this means that they all will do a complete (and pointless) sched
> domain tree walk here.

There will not be a 'complete' sched domain tree walk, right? The
iteration will break at the first sched domain level for those archs
which do not have SD_ASYM_PACKING set at all.

But it is true that doing a sched domain tree walk regularly is a bad
idea; we might as well record the domain with the SD_ASYM_PACKING flag
set once and query it when required.

I will send out the patch with an sd_asym domain introduced rather than
the above.

Thanks

Regards
Preeti U Murthy

>
> It would be much better to also introduce sd_asym and do the analogous
> thing to the new sd_busy.
>

2013-10-29 05:38:36

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

Hi Peter,

On 10/28/2013 09:23 PM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote:
>> From: Preeti U Murthy <[email protected]>
>>
>> The current logic in load balance is such that after picking the
>> busiest group, the load is attempted to be moved from the busiest cpu
>> in that group to the dst_cpu. If the load cannot be moved from the
>> busiest cpu to dst_cpu due to either tsk_cpus_allowed mask or cache
>> hot tasks, then the dst_cpu is changed to be another idle cpu within
>> the dst->grpmask. If even then, the load cannot be moved from the
>> busiest cpu, then the source group is changed. The next busiest group
>> is found and the above steps are repeated.
>>
>> However if the cpus in the group share package resources, then when
>> a load movement from the busiest cpu in this group fails as above,
>> instead of finding the next busiest group to move load from, find the
>> next busiest cpu *within the same group* from which to move load away.
>> By doing so, a conscious effort is made during load balancing to keep
>> just one cpu busy as much as possible within domains that have
>> SHARED_PKG_RESOURCES flag set unless under scenarios of high load.
>> Having multiple cpus busy within a domain which share package resource
>> could lead to a performance hit.
>>
>> A similar scenario arises in active load balancing as well. When the
>> current task on the busiest cpu cannot be moved away due to task
>> pinning, currently no further attempts at load balancing are made.
>
>> This
>> patch checks if the balancing is being done on a group whose cpus
>> share package resources. If so, then check if the load balancing can
>> be done for other cpus in the same group.
>
> So I absolutely hate this patch... Also I'm not convinced I actually
> understand the explanation above.
>
> Furthermore; there is nothing special about spreading tasks for
> SHARED_PKG_RESOURCES and special casing that one case is just wrong.
>
> If anything it should be keyed off of SD_PREFER_SIBLING and or
> cpu_power.

At a SIBLING level, which has SHARED_PKG_RESOURCES set, cpu_power in
fact takes care of ensuring that the scheduler mostly spreads the load
when there is more than one running task by nominating the group as
busy. But the issue that this patch is bringing to the front is a bit
different; it's not during the time of this nomination, it's at the time
of load balancing. It is explained below.

So metrics like cpu_power and flags like SD_PREFER_SIBLING ensure that
we spread the load by nominating such groups as busiest in
update_sg_lb_stats() and update_sd_lb_stats(). So "nominating a group"
as busiest by virtue of cpu_power or flags is taken care of.
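
For reference, the nomination happens roughly like this in
update_sd_lb_stats() (paraphrased from memory, so treat it as a sketch
rather than an exact quote): when the child domain sets
SD_PREFER_SIBLING and the local group still has spare capacity, the
group's capacity is clamped to one task so that any excess gets marked
for migration:

	/* sketch of the prefer_sibling clamp in update_sd_lb_stats() */
	if (prefer_sibling && sds->local &&
	    sds->local_stat.group_has_capacity)
		sgs->group_capacity = min(sgs->group_capacity, 1U);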

However, in load_balance(), if the imbalance cannot be offset by moving
load from the busiest_cpu in the busiest_group, then today we do not try
the *next busiest cpu in the group*; instead we try the next busiest_group.

So whatever effort we put in by nominating this group as busiest, when
its grp_power and flags do not favour keeping tasks on it, pays off
only if the busiest cpu in that group co-operates in offloading tasks.
Should we not be trying our best to move load from any other cpu in
this group?

This patch identifies one such situation, which led to too many tasks
on a core and got me to ponder over this question. I agree that the fix
in this patch is not right, but I thought it would open up discussion
around the above question. It's true that iterating over all the cpus
in a group during the actual load balance is too much of an overhead,
but isn't there a balance we can strike during load balance iterations
for such groups with limited cpu power?
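
One possible shape for that middle ground, keyed off SD_PREFER_SIBLING
as you suggest rather than SD_SHARE_PKG_RESOURCES (a sketch only,
untested, reusing the redo_grp label from the posted patch):

	/*
	 * Busiest cpu is all-pinned: drop it from the candidate mask and
	 * retry within the same group, but only for groups whose child
	 * domain prefers its load to be spread among siblings.
	 */
	cpumask_clear_cpu(cpu_of(busiest), cpus);
	if (sd->child && (sd->child->flags & SD_PREFER_SIBLING) &&
	    cpumask_intersects(cpus, sched_group_cpus(group)))
		goto redo_grp;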
>
Thanks

Regards
Preeti U Murthy

2013-10-29 13:26:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

On Tue, Oct 29, 2013 at 09:00:52AM +0530, Preeti U Murthy wrote:
> > Oh nice, that gets rid of the multiple atomics, and it nicely splits
> > this nohz logic into per topology groups -- now if only we could split
> > the rest too :-)
>
> I am sorry, I don't get you here. By the 'rest', do you refer to
> nohz_kick_needed() as below? Or am I missing something?

Nah, the rest of the NOHZ infrastructure. Currently it's global state;
there were some patches a few years ago that attempted to make that
per-node state, but that work stalled due to people switching jobs.


> >> + sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> >> +
> >> + if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
> >> + sched_domain_span(sd)) < cpu))
> >> + goto need_kick_unlock;
> >> +
> >> rcu_read_unlock();
> >> return 0;
> >
> > This again is a bit sad; most archs will not have SD_ASYM_PACKING set at
> > all; this means that they all will do a complete (and pointless) sched
> > domain tree walk here.
>
> There will not be a 'complete' sched domain tree walk right? The
> iteration will break at the first level of the sched domain for those
> archs which do not have SD_ASYM_PACKING set at all.

Ah indeed; I think I got confused because I had earlier modified
highest_flag_domain() to assume a flag is carried up from the lowest
domain.
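
For reference, highest_flag_domain() (as in kernel/sched/sched.h;
quoted from memory, so treat any drift as mine) walks up from the
lowest level and stops at the first domain that lacks the flag,
returning the highest level seen so far with it set:

	static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
	{
		struct sched_domain *sd, *hsd = NULL;

		for_each_domain(cpu, sd) {
			if (!(sd->flags & flag))
				break;
			hsd = sd;
		}

		return hsd;
	}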

> But it is true that doing a sched domain tree walk regularly is a bad
> idea, might as well update the domain with SD_ASYM_PACKING flag set once
> and query this domain when required.
>
> I will send out the patch with sd_asym domain introduced rather than the
> above.

Thanks