2012-06-06 13:08:01

by Prashanth Nageshappa

Subject: [PATCH v2] sched: balance_cpu to consider other cpus in its group as target of (pinned) task

From: Srivatsa Vaddagiri <[email protected]>

Current load balance scheme lets one cpu in a sched_group (balance_cpu)
look at other peer sched_groups for imbalance and pull tasks to
itself from a busy cpu. Tasks thus pulled to balance_cpu will later get
picked up by cpus that are in the same sched_group as that of balance_cpu.
This scheme fails to pull tasks that are not allowed to run on
balance_cpu (but are allowed to run on other cpus in its sched_group).

This can affect fairness and in some worst case scenarios cause
starvation, as illustrated below. Consider a two core (2 threads/core)
system running tasks as below:

     Core                  Core

     C0 - F0               C2 - F1
     C1 - T1               C3 - idle

F0 & F1 are SCHED_FIFO cpu hogs pinned to C0 & C2 respectively, while T1 is
a SCHED_OTHER task pinned to C1. Another SCHED_OTHER task T2 (which can
run on cpus 1,2) now wakes up and lands on its prev_cpu of C2, which is
now running a SCHED_FIFO cpu hog. To prevent starvation, T2 needs to move to
C1. However, between C0 & C1, C0 is chosen to balance its core against peer
cores and thus fails to pull T2 towards its core (C0 not being in T2's
affinity mask). T2 was found to starve eternally in this case.
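
For reference, the scenario can be approximated from user space with something
like the below (illustrative sketch only, not part of the patch; it assumes the
cpu numbering above, i.e. cpus 0,1 are one core's siblings and cpus 2,3 the
other's, and needs root for SCHED_FIFO; where T2 actually wakes up is of course
up to the scheduler):

/* starve.c - illustrative sketch only */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static void pin(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
}

static void hog(int cpu, int fifo)
{
        if (fork())
                return;                 /* parent continues */

        pin(cpu);
        if (fifo) {
                struct sched_param sp = { .sched_priority = 1 };

                sched_setscheduler(0, SCHED_FIFO, &sp);
        }
        for (;;)
                ;                       /* burn cpu forever */
}

int main(void)
{
        cpu_set_t mask;

        hog(0, 1);      /* F0: SCHED_FIFO hog pinned to cpu 0 */
        hog(2, 1);      /* F1: SCHED_FIFO hog pinned to cpu 2 */
        hog(1, 0);      /* T1: SCHED_OTHER hog pinned to cpu 1 */

        /* T2: SCHED_OTHER, allowed only on cpus 1 and 2 */
        CPU_ZERO(&mask);
        CPU_SET(1, &mask);
        CPU_SET(2, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
        for (;;)
                ;                       /* starves if stuck behind F1 on cpu 2 */
}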

Although the problem is illustrated in presence of rt tasks, this is a
general problem that can manifest in presence of non-rt tasks as well.

Some solutions that were considered to solve this problem were:

- Have the right sibling cpus do load balance, ignoring balance_cpu

- Modify move_tasks to move a pinned task to a sibling cpu in the
  same sched_group as env->dst_cpu. This will involve some runqueue
  lock juggling (a third runqueue lock needs to be taken when we
  already have two locks held). Moreover we may be just fine to ignore
  that particular task and meet load balance goals by moving other
  tasks.

- Hint that move_tasks should be called with a different env->dst_cpu

This patch implements the 3rd of the above approaches, which seemed the least
invasive. Essentially, can_migrate_task() records whether any task(s) could not
be moved because the destination cpu was not in their cpus_allowed mask, along
with a new destination cpu (in the same group) that such a task can be moved
to. We then reissue a call to move_tasks with that new destination cpu,
provided we failed to meet the load balance goal by moving other tasks from
env->src_cpu.

Changes since v1 (https://lkml.org/lkml/2012/6/4/52):

- updated change log to describe the problem in a more generic sense and the
  different solutions considered
- used cur_ld_moved instead of old_ld_moved
- modified comments in the code
- reset env.loop_break before retrying


Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Prashanth Nageshappa <[email protected]>

----

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 939fd63..21a59fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED  0x01
 #define LBF_NEED_BREAK  0x02
+#define LBF_NEW_DST_CPU 0x04
 
 struct lb_env {
         struct sched_domain     *sd;
@@ -3108,6 +3109,8 @@ struct lb_env {
         int                     dst_cpu;
         struct rq               *dst_rq;
 
+        struct cpumask          *dst_grpmask;
+        int                     new_dst_cpu;
         enum cpu_idle_type      idle;
         long                    imbalance;
         unsigned int            flags;
@@ -3198,7 +3201,26 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
          * 3) are cache-hot on their current CPU.
          */
         if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
-                schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+                int new_dst_cpu;
+
+                if (!env->dst_grpmask) {
+                        schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+                        return 0;
+                }
+
+                /*
+                 * remember if this task can be moved to any other cpus in our
+                 * sched_group so that we can retry load balance and move
+                 * that task to a new_dst_cpu if required.
+                 */
+                new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+                                                tsk_cpus_allowed(p));
+                if (new_dst_cpu >= nr_cpu_ids) {
+                        schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+                } else {
+                        env->flags |= LBF_NEW_DST_CPU;
+                        env->new_dst_cpu = new_dst_cpu;
+                }
                 return 0;
         }
         env->flags &= ~LBF_ALL_PINNED;
@@ -4440,7 +4462,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                         struct sched_domain *sd, enum cpu_idle_type idle,
                         int *balance)
 {
-        int ld_moved, active_balance = 0;
+        int ld_moved, cur_ld_moved, active_balance = 0;
         struct sched_group *group;
         struct rq *busiest;
         unsigned long flags;
@@ -4450,6 +4472,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                 .sd             = sd,
                 .dst_cpu        = this_cpu,
                 .dst_rq         = this_rq,
+                .dst_grpmask    = sched_group_cpus(sd->groups),
                 .idle           = idle,
                 .loop_break     = sched_nr_migrate_break,
                 .find_busiest_queue = find_busiest_queue,
@@ -4502,7 +4525,8 @@ more_balance:
                 double_rq_lock(this_rq, busiest);
                 if (!env.loop)
                         update_h_load(env.src_cpu);
-                ld_moved += move_tasks(&env);
+                cur_ld_moved = move_tasks(&env);
+                ld_moved += cur_ld_moved;
                 double_rq_unlock(this_rq, busiest);
                 local_irq_restore(flags);
 
@@ -4514,8 +4538,23 @@ more_balance:
                 /*
                  * some other cpu did the load balance for us.
                  */
-                if (ld_moved && this_cpu != smp_processor_id())
-                        resched_cpu(this_cpu);
+                if (cur_ld_moved && env.dst_cpu != smp_processor_id())
+                        resched_cpu(env.dst_cpu);
+
+                if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {
+                        /*
+                         * we could not balance completely as some tasks
+                         * were not allowed to move to the dst_cpu, so try
+                         * again with new_dst_cpu.
+                         */
+                        this_rq = cpu_rq(env.new_dst_cpu);
+                        env.dst_rq = this_rq;
+                        env.dst_cpu = env.new_dst_cpu;
+                        env.flags &= ~LBF_NEW_DST_CPU;
+                        env.loop = 0;
+                        env.loop_break = sched_nr_migrate_break;
+                        goto more_balance;
+                }
 
                 /* All tasks on this runqueue were pinned by CPU affinity */
                 if (unlikely(env.flags & LBF_ALL_PINNED)) {
@@ -4716,6 +4755,7 @@ static int active_load_balance_cpu_stop(void *data)
                 .sd             = sd,
                 .dst_cpu        = target_cpu,
                 .dst_rq         = target_rq,
+                .dst_grpmask    = NULL,
                 .src_cpu        = busiest_rq->cpu,
                 .src_rq         = busiest_rq,
                 .idle           = CPU_IDLE,


2012-06-13 12:31:54

by Peter Zijlstra

Subject: Re: [PATCH v2] sched: balance_cpu to consider other cpus in its group as target of (pinned) task

On Wed, 2012-06-06 at 18:36 +0530, Prashanth Nageshappa wrote:
> From: Srivatsa Vaddagiri <[email protected]>
>
> Current load balance scheme lets one cpu in a sched_group (balance_cpu)
> look at other peer sched_groups for imbalance and pull tasks to
> itself from a busy cpu. Tasks thus pulled to balance_cpu will later get
> picked up by cpus that are in the same sched_group as that of balance_cpu.

Not strictly true.. but close enough I think.

> This scheme fails to pull tasks that are not allowed to run on
> balance_cpu (but are allowed to run on other cpus in its sched_group).
>
> This can affect fairness and in some worst case scenarios cause
> starvation, as illustrated below. Consider a two core (2 threads/core)
> system running tasks as below:
>
> Core Core
>
> C0 - F0 C2 - F1
> C1 - T1 C3 - idle
>
> F0 & F1 are SCHED_FIFO cpu hogs pinned to C0 & C2 respectively, while T1 is
> a SCHED_OTHER task pinned to C1. Another SCHED_OTHER task T2 (which can
> run on cpus 1,2) now wakes up and lands on its prev_cpu of C2, which is
> now running SCHED_FIFO cpu hog. To prevent starvation, T2 needs to move to C1.
> However between C0 & C1, C0 is chosen to balance its core with peer cores and
> thus fails to pull T2 towards its core (C0 not being in T2's affinity mask). T2 was
> found to starve eternally in this case.
>
> Although the problem is illustrated in presence of rt tasks, this is a
> general problem that can manifest in presence of non-rt tasks as well.

If you can come up with a relatively simple example without rt tasks
that would be preferred. Also I think the above description could be
improved.

The graph only confuses me and I'm not sure the description is maximally
concise either.

> Some solutions that were considered to solve this problem were:
>
> - Have the right sibling cpus to do load balance ignoring balance_cpu
>
> - Modify move_tasks to move a pinned tasks to a sibling cpu in the
> same sched_group as env->dst_cpu. This will involve some runqueue
> lock juggling (a third runqueue locks needs to be taken when we
> already have two locks held). Moreover we may be just fine to ignore
> that particular task and meet load balance goals by moving other
> tasks.
>
> - Hint that move_tasks should be called with a different env->dst_cpu
>
> This patch implements the 3rd of the above approach, which seemed least
> invasive. Essentially can_migrate_task() records if any task(s) were not moved
> as the destination cpu was not in the cpus_allowed mask of the target task(s)
> and the new destination cpu that task can be moved to. We reissue a call
> to move_tasks with that new destination cpu, provided we failed to meet
> load balance goal by moving other tasks from env->src_cpu.

Still no word on why doing a 3-way pull is ok.. (I think it is, I just
want you to convince me)..

> Signed-off-by: Srivatsa Vaddagiri <[email protected]>
> Signed-off-by: Prashanth Nageshappa <[email protected]>
>
> ----
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 939fd63..21a59fc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>
> #define LBF_ALL_PINNED 0x01
> #define LBF_NEED_BREAK 0x02
> +#define LBF_NEW_DST_CPU 0x04

I still don't really like the new_dst_cpu name, I think it is because it
describes a solution detail, not a problem diagnosis.

> @@ -3198,7 +3201,26 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * 3) are cache-hot on their current CPU.
> */
> if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
> - schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + int new_dst_cpu;
> +
> + if (!env->dst_grpmask) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + return 0;
> + }
> +
> + /*
> + * remember if this task can be moved to any other cpus in our
> + * sched_group so that we can retry load balance and move
> + * that task to a new_dst_cpu if required.
> + */
> + new_dst_cpu = cpumask_first_and(env->dst_grpmask,
> + tsk_cpus_allowed(p));
> + if (new_dst_cpu >= nr_cpu_ids) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + } else {
> + env->flags |= LBF_NEW_DST_CPU;
> + env->new_dst_cpu = new_dst_cpu;
> + }
> return 0;
> }

This wants a comment describing why we're doing this.. this also wants a
comment describing why we do what we do about multiple new_dst.. I think
we should do the cheap thing and not compute a new dst if we already
have one.
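
Something like the below perhaps (untested sketch, keeping your LBF_NEW_DST_CPU
name; note it also makes the schedstat increment unconditional):

        if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
                int new_dst_cpu;

                schedstat_inc(p, se.statistics.nr_failed_migrations_affine);

                /*
                 * Don't scan the (possibly large) group mask again if an
                 * earlier task already gave us a new dst cpu.
                 */
                if (!env->dst_grpmask || (env->flags & LBF_NEW_DST_CPU))
                        return 0;

                new_dst_cpu = cpumask_first_and(env->dst_grpmask,
                                                tsk_cpus_allowed(p));
                if (new_dst_cpu < nr_cpu_ids) {
                        env->flags |= LBF_NEW_DST_CPU;
                        env->new_dst_cpu = new_dst_cpu;
                }
                return 0;
        }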

For bonus points you'll also add a comment for the all pinned muck,
although that's somewhat more obvious.

> env->flags &= ~LBF_ALL_PINNED;
> @@ -4440,7 +4462,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct sched_domain *sd, enum cpu_idle_type idle,
> int *balance)
> {
> - int ld_moved, active_balance = 0;
> + int ld_moved, cur_ld_moved, active_balance = 0;

better than the old_ld_moved, but still not really obvious.. maybe we
can cure this with a comment...

> struct sched_group *group;
> struct rq *busiest;
> unsigned long flags;
> @@ -4450,6 +4472,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> .sd = sd,
> .dst_cpu = this_cpu,
> .dst_rq = this_rq,
> + .dst_grpmask = sched_group_cpus(sd->groups),
> .idle = idle,
> .loop_break = sched_nr_migrate_break,
> .find_busiest_queue = find_busiest_queue,
> @@ -4502,7 +4525,8 @@ more_balance:
> double_rq_lock(this_rq, busiest);
> if (!env.loop)
> update_h_load(env.src_cpu);
> - ld_moved += move_tasks(&env);
> + cur_ld_moved = move_tasks(&env);
> + ld_moved += cur_ld_moved;
> double_rq_unlock(this_rq, busiest);
> local_irq_restore(flags);
>
> @@ -4514,8 +4538,23 @@ more_balance:
> /*
> * some other cpu did the load balance for us.

so add something here describing the cur_ld_moved vs ld_moved thing..

> */

> + if (cur_ld_moved && env.dst_cpu != smp_processor_id())
> + resched_cpu(env.dst_cpu);
> +
> + if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {

Is this different from the last version? ISTR only doing the loop again
when we didn't move anything.

> + /*
> + * we could not balance completely as some tasks
> + * were not allowed to move to the dst_cpu, so try
> + * again with new_dst_cpu.
> + */
> + this_rq = cpu_rq(env.new_dst_cpu);
> + env.dst_rq = this_rq;
> + env.dst_cpu = env.new_dst_cpu;
> + env.flags &= ~LBF_NEW_DST_CPU;
> + env.loop = 0;
> + env.loop_break = sched_nr_migrate_break;
> + goto more_balance;

What guarantees there's an end to this iteration? Also we explicitly use
more_balance over redo so that we're sure to iterate the same src cpu
because otherwise our new dst_cpu is meaningless? Sounds like something
that would be good to mention in .. wait for it .. a comment?

> + }
>
> /* All tasks on this runqueue were pinned by CPU affinity */
> if (unlikely(env.flags & LBF_ALL_PINNED)) {
> @@ -4716,6 +4755,7 @@ static int active_load_balance_cpu_stop(void *data)
> .sd = sd,
> .dst_cpu = target_cpu,
> .dst_rq = target_rq,
> + .dst_grpmask = NULL,

This is superfluous, c99 initialization zeros all unmentioned members.

> .src_cpu = busiest_rq->cpu,
> .src_rq = busiest_rq,
> .idle = CPU_IDLE,
>

2012-06-13 14:56:01

by Srivatsa Vaddagiri

Subject: Re: [PATCH v2] sched: balance_cpu to consider other cpus in its group as target of (pinned) task

* Peter Zijlstra <[email protected]> [2012-06-13 14:31:36]:

> > Current load balance scheme lets one cpu in a sched_group (balance_cpu)
> > look at other peer sched_groups for imbalance and pull tasks to
> > itself from a busy cpu. Tasks thus pulled to balance_cpu will later get
> > picked up by cpus that are in the same sched_group as that of balance_cpu.
>
> Not strictly true.. but close enough I think.

Do you mean that tasks pulled by balance_cpu could in turn get pulled by a cpu
outside the balance_cpu's sched_group (before its sibling cpus have had
a chance to pull them)? Yes that's a possibility, but we should see less of it
in practice. Will put a comment to that effect in the next version.

> If you can come up with a relatively simple example without rt tasks
> that would be preferred. Also I think the above description could be
> improved.
>
> The graph only confuses me and I'm not sure the description is maximally
> concise either.

Ok ..we will take a shot at making it more concise and better ..

> > Some solutions that were considered to solve this problem were:
> >
> > - Have the right sibling cpus to do load balance ignoring balance_cpu
> >
> > - Modify move_tasks to move a pinned tasks to a sibling cpu in the
> > same sched_group as env->dst_cpu. This will involve some runqueue
> > lock juggling (a third runqueue locks needs to be taken when we
> > already have two locks held). Moreover we may be just fine to ignore
> > that particular task and meet load balance goals by moving other
> > tasks.
> >
> > - Hint that move_tasks should be called with a different env->dst_cpu

I guess another approach is to try a different env->src_cpu, which still
seems to be more invasive than what is proposed here.

> Still no word on why doing a 3-way pull is ok.. (I think it is, I just
> want you to convince me)..

One theoretical problem that can arise with a generic 3-way pull is two cpus
deciding what load should move towards a given cpu (and not syncing with each
other in the process). Say that C1 and C2 decide (at nearly the same time) that
load X should get moved from C4 to C0. They don't realize each other's decision
and together end up moving twice that load to C0.

Now that seems remote, as we are restricting who can decide on load movement
towards a given cpu - essentially the given cpu itself or its balance_cpu with
this patch applied. The source cpu in each case being different (the given
cpu will attempt pulling from sibling cpus while balance_cpu will attempt to
pull from a cpu in a sibling group), I can't see how the two can conflict with
each other's decisions.

There is however the remotest possibility that load_balance in the context of a
given balance_cpu is performed concurrently (and thus leads to the
theoretical problem described above). Let's say that balance_cpu
was nohz-idle and some ilb_cpu started doing load_balance on its behalf.
Before it could finish (its vcpu got preempted?), balance_cpu became active and
started running load_balance by itself. These two load_balance() actions could
lead to conflicting decisions. This may not happen so much in practice though?
Alternatively, we can validate find_busiest_group and friends' assumption of
this_cpu's load (once the this_rq lock is taken).


> > +#define LBF_NEW_DST_CPU 0x04
>
> I still don't really like the new_dst_cpu name, I think it is because it
> describes a solution detail, not a problem diagnosis.

LBF_SOME_PINNED? LBF_SOME_TASKS_PINNED?

> > + } else {
> > + env->flags |= LBF_NEW_DST_CPU;
> > + env->new_dst_cpu = new_dst_cpu;
> > + }

[snip]

> This wants a comment describing why we're doing this.. this also wants a
> comment describing why we do what we do about multiple new_dst.. I think
> we should do the cheap thing and not compute a new dst if we already
> have one.

Yeah, given that the cpumask can be long on some platforms?

> For bonus points you'll also add a comment for the all pinned muck,
> although that's somewhat more obvious.

Ok!

> > - int ld_moved, active_balance = 0;
> > + int ld_moved, cur_ld_moved, active_balance = 0;
>
> better than the old_ld_moved, but still not really obvious.. maybe we
> can cure this with a comment...

Ok ..we will take a shot at improving this.

> > @@ -4514,8 +4538,23 @@ more_balance:
> > /*
> > * some other cpu did the load balance for us.
>
> so add something here describing the cur_ld_moved vs ld_moved thing..

Ok ..

> > + if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {
>
> Is this different from the last version?, ISTR only doing the loop again
> when we didn't move anything.

Hmm ..this bit has not changed since the last version. We may have moved
some load but still fallen short of meeting the target (because of some pinned
task), in which case we just retry with a new dst_cpu.

> > + /*
> > + * we could not balance completely as some tasks
> > + * were not allowed to move to the dst_cpu, so try
> > + * again with new_dst_cpu.
> > + */
> > + this_rq = cpu_rq(env.new_dst_cpu);
> > + env.dst_rq = this_rq;
> > + env.dst_cpu = env.new_dst_cpu;
> > + env.flags &= ~LBF_NEW_DST_CPU;
> > + env.loop = 0;
> > + env.loop_break = sched_nr_migrate_break;
> > + goto more_balance;
>
> What guarantees there's an end to this iteration?

We clear the LBF_NEW_DST_CPU flag each time, and so unless someone is
cleverly racing with us by modifying a task's cpus_allowed mask I can't
see how this can lead to a forever loop. Nevertheless I think it's good
that we put a check there for it! Will take a stab at it in the next
version.
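
For example, something like the below on top of the current patch (untested
sketch). load_balance() would get two new locals:

        int lb_iterations = 1;
        int max_lb_iterations = cpumask_weight(sched_group_cpus(sd->groups));

and the retry would then be bounded by the destination group's weight:

                if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0) &&
                    lb_iterations++ < max_lb_iterations) {
                        /*
                         * we could not balance completely as some tasks
                         * were not allowed to move to the dst_cpu, so try
                         * again with new_dst_cpu, but only a bounded number
                         * of times.
                         */
                        this_rq = cpu_rq(env.new_dst_cpu);
                        env.dst_rq = this_rq;
                        env.dst_cpu = env.new_dst_cpu;
                        env.flags &= ~LBF_NEW_DST_CPU;
                        env.loop = 0;
                        env.loop_break = sched_nr_migrate_break;
                        goto more_balance;
                }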

> Also we explicitly use
> more_balance over redo so that we're sure to iterate the same src cpu
> because otherwise our new dst_cpu is meaningless?

Yep.

> Sounds like something
> that would be good to mention in .. wait for it .. a comment?

Ok!

> > + .dst_grpmask = NULL,
>
> This is superfluous, c99 initialization zeros all unmentioned members.

Ok ...we will fix in next version. Thanks for your feedback!

- vatsa