2012-06-04 05:57:31

by Prashanth Nageshappa

Subject: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

Based on the description in
http://marc.info/?l=linux-kernel&m=133108682113018&w=2 , I was able to recreate
a problem wherein a SCHED_OTHER thread never gets runtime, even though there is
one allowed CPU where it can run and make progress.

On a dual socket box (4 cores per socket, 2 threads per core) with the
following config:
0  8   1  9      4  12  5  13
2  10  3  11     6  14  7  15
|__________|     |__________|
  socket 1         socket 2

If we have the following 4 tasks (2 SCHED_FIFO and 2 SCHED_OTHER) started in the
following order:
1> SCHED_FIFO cpu hogging task bound to cpu 1
2> SCHED_OTHER cpu hogging task bound to cpus 3 & 9 - running on cpu 3
sleeps and wakes up after all other tasks are started
3> SCHED_FIFO cpu hogging task bound to cpu 3
4> SCHED_OTHER cpu hogging task bound to cpu 9
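
A rough userspace sketch of the above setup (illustrative only; this is not
the actual test program, and the timing/ordering is only approximated):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

/* fork a child, pin it, set its policy, optionally sleep, then spin */
static void spawn_hog(int policy, const int *cpus, int ncpus, int sleep_first)
{
	cpu_set_t mask;
	struct sched_param sp = { .sched_priority = (policy == SCHED_FIFO) ? 1 : 0 };
	int i;

	if (fork() != 0)
		return;				/* parent goes on to start the next task */

	CPU_ZERO(&mask);
	for (i = 0; i < ncpus; i++)
		CPU_SET(cpus[i], &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) ||
	    sched_setscheduler(0, policy, &sp))
		exit(1);
	if (sleep_first)
		sleep(10);			/* task 2: sleep, wake after the rest have started */
	for (;;)
		;				/* cpu hog */
}

int main(void)					/* needs root (or rtprio rlimit) for SCHED_FIFO */
{
	int c1[] = { 1 }, c39[] = { 3, 9 }, c3[] = { 3 }, c9[] = { 9 };

	spawn_hog(SCHED_FIFO,  c1,  1, 0);	/* task 1 */
	spawn_hog(SCHED_OTHER, c39, 2, 1);	/* task 2 */
	spawn_hog(SCHED_FIFO,  c3,  1, 0);	/* task 3 */
	spawn_hog(SCHED_OTHER, c9,  1, 0);	/* task 4 */
	pause();
	return 0;
}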

Once all the 4 tasks are started, we observe that the 2nd task is starved of CPU
after waking up. When it wakes up, it wakes up on its prev_cpu (3), where
a FIFO task is now hogging the cpu. To prevent starvation, the 2nd task
needs to be pulled to cpu 9. However, of cpus 1 and 9, cpu1 is the chosen
cpu that attempts pulling tasks towards its core. When it tries pulling the
2nd task towards its core, it is unable to do so as cpu1 is not in the 2nd
task's cpus_allowed mask. Ideally cpu1 should note that the task can be
moved to its sibling and trigger that movement.

In this patch, we try to identify whether the load balancing goal was not achieved
completely because the destination cpu is not in the cpus_allowed mask of the target
task(s), and if so we retry load balancing, trying to move tasks to other cpus in the
same sched_group as the destination cpu.

Tested on tip commit cca44889.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Prashanth Nageshappa <[email protected]>

----

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de49ed5..da275d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
+#define LBF_NEW_DST_CPU 0x04

struct lb_env {
struct sched_domain *sd;
@@ -3108,6 +3109,8 @@ struct lb_env {
int dst_cpu;
struct rq *dst_rq;

+ struct cpumask *dst_grpmask;
+ int new_dst_cpu;
enum cpu_idle_type idle;
long imbalance;
unsigned int flags;
@@ -3198,7 +3201,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* 3) are cache-hot on their current CPU.
*/
if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ int new_dst_cpu;
+
+ if (!env->dst_grpmask) {
+ schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ return 0;
+ }
+ /*
+ * check if cpus_allowed has any cpus in the same sched_group
+ * as that of dst_cpu and set LBF_NEW_DST_CPU and new_dst_cpu
+ * accordingly
+ */
+ new_dst_cpu = cpumask_first_and(env->dst_grpmask,
+ tsk_cpus_allowed(p));
+ if (new_dst_cpu >= nr_cpu_ids) {
+ schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ } else {
+ env->flags |= LBF_NEW_DST_CPU;
+ env->new_dst_cpu = new_dst_cpu;
+ }
return 0;
}
env->flags &= ~LBF_ALL_PINNED;
@@ -4418,7 +4439,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
int *balance)
{
- int ld_moved, active_balance = 0;
+ int ld_moved, old_ld_moved, active_balance = 0;
struct sched_group *group;
struct rq *busiest;
unsigned long flags;
@@ -4428,6 +4449,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
+ .dst_grpmask = sched_group_cpus(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.find_busiest_queue = find_busiest_queue,
@@ -4461,6 +4483,7 @@ redo:
schedstat_add(sd, lb_imbalance[idle], env.imbalance);

ld_moved = 0;
+ old_ld_moved = 0;
if (busiest->nr_running > 1) {
/*
* Attempt to move tasks. If find_busiest_group has found
@@ -4488,12 +4511,27 @@ more_balance:
env.flags &= ~LBF_NEED_BREAK;
goto more_balance;
}
-
/*
* some other cpu did the load balance for us.
*/
- if (ld_moved && this_cpu != smp_processor_id())
- resched_cpu(this_cpu);
+ if ((ld_moved != old_ld_moved) &&
+ (env.dst_cpu != smp_processor_id()))
+ resched_cpu(env.dst_cpu);
+
+ if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {
+ /*
+ * we could not balance completely as some tasks
+ * were not allowed to move to the dst_cpu, so try
+ * again with new_dst_cpu.
+ */
+ this_rq = cpu_rq(env.new_dst_cpu);
+ env.dst_rq = this_rq;
+ env.dst_cpu = env.new_dst_cpu;
+ env.flags &= ~LBF_NEW_DST_CPU;
+ env.loop = 0;
+ old_ld_moved = ld_moved;
+ goto more_balance;
+ }

/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags & LBF_ALL_PINNED)) {
@@ -4694,6 +4732,7 @@ static int active_load_balance_cpu_stop(void *data)
.sd = sd,
.dst_cpu = target_cpu,
.dst_rq = target_rq,
+ .dst_grpmask = NULL,
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,


2012-06-04 09:01:18

by Peter Zijlstra

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 11:27 +0530, Prashanth Nageshappa wrote:
> Based on the description in
> http://marc.info/?l=linux-kernel&m=133108682113018&w=2 , I was able to recreate
> a problem wherein a SCHED_OTHER thread never gets runtime, even though there is
> one allowed CPU where it can run and make progress.

Do not use external references in changelogs if you don't absolutely
need to.

> On a dual socket box (4 cores per socket, 2 threads per core) with the
> following config:
> 0  8   1  9      4  12  5  13
> 2  10  3  11     6  14  7  15
> |__________|     |__________|
>   socket 1         socket 2
>
> If we have the following 4 tasks (2 SCHED_FIFO and 2 SCHED_OTHER) started in the
> following order:
> 1> SCHED_FIFO cpu hogging task bound to cpu 1
> 2> SCHED_OTHER cpu hogging task bound to cpus 3 & 9 - running on cpu 3
> sleeps and wakes up after all other tasks are started
> 3> SCHED_FIFO cpu hogging task bound to cpu 3
> 4> SCHED_OTHER cpu hogging task bound to cpu 9
>
> Once all the 4 tasks are started, we observe that the 2nd task is starved of CPU
> after waking up. When it wakes up, it wakes up on its prev_cpu (3), where
> a FIFO task is now hogging the cpu. To prevent starvation, the 2nd task
> needs to be pulled to cpu 9. However, of cpus 1 and 9, cpu1 is the chosen
> cpu that attempts pulling tasks towards its core. When it tries pulling the
> 2nd task towards its core, it is unable to do so as cpu1 is not in the 2nd
> task's cpus_allowed mask. Ideally cpu1 should note that the task can be
> moved to its sibling and trigger that movement.
>
> In this patch, we try to identify whether the load balancing goal was not achieved
> completely because the destination cpu is not in the cpus_allowed mask of the target
> task(s), and if so we retry load balancing, trying to move tasks to other cpus in the
> same sched_group as the destination cpu.

Either expand a little more on the implementation or preferably add some
comments.

> Tested on tip commit cca44889.

This is not a sane addition to the changelog.

> Signed-off-by: Srivatsa Vaddagiri <[email protected]>

Did vatsa write this patch? If so, you forgot a From header, if not,
wtf!?

> Signed-off-by: Prashanth Nageshappa <[email protected]>
>
> ----
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index de49ed5..da275d8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>
> #define LBF_ALL_PINNED 0x01
> #define LBF_NEED_BREAK 0x02
> +#define LBF_NEW_DST_CPU 0x04
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -3108,6 +3109,8 @@ struct lb_env {
> int dst_cpu;
> struct rq *dst_rq;
>
> + struct cpumask *dst_grpmask;
> + int new_dst_cpu;
> enum cpu_idle_type idle;
> long imbalance;
> unsigned int flags;
> @@ -3198,7 +3201,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * 3) are cache-hot on their current CPU.
> */
> if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
> - schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + int new_dst_cpu;
> +
> + if (!env->dst_grpmask) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + return 0;
> + }
> + /*
> + * check if cpus_allowed has any cpus in the same sched_group
> + * as that of dst_cpu and set LBF_NEW_DST_CPU and new_dst_cpu
> + * accordingly
> + */
> + new_dst_cpu = cpumask_first_and(env->dst_grpmask,
> + tsk_cpus_allowed(p));
> + if (new_dst_cpu >= nr_cpu_ids) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + } else {
> + env->flags |= LBF_NEW_DST_CPU;
> + env->new_dst_cpu = new_dst_cpu;
> + }

Only a flimsy comment on what the code does -- I can read C thank you.
Not a single comment on why it does this.

> return 0;
> }
> env->flags &= ~LBF_ALL_PINNED;
> @@ -4418,7 +4439,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct sched_domain *sd, enum cpu_idle_type idle,
> int *balance)
> {
> - int ld_moved, active_balance = 0;
> + int ld_moved, old_ld_moved, active_balance = 0;
> struct sched_group *group;
> struct rq *busiest;
> unsigned long flags;
> @@ -4428,6 +4449,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> .sd = sd,
> .dst_cpu = this_cpu,
> .dst_rq = this_rq,
> + .dst_grpmask = sched_group_cpus(sd->groups),
> .idle = idle,
> .loop_break = sched_nr_migrate_break,
> .find_busiest_queue = find_busiest_queue,
> @@ -4461,6 +4483,7 @@ redo:
> schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>
> ld_moved = 0;
> + old_ld_moved = 0;
> if (busiest->nr_running > 1) {
> /*
> * Attempt to move tasks. If find_busiest_group has found
> @@ -4488,12 +4511,27 @@ more_balance:
> env.flags &= ~LBF_NEED_BREAK;
> goto more_balance;
> }
> -
> /*
> * some other cpu did the load balance for us.
> */
> - if (ld_moved && this_cpu != smp_processor_id())
> - resched_cpu(this_cpu);
> + if ((ld_moved != old_ld_moved) &&
> + (env.dst_cpu != smp_processor_id()))
> + resched_cpu(env.dst_cpu);

The whole old_ld_moved stuff sucks, and preferably needs a rename, or at
least a comment.

> + if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {
> + /*
> + * we could not balance completely as some tasks
> + * were not allowed to move to the dst_cpu, so try
> + * again with new_dst_cpu.
> + */
> + this_rq = cpu_rq(env.new_dst_cpu);
> + env.dst_rq = this_rq;
> + env.dst_cpu = env.new_dst_cpu;
> + env.flags &= ~LBF_NEW_DST_CPU;
> + env.loop = 0;
> + old_ld_moved = ld_moved;
> + goto more_balance;
> + }
>

OK, so previously we only pulled to ourselves, now you make cpu x move
from cpu y to cpu z. This changes the dynamic of the load-balancer, not
a single word on that and its impact/ramifications.

2012-06-04 09:25:17

by Mike Galbraith

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 11:27 +0530, Prashanth Nageshappa wrote:
> Based on the description in
> http://marc.info/?l=linux-kernel&m=133108682113018&w=2 , I was able to recreate
> a problem wherein a SCHED_OTHER thread never gets runtime, even though there is
> one allowed CPU where it can run and make progress.
>
> On a dual socket box (4 cores per socket, 2 threads per core) with the
> following config:
> 0  8   1  9      4  12  5  13
> 2  10  3  11     6  14  7  15
> |__________|     |__________|
>   socket 1         socket 2
>
> If we have the following 4 tasks (2 SCHED_FIFO and 2 SCHED_OTHER) started in the
> following order:
> 1> SCHED_FIFO cpu hogging task bound to cpu 1
> 2> SCHED_OTHER cpu hogging task bound to cpus 3 & 9 - running on cpu 3
> sleeps and wakes up after all other tasks are started
> 3> SCHED_FIFO cpu hogging task bound to cpu 3
> 4> SCHED_OTHER cpu hogging task bound to cpu 9
>
> Once all the 4 tasks are started, we observe that the 2nd task is starved of CPU
> after waking up. When it wakes up, it wakes up on its prev_cpu (3), where
> a FIFO task is now hogging the cpu. To prevent starvation, the 2nd task
> needs to be pulled to cpu 9. However, of cpus 1 and 9, cpu1 is the chosen
> cpu that attempts pulling tasks towards its core. When it tries pulling the
> 2nd task towards its core, it is unable to do so as cpu1 is not in the 2nd
> task's cpus_allowed mask. Ideally cpu1 should note that the task can be
> moved to its sibling and trigger that movement.

Isn't this poking the wrong spot?

Making load balancing try to correct a bad situation created by a gone
insane SCHED_FIFO task looks wrong to me. Better would be to make sure
insane RT tasks cannot borrow runtime indefinitely. End result of 100%
SCHED_FIFO is dead box, so whether we have a spot where we could place
poor doomed SCHED_OTHER task seems kinda moot.

Also, seems everybody and his brother thinks their stuff is so critical
that they run stuff SCHED_FIFO/99, which is dainbramaged but seemingly
common practice. To make the system more robust in the face of that
insanity, we _could_ perhaps tick SCHED_FIFO when budget is staying
exceeded, which would allow sane threads at the same prio to get CPU
instead of returning CPU to the criminally insane. Better would be to
just detect elite sociopath, and noisily cancel his Unobtanium CPU Card.

-Mike

2012-06-04 11:41:40

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Peter Zijlstra <[email protected]> [2012-06-04 11:00:54]:

> > Signed-off-by: Srivatsa Vaddagiri <[email protected]>
>
> Did vatsa write this patch?

I wrote the first version of the patch which Prashanth took, tested,
fixed a bug and is finally publishing it. So yes,

> If so, you forgot a From header, if not, wtf!?

it is missing the From header.

> OK, so previously we only pulled to ourselves,

That's not entirely true, is it? i.e. this_cpu need not equal
smp_processor_id even before this change.

> now you make cpu x move
> from cpu y to cpu z. This changes the dynamic of the load-balancer, not
> a single word on that and its impact/ramifications.

The other possibility is for the right sibling cpus to do load balance
in the same domain (noting that it needs to pull a task from another
sched_group to itself and ignoring balance_cpu). That seemed like a more
invasive change than this patch. We'd be happy to try any other approach
you have in mind.

- vatsa

2012-06-04 11:50:01

by Peter Zijlstra

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 17:11 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <[email protected]> [2012-06-04 11:00:54]:
>
> > > Signed-off-by: Srivatsa Vaddagiri <[email protected]>
> >
> > Did vatsa write this patch?
>
> I wrote the first version of the patch which Prashanth took, tested,
> fixed a bug and is finally publishing it. So yes,
>
> > If so, you forgot a From header, if not, wtf!?
>
> it is missing the From header.
>
> > OK, so previously we only pulled to ourselves,
>
> That's not entirely true, is it? i.e. this_cpu need not equal
> smp_processor_id even before this change.

You forgot to finish that, I presume you were thinking of nohz idle
balancing? True, but in that case the target was at least idle.

> > now you make cpu x move
> > from cpu y to cpu z. This changes the dynamic of the load-balancer, not
> > a single word on that and its impact/ramifications.
>
> The other possibility is for the right sibling cpus to do load balance
> in the same domain (noting that it needs to pull a task from another
> sched_group to itself and ignoring balance_cpu). That seemed like a more
> invasive change than this patch. We'd be happy to try any other approach
> you have in mind.

I'm not saying the approach is bad, I'm just saying the patch is bad.
Mostly because there's a distinct lack of information on things.

There's nothing to indicate you've considered stuff, found this the best
solution because of other stuff etc... thus I think it's the first thing
that came to mind without due consideration.

I don't like unconsidered poking at the load-balancer.

2012-06-04 11:51:47

by Peter Zijlstra

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 17:11 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <[email protected]> [2012-06-04 11:00:54]:
>
> > > Signed-off-by: Srivatsa Vaddagiri <[email protected]>
> >
> > Did vatsa write this patch?
>
> I wrote the first version of the patch which Prashanth took, tested,
> fixed a bug and is finally publishing it. So yes,
>
> > If so, you forgot a From header, if not, wtf!?
>
> it is missing the From header.
>
> > OK, so previously we only pulled to ourselves,
>
> That's not entirely true, is it? i.e. this_cpu need not equal
> smp_processor_id even before this change.
>
> > now you make cpu x move
> > from cpu y to cpu z. This changes the dynamic of the load-balancer, not
> > a single word on that and its impact/ramifications.
>
> The other possibility is for the right sibling cpus to do load balance
> in the same domain (noting that it needs to pull a task from another
> sched_group to itself and ignoring balance_cpu). That seemed like a more
> invasive change than this patch. We'd be happy to try any other approach
> you have in mind.

Thing is, first thing on Monday morning my brain don't work too fast. If
I then get to basically reverse engineer a patch because people forgot
to mention all the important bits, I get annoyed.

So don't do that ;-)

2012-06-04 11:53:35

by Peter Zijlstra

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 11:25 +0200, Mike Galbraith wrote:
> Isn't this poking the wrong spot?

Yes and no, the use-case is definitely so-so.. However, even if a FIFO
task were to only consume 95% of time, we might still want to balance
things differently, and I don't think we do the sane thing there either.

But fully agreed, if you run FIFO at 100% you get to keep the pieces.

2012-06-04 12:27:55

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Peter Zijlstra <[email protected]> [2012-06-04 13:49:47]:

> > > OK, so previously we only pulled to ourselves,
> >
> > That's not entirely true, is it? i.e. this_cpu need not equal
> > smp_processor_id even before this change.
>
> You forgot to finish that, I presume you were thinking of nohz idle
> balancing?

Yes!

> True, but in that case the target was at least idle.

While that is true most of the time, afaics there is nothing preventing an
idle cpu from waking up and starting to run a task while somebody else
(ilb_cpu) is doing load balance on its behalf?

> > > now you make cpu x move
> > > from cpu y to cpu z. This changes the dynamic of the load-balancer, not
> > > a single word on that and its impact/ramifications.
> >
> > The other possibility is for the right sibling cpus to do load balance
> > in the same domain (noting that it needs to pull a task from another
> > sched_group to itself and ignoring balance_cpu). That seemed like a more
> > invasive change than this patch. We'd be happy to try any other approach
> > you have in mind.
>
> I'm not saying the approach is bad, I'm just saying the patch is bad.
> Mostly because there's a distinct lack of information on things.

The other possibility that was considered was to modify move_tasks() to
move a task to a different env->dst_cpu (as indicated by the task's
cpus_allowed mask and the sched_group's cpumask). Since at that point we would
have already taken two runqueue locks (src_rq and dst_rq), grabbing a third
runqueue lock (that of the new dst_cpu) would have meant some runqueue
lock/unlock dance which didn't look pretty either. Moreover, we may be
able to achieve the load balance goal by just ignoring that particular
task and wading through the rest of the tasks on the src_cpu.
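
To illustrate the dance (hypothetical sketch only, not proposed code): with
src_rq and dst_rq already locked inside move_tasks(), retargeting the move
to a third cpu would mean something like

	/* drop the currently held pair of runqueue locks ... */
	double_rq_unlock(env->src_rq, env->dst_rq);

	/* ... switch the destination (new_dst_cpu is a made-up name here) ... */
	env->dst_cpu = new_dst_cpu;
	env->dst_rq = cpu_rq(new_dst_cpu);

	/* ... and retake both locks in the proper order */
	double_rq_lock(env->src_rq, env->dst_rq);

	/* everything sampled under the old locks must now be revalidated */

which is the churn referred to above.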

> There's nothing to indicate you've considered stuff, found this the best
> solution because of other stuff etc... thus I think its the first thing
> that came to mind without due consideration.

We will modify the changelog to indicate all the possibilities
considered and publish the next version. Thanks for your feedback!

- vatsa

2012-06-04 12:47:50

by Mike Galbraith

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 13:53 +0200, Peter Zijlstra wrote:
> On Mon, 2012-06-04 at 11:25 +0200, Mike Galbraith wrote:
> > Isn't this poking the wrong spot?
>
> Yes and no, the use-case is definitely so-so.. However, even if a FIFO
> task were to only consume 95% of time, we might still want to balance
> things differently, and I don't think we do the sane thing there either.

You need a good reason to run RT, and being able to starve others to
death ain't it, so I don't see a good reason to care about the 95% case
enough to fiddle with load balancing to accommodate the oddball case.

Agreed it is a hole, but it's one dug by root. If you need so much CPU
that you can and will starve SCHED_OTHER to death, you need isolation
from SCHED_OTHER, lest they do evil things to your deadline, just as
much as they desperately need protection from your evil CPU usage ;-)

-Mike

2012-06-04 13:07:51

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Mike Galbraith <[email protected]> [2012-06-04 14:47:43]:

> You need a good reason to run RT, and being able to starve others to
> death ain't it, so I don't see a good reason to care about the 95% case
> enough to fiddle with load balancing to accommodate the oddball case.

While starvation of a SCHED_OTHER task was an extreme case, the point
remains that SCHED_OTHER tasks are better served by moving them away
from cpus running rt tasks that are partially cpu intensive. While the
current code has the nuts and bolts to recognize this situation
(scale_rt_power), it fails to effect SCHED_OTHER task movement because of how
one cpu from a sched_group is designated to pull tasks on behalf of its
siblings, and that chosen balance_cpu may not be in the task's cpus_allowed
mask (even though the task can run on one or more of its sibling cpus).
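
For reference, that recognition works roughly as below (paraphrased sketch
of scale_rt_power() in kernel/sched/fair.c; simplified, not the exact
kernel code):

	static unsigned long scale_rt_power_sketch(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		u64 total, available;

		/* window over which rq->rt_avg has been accumulated */
		total = sched_avg_period() + (rq->clock - rq->age_stamp);

		/* time left for fair tasks once rt (and irq) time is taken out */
		available = total > rq->rt_avg ? total - rq->rt_avg : 0;

		if (unlikely((s64)total < SCHED_POWER_SCALE))
			total = SCHED_POWER_SCALE;
		total >>= SCHED_POWER_SHIFT;

		/* fraction of SCHED_POWER_SCALE left over for SCHED_OTHER */
		return div_u64(available, total);
	}

A cpu that spends most of its time in rt tasks thus ends up with a small
cpu_power, and the fair balancer should route SCHED_OTHER load away from
it - provided the cpu doing the pull is actually allowed to run the task.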

- vatsa

2012-06-04 14:30:40

by Mike Galbraith

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 18:37 +0530, Srivatsa Vaddagiri wrote:
> * Mike Galbraith <[email protected]> [2012-06-04 14:47:43]:
>
> > You need a good reason to run RT, and being able to starve others to
> > death ain't it, so I don't see a good reason to care about the 95% case
> > enough to fiddle with load balancing to accommodate the oddball case.
>
> While starvation of a SCHED_OTHER task was an extreme case, the point
> remains that SCHED_OTHER tasks are better served by moving them away
> from cpus running rt tasks that are partially cpu intensive. While the
> current code has the nuts and bolts to recognize this situation
> (scale_rt_power), it fails to effect SCHED_OTHER task movement because of how
> one cpu from a sched_group is designated to pull tasks on behalf of its
> siblings, and that chosen balance_cpu may not be in the task's cpus_allowed
> mask (even though the task can run on one or more of its sibling cpus).

Yeah, this is true, it is a latency source and a fairness violation.
Slow path balance consideration does make some sense to me.

But, if you have an RT requirement, you can't afford to mix unknown
entities, nor over-commit etc. A realtime application will assign all
resources, so the load balancer becomes essentially unemployed. No?
Non critical worker-bees may be allowed to bounce around in say a
cpuset, but none of the CPUs which do critical work will ever be
over-committed, else application just lost the war. In that regard,
twiddling the load balancer to accommodate strange sounding case still
seems wrong to me.

-Mike

2012-06-04 14:38:14

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Mike Galbraith <[email protected]> [2012-06-04 16:30:34]:

> Yeah, this is true, it is a latency source and a fairness violation.
> Slow path balance consideration does make some sense to me.
>
> But, if you have an RT requirement, you can't afford to mix unknown
> entities, nor over-commit etc. A realtime application will assign all
> resources, so the load balancer becomes essentially unemployed. No?
> Non critical worker-bees may be allowed to bounce around in say a
> cpuset, but none of the CPUs which do critical work will ever be
> over-committed, else application just lost the war. In that regard,
> twiddling the load balancer to accommodate strange sounding case still
> seems wrong to me.

Btw, the patch should help the non-rt case as well, where a high-priority
SCHED_OTHER task is hogging the cpu while a low-priority SCHED_OTHER task
on that same cpu suffers because we choose not to move it to another
cpu (due to the way balance_cpu based load balance is written).

- vatsa

2012-06-04 14:41:41

by Mike Galbraith

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 20:08 +0530, Srivatsa Vaddagiri wrote:
> * Mike Galbraith <[email protected]> [2012-06-04 16:30:34]:
>
> > Yeah, this is true, it is a latency source and a fairness violation.
> > Slow path balance consideration does make some sense to me.
> >
> > But, if you have an RT requirement, you can't afford to mix unknown
> > entities, nor over-commit etc. A realtime application will assign all
> > resources, so the load balancer becomes essentially unemployed. No?
> > Non critical worker-bees may be allowed to bounce around in say a
> > cpuset, but none of the CPUs which do critical work will ever be
> > over-committed, else application just lost the war. In that regard,
> > twiddling the load balancer to accommodate strange sounding case still
> > seems wrong to me.
>
> Btw, the patch should help the non-rt case as well, where a high-priority
> SCHED_OTHER task is hogging the cpu while a low-priority SCHED_OTHER task
> on that same cpu suffers because we choose not to move it to another
> cpu (due to the way balance_cpu based load balance is written).

But high priority SCHED_OTHER tasks do not hog the CPU, they get their
fair share as defined by the user.

-Mike

2012-06-04 15:00:53

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Mike Galbraith <[email protected]> [2012-06-04 16:41:35]:

> But high priority SCHED_OTHER tasks do not hog the CPU, they get their
> fair share as defined by the user.

Consider this case. System with 2 cores (each with 2 threads) and 3
cgroups:

A (1024) -> has 2 tasks (A0, A1)
B (2048) -> has 2 tasks (B0, B1)
C (1024) -> has 1 task (C0 - pinned to CPUs 1,2)

(B0, B1) collectively are eligible to consume 2 full cpus worth of
bandwidth, (A0, A1) together are eligible to consume 1 full-cpu
worth of bandwidth and finally C0 is eligible to get 1 full-cpu worth of
bandwidth.

Currently C0 is sleeping as a result of which tasks could be spread as:

CPU0 -> A0
CPU1 -> A1

CPU2 -> B0
CPU3 -> B1

Now C0 wakes up and lands on CPU2 (which was its prev_cpu).

CPU0 -> A0
CPU1 -> A1

CPU2 -> B0, C0
CPU3 -> B1

Ideally CPU1 needs to pull C0 to itself (while A1 moves to CPU0). Do
you agree with that? I doubt that happens because of how CPU0 does load
balance on behalf of both itself and CPU1 (and thus fails to pull C0 to its
core).

- vatsa

2012-06-04 15:21:33

by Peter Zijlstra

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 20:30 +0530, Srivatsa Vaddagiri wrote:
> * Mike Galbraith <[email protected]> [2012-06-04 16:41:35]:
>
> > But high priority SCHED_OTHER tasks do not hog the CPU, they get their
> > fair share as defined by the user.
>
> Consider this case. System with 2 cores (each with 2 threads) and 3
> cgroups:
>
> A (1024) -> has 2 tasks (A0, A1)
> B (2048) -> has 2 tasks (B0, B1)
> C (1024) -> has 1 task (C0 - pinned to CPUs 1,2)
>
> (B0, B1) collectively are eligible to consume 2 full cpus worth of
> bandwidth, (A0, A1) together are eligible to consume 1 full-cpu
> worth of bandwidth and finally C0 is eligible to get 1 full-cpu worth of
> bandwidth.

The much simpler way to say that is: 5 tasks, two of 512, 3 of 1024.

> Currently C0 is sleeping as a result of which tasks could be spread as:
>
> CPU0 -> A0
> CPU1 -> A1
>
> CPU2 -> B0
> CPU3 -> B1
>
> Now C0 wakes up and lands on CPU2 (which was its prev_cpu).
>
> CPU0 -> A0
> CPU1 -> A1
>
> CPU2 -> B0, C0
> CPU3 -> B1

That's 512, 512, 2048, 1024.

> Ideally CPU1 needs to pull C0 to itself (while A1 moves to CPU0). Do
> you agree with that? I doubt that happens because of how CPU0 does load
> balance on behalf of both itself and CPU1 (and thus fails to pull C0 to its
> core).

Right, 0 can't pull C0 because of cpus_allowed; it can however pull B0,
resulting in: {A0, B0}:1536, {A1}:512, {C0}:1024, {B1}:1024. The next
balance pass of 1 will then pull A0, resulting in: {B0}:1024,
{A0, A1}:1024, {C0}:1024, {B1}:1024.

And all is well again.

That is not to say you couldn't contrive a scenario where it would be
needed..

2012-06-04 15:25:36

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Peter Zijlstra <[email protected]> [2012-06-04 17:21:11]:

> That is not to say you couldn't contrive a scenario where it would be
> needed..

Right .. what if B0, B1 are pinned?! I think most of the current
weakness lies in handling tasks that have affinity set.

- vatsa

2012-06-04 15:33:47

by Peter Zijlstra

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 20:55 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <[email protected]> [2012-06-04 17:21:11]:
>
> > That is not to say you couldn't contrive a scenario where it would be
> > needed..
>
> Right .. what if B0, B1 are pinned?! I think most of the current
> weakness lies in handling tasks that have affinity set.

Yeah, affinity is a pain.. the whole group_imb crap is due to that as
well.

Now I'm not as opposed to this as Mike is, the load cpu_power thing can
also happen due to excessive IRQ time, and hopefully we'll get to have
an unpriv. SCHED_DEADLINE at some point as well.

But we should try to keep the stuff reasonably sane and very much
consider the worst case compute time of the load-balancer. All that redo
logic can be triggered on purpose.

Thing is, you can create an arbitrarily hard problem by creating lots of
tasks with tricky masks, we shouldn't bend over backwards trying to
solve it just because.

[ Also, I suspect I wrecked the ALL_PINNED muck, shouldn't we reset
env.loop_break? ]

2012-06-04 15:46:17

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Peter Zijlstra <[email protected]> [2012-06-04 17:33:34]:

> Now I'm not as opposed to this as Mike is, the load cpu_power thing can
> also happen due to excessive IRQ time,

Yes good point.

> Thing is, you can create an arbitrarily hard problem by creating lots of
> tasks with tricky masks, we shouldn't bend over backwards trying to
> solve it just because.

Agreed. The change proposed here however seems simple enough to warrant
consideration?

> [ Also, I suspect I wrecked the ALL_PINNED muck, shouldn't we reset
> env.loop_break? ]

Yeah, I think so. That was one of the bugs we had in the first version of the
patch (where we had not reset loop_break and redo was failing to pull tasks
because of that).
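
Something along these lines in the ALL_PINNED path of load_balance()
(untested sketch of the idea, not a tested patch):

	/* All tasks on this runqueue were pinned by CPU affinity */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_empty(cpus)) {
			env.loop = 0;
			env.loop_break = sched_nr_migrate_break;
			goto redo;
		}
		goto out_balanced;
	}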

- vatsa

2012-06-04 16:56:35

by Mike Galbraith

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

On Mon, 2012-06-04 at 17:33 +0200, Peter Zijlstra wrote:

> Now I'm not as opposed to this as Mike is, the load cpu_power thing can
> also happen due to excessive IRQ time, and hopefully we'll get to have
> an unpriv. SCHED_DEADLINE at some point as well.

I'm not heavily opposed, was just questioning the RT justification.

-Mike

2012-06-04 17:37:41

by Srivatsa Vaddagiri

Subject: Re: [PATCH] sched: balance_cpu to consider other cpus in its group as target of (pinned) task migration

* Mike Galbraith <[email protected]> [2012-06-04 18:56:29]:

> I'm not heavily opposed, was just questioning the RT justification.

We will update the changelog to indicate all the scenarios/possibilities
discussed and publish the next version shortly.

- vatsa