2012-10-12 04:50:58

by Preeti U Murthy

Subject: [RFC PATCH 0/2] sched: Load Balancing using Per-entity-Load-tracking

Hi everyone,

This patchset uses the per-entity-load-tracking patchset, which will soon
be available in the kernel. It is based on the tip/master tree, and only
the first 8 patches of the latest sched:per-entity-load-tracking series
have been imported, to avoid the complexities of task groups and to hold
back the optimizations of that series for now.

This patchset is an attempt to begin the integration of the per-entity
load metric for the cfs_rq, henceforth referred to as PJT's metric, with
the load balancer in a stepwise fashion, and to progress based on the
consequences.

The following issues have been considered towards this:
[NOTE: an x% task referred to in the logs and below is defined over a
duty cycle of 10 ms; a 10% task, for example, runs for 1 ms and sleeps
for 9 ms in each period.]

1. Consider a scenario where there are two 10% tasks running on a cpu.
The present code considers the load on this queue to be 2048, while
PJT's metric calculates it to be below 1000, rarely exceeding that
limit. Although the tasks are not contributing much to the cpu load,
the scheduler decides to move them.
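
As a rough sketch of the arithmetic, assuming nice-0 tasks of weight
NICE_0_LOAD = 1024 (illustrative numbers, not measurements):

   present code: load = 2 tasks * 1024 = 2048
   PJT's metric: load ~= 2 * 1024 * (0.1 .. 0.4) ~= 200 .. 800

With PJT's metric each task contributes its weight scaled by its
decayed runnable history, and the exact value depends on how the
1 ms-on/9 ms-off periods line up with the decay window; hence the
queue load rarely exceeds 1024.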

But one could argue that 'not moving one of these tasks could throttle
them; if there was an idle cpu, perhaps we could have moved them'. While
the power-save mode would have been fine with not moving the task, the
performance mode would prefer not to throttle the tasks. We could strive
to strike a balance by making this decision tunable with certain
parameters. This patchset includes such tunables. This issue is
addressed in Patch[1/2].

2. We need to be able to do this cautiously, as the scheduler code is
complex; hence the stepwise integration of PJT's metric with the load
balancer described above, progressing based on the consequences.
I don't intend to vary the existing parameters used by the load
balancer. Some new parameters are however introduced to decide whether
to include a sched group as a candidate for load balancing.

This patchset therefore has two primary aims.
Patch[1/2]: This patch aims at detecting short-running tasks and
preventing their movement. In update_sg_lb_stats, dismiss a sched group
as a candidate for load balancing if the load calculated by PJT's metric
says that the average load on the sched_group is <= 1024 + (0.15 * 1024)
= 1178. This threshold is a tunable, which can be varied after
sufficient experiments.

Patch[2/2]: In the current scheduler, greater load is analogous to a
greater number of tasks. Therefore, when the busiest group is picked
from the sched domain in update_sd_lb_stats, only the loads of the
groups are compared between them. With PJT's metric, a higher load
does not necessarily mean a greater number of tasks. This patch
addresses this issue.

3. The next step towards integration should be to use PJT's metric for
the comparison between the loads of the busy sched group and the sched
group which has to pull the tasks, which happens in find_busiest_group.
---

Preeti U Murthy (2):
sched: Prevent movement of short running tasks during load balancing
sched: Pick the apt busy sched group during load balancing


kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)

--
Regards,
Preeti U Murthy


2012-10-12 04:51:07

by Preeti U Murthy

Subject: [RFC PATCH 1/2] sched: Prevent movement of short running tasks during load balancing

Prevent sched groups with a low load, as tracked by PJT's metric, from
being candidates for the load balance routine. The threshold is chosen
to be 1024 + 15% * 1024 = 1178. Using PJT's metric it has been observed
that even when three 10% tasks are running, the load sometimes does not
exceed this threshold. The call should be taken on whether the tasks
can afford to be throttled.

This is why an additional check has been included, which determines how
many queued tasks we can tolerate not moving even when the load is low.

Signed-off-by: Preeti U Murthy <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dbddcf6..dd0fb28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4188,6 +4188,7 @@ struct sd_lb_stats {
*/
struct sg_lb_stats {
unsigned long avg_load; /* Avg load across the CPUs of the group */
+ u64 avg_cfs_runnable_load; /* Equivalent of avg_load but calculated using PJT's metric */
unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long sum_nr_running; /* Nr tasks running in the group */
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
@@ -4504,6 +4505,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
unsigned long load, max_cpu_load, min_cpu_load;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long avg_load_per_task = 0;
+ u64 group_load = 0; /* computed using PJT's metric */
int i;

if (local_group)
@@ -4548,6 +4550,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (idle_cpu(i))
sgs->idle_cpus++;

+ group_load += cpu_rq(i)->cfs.runnable_load_avg;
update_sg_numa_stats(sgs, rq);
}

@@ -4572,6 +4575,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;

/*
+ * Check whether the sched group is within the load threshold.
+ *
+ * Also check that the sched group, while within the threshold, is not
+ * queueing up too many tasks. If both hold, mark it an invalid
+ * candidate for load balancing.
+ *
+ * The threshold below (1178 = 1024 + 0.15 * 1024) is a tunable to meet performance and power needs.
+ */
+ sgs->avg_cfs_runnable_load = (group_load * SCHED_POWER_SCALE) / group->sgp->power;
+ if (sgs->avg_cfs_runnable_load <= 1178 && sgs->sum_nr_running <= 2)
+ sgs->avg_cfs_runnable_load = 0;
+
+ /*
* Consider the group unbalanced when the imbalance is larger
* than the average weight of a task.
*
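
For experimenting with the threshold outside the kernel, here is a
minimal userspace model of the check this patch adds. It is my own
sketch: SCHED_POWER_SCALE, the 1178 cut-off and the nr_running bound
come from the patch; the struct, helper and sample values are purely
illustrative.

#include <stdio.h>

#define SCHED_POWER_SCALE	1024
#define LB_THRESHOLD		1178	/* 1024 + 0.15 * 1024 */

struct group_stats {
	unsigned long long group_load;	/* sum of per-cpu runnable_load_avg */
	unsigned long power;		/* group->sgp->power */
	unsigned int sum_nr_running;	/* tasks running in the group */
};

/* Mirrors the dismissal condition added to update_sg_lb_stats(). */
static int dismiss_group(const struct group_stats *s)
{
	unsigned long long avg = s->group_load * SCHED_POWER_SCALE / s->power;

	return avg <= LB_THRESHOLD && s->sum_nr_running <= 2;
}

int main(void)
{
	/* Two 10% tasks: tracked load stays well below the threshold. */
	struct group_stats light = { 800, SCHED_POWER_SCALE, 2 };
	/* Two cpu-bound tasks: tracked load crosses the threshold. */
	struct group_stats busy = { 2048, SCHED_POWER_SCALE, 2 };

	printf("light group dismissed: %d\n", dismiss_group(&light));
	printf("busy group dismissed: %d\n", dismiss_group(&busy));
	return 0;
}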

2012-10-12 04:51:14

by Preeti U Murthy

Subject: [RFC PATCH 2/2] sched: Pick the apt busy sched group during load balancing

If a sched group has passed the test for sufficient load in
update_sg_lb_stats to qualify for load balancing, then PJT's metric
has to be used to pick the right sched group as the busiest group.

The scenario which led to this patch is shown below:
Consider Task1 and Task2 to be long running tasks
and Tasks 3,4,5,6 to be short running tasks

Task3
Task4
Task1 Task5
Task2 Task6
------ ------
SCHED_GRP1 SCHED_GRP2

The normal load calculation would qualify SCHED_GRP2 as
the candidate for sd->busiest, due to the following loads
that it calculates:

SCHED_GRP1:2048
SCHED_GRP2:4096

The per-entity load calculation (PJT's metric) would instead probably
qualify SCHED_GRP1 as the candidate for sd->busiest, due to the
following loads that it calculates:

SCHED_GRP1:3200
SCHED_GRP2:1156

This patch aims to strike a balance between the load of a group and
the number of tasks running on it to decide the busiest group in the
sched_domain.

This means we will need to use PJT's metric, but with an additional
constraint.
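
Walking the scenario above through update_sd_pick_busiest with these
figures (assuming neither group is the local group and SCHED_GRP1 is
visited first): SCHED_GRP1 sets max_sg_load = 3200 and
busiest_nr_running = 2; SCHED_GRP2 then has 1156 <= 3200, and since
1156 is nowhere near max_sg_load, it is dismissed without its task
count even being consulted, leaving SCHED_GRP1 as sd->busiest.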

Signed-off-by: Preeti U Murthy <[email protected]>
---
kernel/sched/fair.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd0fb28..d45b7b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,7 +165,8 @@ void sched_init_granularity(void)
#else
# define WMULT_CONST (1UL << 32)
#endif
-
+#define NR_THRESHOLD 2
+#define LOAD_THRESHOLD 1
#define WMULT_SHIFT 32

/*
@@ -4169,6 +4170,7 @@ struct sd_lb_stats {
/* Statistics of the busiest group */
unsigned int busiest_idle_cpus;
unsigned long max_load;
+ u64 max_sg_load; /* Equivalent of max_load but calculated using PJT's metric */
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
@@ -4628,8 +4630,21 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->max_load)
- return false;
+ /* Use PJT's metric to qualify a sched_group as busy.
+ * But a low-load sched group may be queueing up many tasks.
+ *
+ * So before dismissing a sched group with a lesser load, check
+ * its number of tasks whenever its load is not too far below
+ * the maximum load seen so far.
+ */
+ if (sgs->avg_cfs_runnable_load <= sds->max_sg_load) {
+ if (sgs->avg_cfs_runnable_load > LOAD_THRESHOLD * sds->max_sg_load) {
+ if (sgs->sum_nr_running <= (NR_THRESHOLD + sds->busiest_nr_running))
+ return false;
+ } else {
+ return false;
+ }
+ }

if (sgs->sum_nr_running > sgs->group_capacity)
return true;
@@ -4708,6 +4723,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->this_idle_cpus = sgs.idle_cpus;
} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
sds->max_load = sgs.avg_load;
+ sds->max_sg_load = sgs.avg_cfs_runnable_load;
sds->busiest = sg;
sds->busiest_nr_running = sgs.sum_nr_running;
sds->busiest_idle_cpus = sgs.idle_cpus;
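
For review purposes, here is a standalone restatement of the dismissal
logic this hunk adds. The helper and its name are my own sketch; the
thresholds and the shape of the condition are taken from the patch,
and the later capacity/ASYM checks of the full function are not
modelled.

typedef unsigned long long u64;	/* stand-in for the kernel type */

#define NR_THRESHOLD	2	/* from the patch */
#define LOAD_THRESHOLD	1	/* from the patch */

/* Return 1 if a group with tracked load @load and @nr_running tasks
 * should NOT displace the current busiest group; this mirrors the
 * early return-false paths of update_sd_pick_busiest().
 */
static int dismiss_as_busiest(u64 load, unsigned int nr_running,
			      u64 max_sg_load,
			      unsigned long busiest_nr_running)
{
	if (load > max_sg_load)
		return 0;	/* strictly busier: keep it as a candidate */

	/*
	 * A lower-load group survives only if its load is close to the
	 * maximum seen so far AND it queues sufficiently more tasks.
	 * With the integer LOAD_THRESHOLD of 1, load can never exceed
	 * LOAD_THRESHOLD * max_sg_load in this branch, so the escape
	 * only opens up once the threshold is made fractional, e.g.
	 * 90% of max_sg_load.
	 */
	if (load > LOAD_THRESHOLD * max_sg_load &&
	    nr_running > NR_THRESHOLD + busiest_nr_running)
		return 0;

	return 1;
}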

2012-10-12 05:16:48

by Preeti U Murthy

Subject: Re: [RFC PATCH 2/2] sched: Pick the apt busy sched group during load balancing

Hi everyone,

The figures SCHED_GRP1:3200 and SCHED_GRP2:1156 shown below in the
changelog are the probable figures as calculated with the per-entity-
load-tracking metric for the runqueue load.

> If a sched group has passed the test for sufficient load in
> update_sg_lb_stats to qualify for load balancing, then PJT's metric
> has to be used to pick the right sched group as the busiest group.
>
> The scenario which led to this patch is shown below:
> Consider Task1 and Task2 to be long running tasks
> and Tasks 3,4,5,6 to be short running tasks
>
> Task3
> Task4
> Task1 Task5
> Task2 Task6
> ------ ------
> SCHED_GRP1 SCHED_GRP2
>
> The normal load calculation would qualify SCHED_GRP2 as
> the candidate for sd->busiest, due to the following loads
> that it calculates:
>
> SCHED_GRP1:2048
> SCHED_GRP2:4096
>
> The per-entity load calculation (PJT's metric) would instead probably
> qualify SCHED_GRP1 as the candidate for sd->busiest, due to the
> following loads that it calculates:
>
> SCHED_GRP1:3200
> SCHED_GRP2:1156
>
Regards
Preeti

2012-10-15 09:14:59

by Preeti U Murthy

Subject: Re: [RFC PATCH 0/2] sched: Load Balancing using Per-entity-Load-tracking

Hi everyone,

The past few weeks must have been very tight due to the merge
window phase. Looks like it went through well :)

I request you to comment on this patchset, primarily because it forces
us to look into the scenario below:

A highly loaded sched group does not mean too many tasks, and a less
loaded sched group does not mean too few tasks.

Thank you

Regards
Preeti

2012-10-18 17:26:15

by Morten Rasmussen

Subject: Re: [RFC PATCH 0/2] sched: Load Balancing using Per-entity-Load-tracking

Hi Preeti,

I'm pleased to see that someone found the time to start looking at this.

On Fri, Oct 12, 2012 at 05:50:36AM +0100, Preeti U Murthy wrote:
> Hi everyone,
>
> This patchset uses the per-entity-load-tracking patchset, which will soon
> be available in the kernel. It is based on the tip/master tree, and only
> the first 8 patches of the latest sched:per-entity-load-tracking series
> have been imported, to avoid the complexities of task groups and to hold
> back the optimizations of that series for now.
>
> This patchset is an attempt to begin the integration of the per-entity
> load metric for the cfs_rq, henceforth referred to as PJT's metric, with
> the load balancer in a stepwise fashion, and to progress based on the
> consequences.
>
> The following issues have been considered towards this:
> [NOTE: an x% task referred to in the logs and below is defined over a
> duty cycle of 10 ms; a 10% task, for example, runs for 1 ms and sleeps
> for 9 ms in each period.]
>
> 1. Consider a scenario where there are two 10% tasks running on a cpu.
> The present code considers the load on this queue to be 2048, while
> PJT's metric calculates it to be below 1000, rarely exceeding that
> limit. Although the tasks are not contributing much to the cpu load,
> the scheduler decides to move them.

I guess that you assume, for now, that all tasks have default (nice 0)
priority? Both the old load and the PJT metric (tracked load) depend on
priority.

>
> But one could argue that 'not moving one of these tasks could throttle
> them; if there was an idle cpu, perhaps we could have moved them'. While
> the power-save mode would have been fine with not moving the task, the
> performance mode would prefer not to throttle the tasks. We could strive
> to strike a balance by making this decision tunable with certain
> parameters. This patchset includes such tunables. This issue is
> addressed in Patch[1/2].
>

One could also argue that as long as there are spare cpu cycles in each
schedule period, all tasks have received the cpu time they needed.
So from that point of view performance isn't affected by not balancing
the tasks as long as the cpu is not fully utilized. If we look at the
problem from a latency point of view then packing tasks on a single cpu
will increase latency but the increase will be bounded by the schedule
period.

> 2. We need to be able to do this cautiously, as the scheduler code is
> complex; hence the stepwise integration of PJT's metric with the load
> balancer described above, progressing based on the consequences.
> I don't intend to vary the existing parameters used by the load
> balancer. Some new parameters are however introduced to decide whether
> to include a sched group as a candidate for load balancing.
>
> This patchset therefore has two primary aims.
> Patch[1/2]: This patch aims at detecting short-running tasks and
> preventing their movement. In update_sg_lb_stats, dismiss a sched group
> as a candidate for load balancing if the load calculated by PJT's metric
> says that the average load on the sched_group is <= 1024 + (0.15 * 1024)
> = 1178. This threshold is a tunable, which can be varied after
> sufficient experiments.

Your current threshold implies that there must be at least two (nice 0)
tasks running to breach the threshold and they need to be quite busy. This
makes sense to me. When you have more tasks they are more likely to be
waiting on the runqueue even if they are only 10% tasks. Let's say you have
five 10% tasks and they all become runnable at the same instant. In that
case some of the tasks would have a tracked load which is much higher
than if we only had two 10% tasks running. So if I'm not mistaken, it
would be possible to breach the threshold even though the overall cpu
utilization is only 50% and it would have been safe not to load-balance
that cpu.
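
To put rough numbers on this (my illustration, assuming nice-0 tasks
that run strictly back to back): if five 10% tasks wake at the same
instant, the last one to get the cpu waits ~4 ms and is thus runnable
for ~5 ms of every 10 ms, so its tracked load tends towards
1024 * 0.5 ~= 512 rather than ~102; the five contributions then sum to
roughly 1024 * (0.1 + 0.2 + 0.3 + 0.4 + 0.5) ~= 1536, which is above
the 1178 threshold while cpu utilization is only 50%.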

Do you think it would make sense to let the threshold depend on the
number of tasks on the cpu somehow?

Alternatively, the decision could be based on the cpu idle time over the
last schedule period. A cpu with no or very few spare cycles in the last
schedule period would be a good candidate for load-balancing. Latency
would be affected as mentioned earlier.

What are your thoughts about this?

Morten

>
> Patch[2/2]: In the current scheduler, greater load is analogous to a
> greater number of tasks. Therefore, when the busiest group is picked
> from the sched domain in update_sd_lb_stats, only the loads of the
> groups are compared between them. With PJT's metric, a higher load
> does not necessarily mean a greater number of tasks. This patch
> addresses this issue.
>
> 3. The next step towards integration should be to use PJT's metric for
> the comparison between the loads of the busy sched group and the sched
> group which has to pull the tasks, which happens in find_busiest_group.
> ---
>
> Preeti U Murthy (2):
> sched: Prevent movement of short running tasks during load balancing
> sched: Pick the apt busy sched group during load balancing
>
>
> kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++++---
> 1 file changed, 35 insertions(+), 3 deletions(-)
>
> --
> Regards,
> Preeti U Murthy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
>

2012-10-19 04:18:08

by Preeti U Murthy

Subject: Re: [RFC PATCH 0/2] sched: Load Balancing using Per-entity-Load-tracking

Hi Morten,
Thank you very much for your review.

>> 1. Consider a scenario where there are two 10% tasks running on a cpu.
>> The present code considers the load on this queue to be 2048, while
>> PJT's metric calculates it to be below 1000, rarely exceeding that
>> limit. Although the tasks are not contributing much to the cpu load,
>> the scheduler decides to move them.
>
> I guess that you assume, for now, that all tasks have default (nice 0)
> priority? Both the old load and the PJT metric (tracked load) depend on
> priority.
That's right. I have assumed default priority for the tasks.
>
>>
>> But one could argue that 'not moving one of these tasks could throttle
>> them; if there was an idle cpu, perhaps we could have moved them'. While
>> the power-save mode would have been fine with not moving the task, the
>> performance mode would prefer not to throttle the tasks. We could strive
>> to strike a balance by making this decision tunable with certain
>> parameters. This patchset includes such tunables. This issue is
>> addressed in Patch[1/2].
>>
>
> One could also argue that as long as there are spare cpu cycles in each
> schedule period, all tasks have received the cpu time they needed.
> So from that point of view performance isn't affected by not balancing
> the tasks as long as the cpu is not fully utilized. If we look at the
> problem from a latency point of view then packing tasks on a single cpu
> will increase latency but the increase will be bounded by the schedule
> period.
>
Assume that at the end of one scheduling period there are a few spare
cycles on the cpu. This is fine from both the performance and latency
points of view at *this* point; nobody is waiting for the cpu.

The issue arises if it is detected that these spare cycles are due to
*sleeping tasks* and not due to an absence of tasks.

At that point a decision needs to be made: if a scenario arises where
all these tasks wake up at the same time in the future and wait on the
cpu, are we ok with them waiting? Both the performance and latency
views could be against this, as it also means less throughput. But the
performance view could go slightly easier on this and argue that it is
ok if 2-3 tasks wait; if more, then there is a need to move them.
>> This patchset therefore has two primary aims.
>> Patch[1/2]: This patch aims at detecting short-running tasks and
>> preventing their movement. In update_sg_lb_stats, dismiss a sched group
>> as a candidate for load balancing if the load calculated by PJT's metric
>> says that the average load on the sched_group is <= 1024 + (0.15 * 1024)
>> = 1178. This threshold is a tunable, which can be varied after
>> sufficient experiments.
>
> Your current threshold implies that there must be at least two (nice 0)
> tasks running to breach the threshold and they need to be quite busy. This
> makes sense to me. When you have more tasks they are more likely to be
> waiting on the runqueue even if they are only 10% tasks. Let's say you have
> five 10% tasks and they all become runnable at the same instant. In that
> case some of the tasks would have a tracked load which is much higher
> than if we only had two 10% tasks running. So if I'm not mistaken, it
> would be possible to breach the threshold even though the overall cpu
> utilization is only 50% and it would have been safe not to load-balance
> that cpu.
>
> Do you think it would make sense to let the threshold depend on the
> number of tasks on the cpu somehow?

You are right, Morten. In fact I have included this viewpoint in both
the first and second patches enclosed here. So let's take up the above
scenario: if there are five 10% tasks running, they will surely cross
the threshold, but the cpu might have spare cycles at the end of a
scheduling period. That is your concern.

Again we have two different viewpoints. This threshold is like a tuning
knob. We could increase it if we feel that the threshold gets reached
too quickly with as few as five tasks although the cpu utilization is
poor; we prefer not to wake up another cpu unless the present cpu is
aptly loaded. We could call this the power-saving view.

Or we could say that, since we do not intend to affect the throughput
of tasks, we prefer the knob stay at this value, so that we qualify
such a load as a candidate for load balancing. We could call this the
performance view.
>
> Alternatively, the decision could be based on the cpu idle time over the
> last schedule period. A cpu with no or very few spare cycles in the last
> schedule period would be a good candidate for load-balancing. Latency
> would be affected as mentioned earlier.
>
Exactly: idle_time == spare_cpu_cycles == less cpu_utilization. I hope
I am not wrong in drawing this equivalence. If that is the case, then
the same explanation as above holds good here too.
>
> Morten

Thank you

Regards
Preeti
