On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
>
> >> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
> >> From: Alex Shi <[email protected]>
> >> Date: Sat, 23 Nov 2013 23:18:09 +0800
> >> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> >>
> >> Task migration happens when target just a bit less then source cpu load.
> >> To reduce such situation happens, aggravate the target cpu load with
> >> sd->imbalance_pct/100 in wake_affine.
> >>
> >> In find_idlest/busiest_group, change the aggravate to local cpu only
> >> from old group aggravation.
> >>
> >> on my pandaboard ES.
> >>
> >> latest kernel 527d1511310a89 + whole patchset
> >> hackbench -T -g 10 -f 40
> >> 23.25" 21.99"
> >> 23.16" 21.20"
> >> 24.24" 21.89"
> >> hackbench -p -g 10 -f 40
> >> 26.52" 21.46"
> >> 23.89" 22.96"
> >> 25.65" 22.73"
> >> hackbench -P -g 10 -f 40
> >> 20.14" 19.72"
> >> 19.96" 19.10"
> >> 21.76" 20.03"
> >>
> >> Signed-off-by: Alex Shi <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 35 ++++++++++++++++-------------------
> >> 1 file changed, 16 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index bccdd89..3623ba4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> >>
> >> static unsigned long weighted_cpuload(const int cpu);
> >> static unsigned long source_load(int cpu);
> >> -static unsigned long target_load(int cpu);
> >> +static unsigned long target_load(int cpu, int imbalance_pct);
> >> static unsigned long power_of(int cpu);
> >> static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> >>
> >> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
> >> * Return a high guess at the load of a migration-target cpu weighted
> >> * according to the scheduling class and "nice" value.
> >> */
> >> -static unsigned long target_load(int cpu)
> >> +static unsigned long target_load(int cpu, int imbalance_pct)
> >> {
> >> struct rq *rq = cpu_rq(cpu);
> >> unsigned long total = weighted_cpuload(cpu);
> >>
> >> + /*
> >> + * without cpu_load decay, in most of time cpu_load is same as total
> >> + * so we need to make target a bit heavier to reduce task migration
> >> + */
> >> + total = total * imbalance_pct / 100;
> >> +
> >> if (!sched_feat(LB_BIAS))
> >> return total;
> >>
> >> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >> this_cpu = smp_processor_id();
> >> prev_cpu = task_cpu(p);
> >> load = source_load(prev_cpu);
> >> - this_load = target_load(this_cpu);
> >> + this_load = target_load(this_cpu, 100);
> >>
> >> /*
> >> * If sync wakeup then subtract the (maximum possible)
> >> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>
> >> if (balanced ||
> >> (this_load <= load &&
> >> - this_load + target_load(prev_cpu) <= tl_per_task)) {
> >> + this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
> >> /*
> >> * This domain has SD_WAKE_AFFINE and
> >> * p is cache cold in this domain, and
> >> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >> {
> >> struct sched_group *idlest = NULL, *group = sd->groups;
> >> unsigned long min_load = ULONG_MAX, this_load = 0;
> >> - int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >>
> >> do {
> >> unsigned long load, avg_load;
> >> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>
> >> for_each_cpu(i, sched_group_cpus(group)) {
> >> /* Bias balancing toward cpus of our domain */
> >> - if (local_group)
> >> + if (i == this_cpu)
> >
> > What is the motivation for changing the local_group load calculation?
> > Now the load contributions of all cpus in the local group, except
> > this_cpu, will contribute more as their contribution (this_load) is
> > determined using target_load() instead.
>
> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
> 2 cores(guess no HT at that time) in cpu socket. With the cores number
NUMA support was already present. I guess that means support for systems
with significantly more than two cpus.
> increasing trend, the sched_group become large and large, to give whole
> group this bias value is becoming non-sense. So it looks reasonable to
> just bias this cpu only.
I think the question is whether you want to bias the whole group or
not at wake-up balancing. If you don't change the bias weight and only
apply it to this_cpu, you will be more prone to send the waking task
somewhere else.
> >
> > If I'm not mistaken, that will lead to more frequent load balancing as
> > the local_group bias has been reduced. That is the opposite of your
> > intentions based on your comment in target_load().
> >
> >> load = source_load(i);
> >> else
> >> - load = target_load(i);
> >> + load = target_load(i, sd->imbalance_pct);
> >
> > You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> > that you removed above. sd->imbalance_pct may have been arbitrarily
> > chosen in the past, but changing it may affect behavior.
>
> the first commit with this line:
> int imbalance = 100 + (sd->imbalance_pct-100)/2;
> is 147cbb4bbe99, that also has no explanation for this imbalance value.
100 + (sd->imbalance_pct-100)/2 is used in several places in sched.c before
147cbb4bbe99. It seems that it was used to encourage balancing in
certain situations such as wake-up, while sd->imbalance_pct was used
unmodified in other situations.
That seems to still be the case in v3.13-rc6. find_busiest_group() and
task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
100)/2. The pattern seems to be to use the reduced imbalance_pct for
wake-up and the full imbalance_pct for balancing of runnable tasks.
AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
bias towards the local group. However, I would not be comfortable
changing it without understanding the full implications.
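For reference, a minimal standalone sketch (invented names, not the
kernel code itself) of the two bias forms, and of how the check this
patch removes from find_idlest_group() used the reduced one:

static inline int wakeup_bias(int imbalance_pct)
{
	/* reduced bias used on the wake-up paths: 125 -> 112 */
	return 100 + (imbalance_pct - 100) / 2;
}

/*
 * Stay in the local group unless the idlest remote group is lighter
 * by more than the wake-up bias (the removed
 * "100*this_load < imbalance*min_load" check).
 */
static inline int stay_local(unsigned long this_load,
			     unsigned long min_load, int imbalance_pct)
{
	return 100 * this_load <
		(unsigned long)wakeup_bias(imbalance_pct) * min_load;
}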
> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
> reason to use half of imbalance_pct.
Based on the comment removed in 147cbb4bbe99 it seems that the reason is
to encourage balancing at wake-up. It doesn't seem to have any relation
to the number of cpus.
> but sched_domain is used widely for many kinds of platform/cpu type, the
> value is arbitrary too.
> Another reason to use it, I scaled any cpu load which is not this_cpu,
> so a bit conservation is to use origin imbalance_pct, not half of it.
The net result is almost the same if the groups have two cpus each.
Stronger for groups with one cpu, and weaker for groups with more than
two cpus.
> >
> >>
> >> avg_load += load;
> >> }
> >> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >> }
> >> } while (group = group->next, group != sd->groups);
> >>
> >> - if (!idlest || 100*this_load < imbalance*min_load)
> >> + if (!idlest || this_load < min_load)
> >> return NULL;
> >> return idlest;
> >> }
> >> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>
> >> nr_running = rq->nr_running;
> >>
> >> - /* Bias balancing toward cpus of our domain */
> >> - if (local_group)
> >> - load = target_load(i);
> >> + /* Bias balancing toward dst cpu */
> >> + if (env->dst_cpu == i)
> >> + load = target_load(i, env->sd->imbalance_pct);
> >
> > Here you do the same group load bias change as above.
> >
I don't think it is right to change the bias here to only apply to the
dst_cpu. rebalance_domains()/load_balance() is balancing groups, not just
pulling tasks to dst_cpu. It must pull enough tasks to balance the
groups, even if it means temporarily overloading dst_cpu. The bias
should therefore apply to the whole group.
> >> else
> >> load = source_load(i);
> >>
> >> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >> if ((local->idle_cpus < busiest->idle_cpus) &&
> >> busiest->sum_nr_running <= busiest->group_weight)
> >> goto out_balanced;
> >> - } else {
> >> - /*
> >> - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> >> - * imbalance_pct to be conservative.
> >> - */
> >> - if (100 * busiest->avg_load <=
> >> - env->sd->imbalance_pct * local->avg_load)
> >> - goto out_balanced;
> >> }
> >>
> >> force_balance:
> >
> > As said my previous replies to this series, I think this problem should
> > be solved by fixing the cause of the problem, that is the cpu_load
> > calculation, instead of biasing the cpu_load where-ever it is used to
> > hide the problem.
> >
> > Doing a bit of git archaeology reveals that the cpu_load code goes back
> > to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
> > patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
> > my opinion that made logical sense. If we are about to change to *_idx=0
> > we are removing the main idea behind that code and there needs to be a
> > new one. Otherwise, cpu_load doesn't make sense.
>
> The commit is written in 2005. The CFS scheduler merged into kernel in
> Oct 2007. it is a too old legacy for us ...
And that is why I think cpu_load should be reconsidered. It doesn't make
sense to cripple the cpu_load code without fully understanding what it
is doing and then try to hide the problems it causes.
The cpu_load code might be older than CFS but in my opinion it still
makes sense. Changing cpu_load to be a snapshot of weighted_cpuload()
does not.
Morten
On 01/03/2014 12:04 AM, Morten Rasmussen wrote:
> On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
>>
>>>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
>>>> From: Alex Shi <[email protected]>
>>>> Date: Sat, 23 Nov 2013 23:18:09 +0800
>>>> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
>>>>
>>>> Task migration happens when target just a bit less then source cpu load.
>>>> To reduce such situation happens, aggravate the target cpu load with
>>>> sd->imbalance_pct/100 in wake_affine.
>>>>
>>>> In find_idlest/busiest_group, change the aggravate to local cpu only
>>>> from old group aggravation.
>>>>
>>>> on my pandaboard ES.
>>>>
>>>> latest kernel 527d1511310a89 + whole patchset
>>>> hackbench -T -g 10 -f 40
>>>> 23.25" 21.99"
>>>> 23.16" 21.20"
>>>> 24.24" 21.89"
>>>> hackbench -p -g 10 -f 40
>>>> 26.52" 21.46"
>>>> 23.89" 22.96"
>>>> 25.65" 22.73"
>>>> hackbench -P -g 10 -f 40
>>>> 20.14" 19.72"
>>>> 19.96" 19.10"
>>>> 21.76" 20.03"
>>>>
>>>> Signed-off-by: Alex Shi <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 35 ++++++++++++++++-------------------
>>>> 1 file changed, 16 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index bccdd89..3623ba4 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
>>>>
>>>> static unsigned long weighted_cpuload(const int cpu);
>>>> static unsigned long source_load(int cpu);
>>>> -static unsigned long target_load(int cpu);
>>>> +static unsigned long target_load(int cpu, int imbalance_pct);
>>>> static unsigned long power_of(int cpu);
>>>> static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>>>>
>>>> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
>>>> * Return a high guess at the load of a migration-target cpu weighted
>>>> * according to the scheduling class and "nice" value.
>>>> */
>>>> -static unsigned long target_load(int cpu)
>>>> +static unsigned long target_load(int cpu, int imbalance_pct)
>>>> {
>>>> struct rq *rq = cpu_rq(cpu);
>>>> unsigned long total = weighted_cpuload(cpu);
>>>>
>>>> + /*
>>>> + * without cpu_load decay, in most of time cpu_load is same as total
>>>> + * so we need to make target a bit heavier to reduce task migration
>>>> + */
>>>> + total = total * imbalance_pct / 100;
>>>> +
>>>> if (!sched_feat(LB_BIAS))
>>>> return total;
>>>>
>>>> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>> this_cpu = smp_processor_id();
>>>> prev_cpu = task_cpu(p);
>>>> load = source_load(prev_cpu);
>>>> - this_load = target_load(this_cpu);
>>>> + this_load = target_load(this_cpu, 100);
>>>>
>>>> /*
>>>> * If sync wakeup then subtract the (maximum possible)
>>>> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>>
>>>> if (balanced ||
>>>> (this_load <= load &&
>>>> - this_load + target_load(prev_cpu) <= tl_per_task)) {
>>>> + this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
>>>> /*
>>>> * This domain has SD_WAKE_AFFINE and
>>>> * p is cache cold in this domain, and
>>>> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>> {
>>>> struct sched_group *idlest = NULL, *group = sd->groups;
>>>> unsigned long min_load = ULONG_MAX, this_load = 0;
>>>> - int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>>>
>>>> do {
>>>> unsigned long load, avg_load;
>>>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>>
>>>> for_each_cpu(i, sched_group_cpus(group)) {
>>>> /* Bias balancing toward cpus of our domain */
>>>> - if (local_group)
>>>> + if (i == this_cpu)
>>>
>>> What is the motivation for changing the local_group load calculation?
>>> Now the load contributions of all cpus in the local group, except
>>> this_cpu, will contribute more as their contribution (this_load) is
>>> determined using target_load() instead.
>>
>> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
>> 2 cores(guess no HT at that time) in cpu socket. With the cores number
>
> NUMA support was already present. I guess that means support for systems
> with significantly more than two cpus.
Thanks a lot for the comments, Morten!
find_idlest_group() is used in select_task_rq_fair(), which is designed
to select a cpu near the old cpu, whether for waking or forking tasks. So
in the common case the selected cpu is rarely on another NUMA node.
>
>> increasing trend, the sched_group become large and large, to give whole
>> group this bias value is becoming non-sense. So it looks reasonable to
>> just bias this cpu only.
>
> I think that is question whether you want to bias the whole group or
> not at wake-up balancing. If you don't change the bias weight and only
> apply it to this_cpu, you will be more prone to send the waking task
> somewhere else.
We don't need to follow the old logic exactly. I guess people will be
happier to see simpler code/logic with better benchmark
performance. :)
>
>>>
>>> If I'm not mistaken, that will lead to more frequent load balancing as
>>> the local_group bias has been reduced. That is the opposite of your
>>> intentions based on your comment in target_load().
>>>
>>>> load = source_load(i);
>>>> else
>>>> - load = target_load(i);
>>>> + load = target_load(i, sd->imbalance_pct);
>>>
>>> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
>>> that you removed above. sd->imbalance_pct may have been arbitrarily
>>> chosen in the past, but changing it may affect behavior.
>>
>> the first commit with this line:
>> int imbalance = 100 + (sd->imbalance_pct-100)/2;
>> is 147cbb4bbe99, that also has no explanation for this imbalance value.
>
> 100 + (sd->imbalance_pct-100)/2 is used several places in sched.c before
> 147cbb4bbe99. It seems that it was used to encourage balancing in
> certain situations such as wake-up, while sd->imbalance_pct was used
> unmodified in other situations.
>
> That seems to still be the case in v3.13-rc6. find_busiest_group() and
> task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
> wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
> 100)/2. The pattern seems to be to use the reduced imbalance_pct for
> wake-up and the full imbalance_pct for balancing of runnable tasks.
>
> AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
> bias towards the local group. However, I would not be comfortable
> changing it without understanding the full implications.
>
>> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
>> reason to use half of imbalance_pct.
>
> Based on the comment removed in 147cbb4bbe99 it seems that the reason is
> to encourage balancing at wake-up. It doesn't seem to have any relation
> to the number of cpus.
>
>> but sched_domain is used widely for many kinds of platform/cpu type, the
>> value is arbitrary too.
>> Another reason to use it, I scaled any cpu load which is not this_cpu,
>> so a bit conservation is to use origin imbalance_pct, not half of it.
>
> The net result is almost the same if the groups have two cpus each.
> Stronger for groups with one cpu, and weaker for groups with more than
> two cpus.
>
>>>
>>>>
>>>> avg_load += load;
>>>> }
>>>> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>> }
>>>> } while (group = group->next, group != sd->groups);
>>>>
>>>> - if (!idlest || 100*this_load < imbalance*min_load)
>>>> + if (!idlest || this_load < min_load)
>>>> return NULL;
>>>> return idlest;
>>>> }
>>>> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>>
>>>> nr_running = rq->nr_running;
>>>>
>>>> - /* Bias balancing toward cpus of our domain */
>>>> - if (local_group)
>>>> - load = target_load(i);
>>>> + /* Bias balancing toward dst cpu */
>>>> + if (env->dst_cpu == i)
>>>> + load = target_load(i, env->sd->imbalance_pct);
>>>
>>> Here you do the same group load bias change as above.
>>>
>
> I don't think it is right to change the bias here to only apply to the
> dst_cpu. rebalance_domains()/load_balance() is balancing groups not just
> pulling task to dst_cpu. It must pull enough tasks to balance the
> groups, even if it means temporarily overloading dst_cpu. The bias
> should therefore apply to the whole group.
I have a different understanding of load_balance(). If the dst_cpu is
overloaded, other cpus need to move some load back. That causes more
expensive task context switches.
>
>>>> else
>>>> load = source_load(i);
>>>>
>>>> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>>> if ((local->idle_cpus < busiest->idle_cpus) &&
>>>> busiest->sum_nr_running <= busiest->group_weight)
>>>> goto out_balanced;
>>>> - } else {
>>>> - /*
>>>> - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
>>>> - * imbalance_pct to be conservative.
>>>> - */
>>>> - if (100 * busiest->avg_load <=
>>>> - env->sd->imbalance_pct * local->avg_load)
>>>> - goto out_balanced;
>>>> }
>>>>
>>>> force_balance:
>>>
>>> As said my previous replies to this series, I think this problem should
>>> be solved by fixing the cause of the problem, that is the cpu_load
>>> calculation, instead of biasing the cpu_load where-ever it is used to
>>> hide the problem.
>>>
>>> Doing a bit of git archaeology reveals that the cpu_load code goes back
>>> to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
>>> patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
>>> my opinion that made logical sense. If we are about to change to *_idx=0
>>> we are removing the main idea behind that code and there needs to be a
>>> new one. Otherwise, cpu_load doesn't make sense.
>>
>> The commit is written in 2005. The CFS scheduler merged into kernel in
>> Oct 2007. it is a too old legacy for us ...
>
> And that is why I think cpu_load should be reconsidered. It doesn't make
> sense to cripple the cpu_load code without fully understanding what it
> is doing and then try to hide the problems it causes.
I understand your concern, Morten.
But if no one can explain well what this code is for, it means it has run
out of control and should be out of the kernel. On the contrary, the new
code has better performance on both arm and x86 -- I'd like to take
fengguang's testing results as evidence, since task migrations and
context switches are reduced.
So, considering the above reasons, I don't see how to reject better
performing code just to keep the unclean legacy.
>
> The cpu_load code might be older than CFS but in my opinion it still
> makes sense. Changing cpu_load to be a snapshot of weighted_cpuload()
> does not.
>
> Morten
>
--
Thanks
Alex
On Mon, Jan 06, 2014 at 01:35:39PM +0000, Alex Shi wrote:
> On 01/03/2014 12:04 AM, Morten Rasmussen wrote:
> > On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
> >>
> >>>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
> >>>> From: Alex Shi <[email protected]>
> >>>> Date: Sat, 23 Nov 2013 23:18:09 +0800
> >>>> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> >>>>
> >>>> Task migration happens when target just a bit less then source cpu load.
> >>>> To reduce such situation happens, aggravate the target cpu load with
> >>>> sd->imbalance_pct/100 in wake_affine.
> >>>>
> >>>> In find_idlest/busiest_group, change the aggravate to local cpu only
> >>>> from old group aggravation.
> >>>>
> >>>> on my pandaboard ES.
> >>>>
> >>>> latest kernel 527d1511310a89 + whole patchset
> >>>> hackbench -T -g 10 -f 40
> >>>> 23.25" 21.99"
> >>>> 23.16" 21.20"
> >>>> 24.24" 21.89"
> >>>> hackbench -p -g 10 -f 40
> >>>> 26.52" 21.46"
> >>>> 23.89" 22.96"
> >>>> 25.65" 22.73"
> >>>> hackbench -P -g 10 -f 40
> >>>> 20.14" 19.72"
> >>>> 19.96" 19.10"
> >>>> 21.76" 20.03"
> >>>>
> >>>> Signed-off-by: Alex Shi <[email protected]>
> >>>> ---
> >>>> kernel/sched/fair.c | 35 ++++++++++++++++-------------------
> >>>> 1 file changed, 16 insertions(+), 19 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index bccdd89..3623ba4 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> >>>>
> >>>> static unsigned long weighted_cpuload(const int cpu);
> >>>> static unsigned long source_load(int cpu);
> >>>> -static unsigned long target_load(int cpu);
> >>>> +static unsigned long target_load(int cpu, int imbalance_pct);
> >>>> static unsigned long power_of(int cpu);
> >>>> static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> >>>>
> >>>> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
> >>>> * Return a high guess at the load of a migration-target cpu weighted
> >>>> * according to the scheduling class and "nice" value.
> >>>> */
> >>>> -static unsigned long target_load(int cpu)
> >>>> +static unsigned long target_load(int cpu, int imbalance_pct)
> >>>> {
> >>>> struct rq *rq = cpu_rq(cpu);
> >>>> unsigned long total = weighted_cpuload(cpu);
> >>>>
> >>>> + /*
> >>>> + * without cpu_load decay, in most of time cpu_load is same as total
> >>>> + * so we need to make target a bit heavier to reduce task migration
> >>>> + */
> >>>> + total = total * imbalance_pct / 100;
> >>>> +
> >>>> if (!sched_feat(LB_BIAS))
> >>>> return total;
> >>>>
> >>>> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>>> this_cpu = smp_processor_id();
> >>>> prev_cpu = task_cpu(p);
> >>>> load = source_load(prev_cpu);
> >>>> - this_load = target_load(this_cpu);
> >>>> + this_load = target_load(this_cpu, 100);
> >>>>
> >>>> /*
> >>>> * If sync wakeup then subtract the (maximum possible)
> >>>> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>>>
> >>>> if (balanced ||
> >>>> (this_load <= load &&
> >>>> - this_load + target_load(prev_cpu) <= tl_per_task)) {
> >>>> + this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
> >>>> /*
> >>>> * This domain has SD_WAKE_AFFINE and
> >>>> * p is cache cold in this domain, and
> >>>> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>> {
> >>>> struct sched_group *idlest = NULL, *group = sd->groups;
> >>>> unsigned long min_load = ULONG_MAX, this_load = 0;
> >>>> - int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >>>>
> >>>> do {
> >>>> unsigned long load, avg_load;
> >>>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>
> >>>> for_each_cpu(i, sched_group_cpus(group)) {
> >>>> /* Bias balancing toward cpus of our domain */
> >>>> - if (local_group)
> >>>> + if (i == this_cpu)
> >>>
> >>> What is the motivation for changing the local_group load calculation?
> >>> Now the load contributions of all cpus in the local group, except
> >>> this_cpu, will contribute more as their contribution (this_load) is
> >>> determined using target_load() instead.
> >>
> >> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
> >> 2 cores(guess no HT at that time) in cpu socket. With the cores number
> >
> > NUMA support was already present. I guess that means support for systems
> > with significantly more than two cpus.
>
> Thanks a lot for comments, Morten!
> the find_idlest_group used in select_task_rq_fair, that designed to
> select cpu near by old cpu whenever for wake or fork tasks. So in common
> case, the selected cpu is rarely on another NUMA.
Yes. My point was just that there was support for more than two cpus
when this code was introduced, and groups could therefore have more than
two cpus in them.
> >
> >> increasing trend, the sched_group become large and large, to give whole
> >> group this bias value is becoming non-sense. So it looks reasonable to
> >> just bias this cpu only.
> >
> > I think that is question whether you want to bias the whole group or
> > not at wake-up balancing. If you don't change the bias weight and only
> > apply it to this_cpu, you will be more prone to send the waking task
> > somewhere else.
>
> We don't need to follow the exactly old logical. Guess people will be
> more happier to see the code/logical simple with better benchmark
> performance. :)
I'm not against changes, especially if they make things simpler or make
more sense. But IMHO, that is not the same as just removing code and
showing benchmark improvements. There must be an explanation of why the
new code works better and the resulting code needs to make sense.
> >
> >>>
> >>> If I'm not mistaken, that will lead to more frequent load balancing as
> >>> the local_group bias has been reduced. That is the opposite of your
> >>> intentions based on your comment in target_load().
> >>>
> >>>> load = source_load(i);
> >>>> else
> >>>> - load = target_load(i);
> >>>> + load = target_load(i, sd->imbalance_pct);
> >>>
> >>> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> >>> that you removed above. sd->imbalance_pct may have been arbitrarily
> >>> chosen in the past, but changing it may affect behavior.
> >>
> >> the first commit with this line:
> >> int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >> is 147cbb4bbe99, that also has no explanation for this imbalance value.
> >
> > 100 + (sd->imbalance_pct-100)/2 is used several places in sched.c before
> > 147cbb4bbe99. It seems that it was used to encourage balancing in
> > certain situations such as wake-up, while sd->imbalance_pct was used
> > unmodified in other situations.
> >
> > That seems to still be the case in v3.13-rc6. find_busiest_group() and
> > task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
> > wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
> > 100)/2. The pattern seems to be to use the reduced imbalance_pct for
> > wake-up and the full imbalance_pct for balancing of runnable tasks.
> >
> > AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
> > bias towards the local group. However, I would not be comfortable
> > changing it without understanding the full implications.
> >
> >> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
> >> reason to use half of imbalance_pct.
> >
> > Based on the comment removed in 147cbb4bbe99 it seems that the reason is
> > to encourage balancing at wake-up. It doesn't seem to have any relation
> > to the number of cpus.
> >
> >> but sched_domain is used widely for many kinds of platform/cpu type, the
> >> value is arbitrary too.
> >> Another reason to use it, I scaled any cpu load which is not this_cpu,
> >> so a bit conservation is to use origin imbalance_pct, not half of it.
> >
> > The net result is almost the same if the groups have two cpus each.
> > Stronger for groups with one cpu, and weaker for groups with more than
> > two cpus.
> >
> >>>
> >>>>
> >>>> avg_load += load;
> >>>> }
> >>>> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>> }
> >>>> } while (group = group->next, group != sd->groups);
> >>>>
> >>>> - if (!idlest || 100*this_load < imbalance*min_load)
> >>>> + if (!idlest || this_load < min_load)
> >>>> return NULL;
> >>>> return idlest;
> >>>> }
> >>>> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>>>
> >>>> nr_running = rq->nr_running;
> >>>>
> >>>> - /* Bias balancing toward cpus of our domain */
> >>>> - if (local_group)
> >>>> - load = target_load(i);
> >>>> + /* Bias balancing toward dst cpu */
> >>>> + if (env->dst_cpu == i)
> >>>> + load = target_load(i, env->sd->imbalance_pct);
> >>>
> >>> Here you do the same group load bias change as above.
> >>>
> >
> > I don't think it is right to change the bias here to only apply to the
> > dst_cpu. rebalance_domains()/load_balance() is balancing groups not just
> > pulling task to dst_cpu. It must pull enough tasks to balance the
> > groups, even if it means temporarily overloading dst_cpu. The bias
> > should therefore apply to the whole group.
>
> I have different understanding for load_balance(). It the dst_cpu
> overloaded, other cpu need to move back some load. That cause more
> expensive task CS.
My understanding is that should_we_balance() decides which cpu is
eligible for doing the load balancing for a given domain (and the
domains above). That is, only one cpu in a group is allowed to load
balance between the local group and other groups. That cpu would
therefore be responsible for pulling enough load that the groups are
balanced even if it means temporarily overloading itself. The other cpus
in the group will take care of load balancing the extra load within the
local group later.
If the balance cpu didn't overload itself it wouldn't always be able to
balance the groups, especially for large groups.
If my understanding is right, I think it is wrong to not bias the entire
target group.
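As a rough kernel-style sketch of how I read it (invented helper name,
kernel context assumed; not the actual should_we_balance()
implementation):

static int is_group_balance_cpu(const struct cpumask *group_mask, int this_cpu)
{
	int cpu, balance_cpu = -1;

	/* pick the first idle cpu of the group, or the first cpu if none */
	for_each_cpu(cpu, group_mask) {
		if (idle_cpu(cpu)) {
			balance_cpu = cpu;
			break;
		}
	}
	if (balance_cpu == -1)
		balance_cpu = cpumask_first(group_mask);

	/* only the chosen cpu balances against other groups */
	return balance_cpu == this_cpu;
}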
> >
> >>>> else
> >>>> load = source_load(i);
> >>>>
> >>>> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >>>> if ((local->idle_cpus < busiest->idle_cpus) &&
> >>>> busiest->sum_nr_running <= busiest->group_weight)
> >>>> goto out_balanced;
> >>>> - } else {
> >>>> - /*
> >>>> - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> >>>> - * imbalance_pct to be conservative.
> >>>> - */
> >>>> - if (100 * busiest->avg_load <=
> >>>> - env->sd->imbalance_pct * local->avg_load)
> >>>> - goto out_balanced;
> >>>> }
> >>>>
> >>>> force_balance:
> >>>
> >>> As said my previous replies to this series, I think this problem should
> >>> be solved by fixing the cause of the problem, that is the cpu_load
> >>> calculation, instead of biasing the cpu_load where-ever it is used to
> >>> hide the problem.
> >>>
> >>> Doing a bit of git archaeology reveals that the cpu_load code goes back
> >>> to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
> >>> patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
> >>> my opinion that made logical sense. If we are about to change to *_idx=0
> >>> we are removing the main idea behind that code and there needs to be a
> >>> new one. Otherwise, cpu_load doesn't make sense.
> >>
> >> The commit is written in 2005. The CFS scheduler merged into kernel in
> >> Oct 2007. it is a too old legacy for us ...
> >
> > And that is why I think cpu_load should be reconsidered. It doesn't make
> > sense to cripple the cpu_load code without fully understanding what it
> > is doing and then try to hide the problems it causes.
>
> I understand your concern, Morten.
> But if no one explain well what these code for, it means it run out of
> control, and should be out of kernel.
Fully agree. That is why I think all the users of cpu_load must be
considered as well. It doesn't make sense to rip out most of the cpu_load
code and change its behaviour without considering the code that uses
cpu_load.
I like the idea of cleaning up the cpu_load stuff, but it must be
replaced with something that makes sense. Otherwise, we just postpone the
problems and carry around unexplainable code until they surface.
> On the contrary, the new code has
> the better performance both on arm and x86 -- I'd like take fengguang's
> testing result as advantages, since the task migration and CS reduced.
> So, consider above reasons, I don't know how to refuse a better
> performance code, but keep the unclean legacy.
I may have missed something, but I don't understand the reason for the
performance improvements that you are reporting. I see better numbers
for a few benchmarks, but I still don't understand why the code makes
sense after the cleanup. If we don't understand why it works, we cannot
be sure that it doesn't harm other benchmarks. There is always a chance
that we miss something but, IMHO, not having any idea to begin with
increases the chances for problems later significantly. So why not get
to the bottom of the problem of cleaning up cpu_load?
Have you done more extensive benchmarking? Have you seen any regressions
in other benchmarks?
Morten
On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> My understanding is that should_we_balance() decides which cpu is
> eligible for doing the load balancing for a given domain (and the
> domains above). That is, only one cpu in a group is allowed to load
> balance between the local group and other groups. That cpu would
> therefore be reponsible for pulling enough load that the groups are
> balanced even if it means temporarily overloading itself. The other cpus
> in the group will take care of load balancing the extra load within the
> local group later.
Correct.
> I may have missed something, but I don't understand the reason for the
> performance improvements that you are reporting. I see better numbers
> for a few benchmarks, but I still don't understand why the code makes
> sense after the cleanup. If we don't understand why it works, we cannot
> be sure that it doesn't harm other benchmarks. There is always a chance
> that we miss something but, IMHO, not having any idea to begin with
> increases the chances for problems later significantly. So why not get
> to the bottom of the problem of cleaning up cpu_load?
>
> Have you done more extensive benchmarking? Have you seen any regressions
> in other benchmarks?
I only remember hackbench numbers, and that generally fares well with a
more aggressive balancer: since it has no actual work to speak of, the
migration penalty is very low, and because there's a metric ton of tasks,
the aggressive leveling makes for more coherent 'throughput'.
On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> > My understanding is that should_we_balance() decides which cpu is
> > eligible for doing the load balancing for a given domain (and the
> > domains above). That is, only one cpu in a group is allowed to load
> > balance between the local group and other groups. That cpu would
> > therefore be reponsible for pulling enough load that the groups are
> > balanced even if it means temporarily overloading itself. The other cpus
> > in the group will take care of load balancing the extra load within the
> > local group later.
>
> Correct.
On that; one of the things I wanted to (and previously did attempt but
failed) is trying to rotate this cpu. Currently it's always the first cpu
(of the group), and that gives a noticeable bias.
If we could slowly rotate the cpu that does this, that would alleviate
both the load and cost bias.
One thing I was thinking of is keeping a global counter; maybe
'x := jiffies >> n'
would be good enough, and then using the 'x % nr_cpus_in_group'-th cpu
instead.
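Roughly something like this, purely illustrative (invented name, kernel
context assumed, n picked arbitrarily):

static int rotating_balance_cpu(const struct cpumask *group_mask)
{
	unsigned int nr = cpumask_weight(group_mask);
	unsigned int i = 0, x;
	int cpu;

	if (!nr)
		return -1;		/* empty group, shouldn't happen */

	/* slowly advancing index; with n = 10 it changes every 1024 jiffies */
	x = (jiffies >> 10) % nr;

	for_each_cpu(cpu, group_mask)
		if (i++ == x)
			return cpu;

	return cpumask_first(group_mask);
}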
Then again, these are micro issues and not a lot of people complain
about this.
On 7 January 2014 14:15, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
>> > My understanding is that should_we_balance() decides which cpu is
>> > eligible for doing the load balancing for a given domain (and the
>> > domains above). That is, only one cpu in a group is allowed to load
>> > balance between the local group and other groups. That cpu would
>> > therefore be reponsible for pulling enough load that the groups are
>> > balanced even if it means temporarily overloading itself. The other cpus
>> > in the group will take care of load balancing the extra load within the
>> > local group later.
>>
>> Correct.
>
> On that; one of the things I wanted to (and previously did attempt but
> failed) is trying to rotate this cpu. Currently its always the first cpu
> (of the group) and that gives a noticeable bias.
Isn't the current policy (the 1st idle cpu has priority) a good
enough way to rotate the cpus? Do you need the rotation for the loaded
use case too?
>
> If we could slowly rotate the cpu that does this that would alleviate
> both the load and cost bias.
>
> One thing I was thinking of is keeping a global counter maybe:
> 'x := jiffies >> n'
> might be good enough and using the 'x % nr_cpus_in_group'-th cpu
> instead.
>
> Then again, these are micro issue and not a lot of people complain
> about this.
On Tue, Jan 07, 2014 at 02:32:07PM +0100, Vincent Guittot wrote:
> On 7 January 2014 14:15, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> >> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> >> > My understanding is that should_we_balance() decides which cpu is
> >> > eligible for doing the load balancing for a given domain (and the
> >> > domains above). That is, only one cpu in a group is allowed to load
> >> > balance between the local group and other groups. That cpu would
> >> > therefore be reponsible for pulling enough load that the groups are
> >> > balanced even if it means temporarily overloading itself. The other cpus
> >> > in the group will take care of load balancing the extra load within the
> >> > local group later.
> >>
> >> Correct.
> >
> > On that; one of the things I wanted to (and previously did attempt but
> > failed) is trying to rotate this cpu. Currently its always the first cpu
> > (of the group) and that gives a noticeable bias.
>
> Isn't the current policy (it's the 1st idle cpu in priority). a good
> enough way to rotate the cpus ? Are you need the rotation for loaded
> use case too ?
Yeah, it's for the fully loaded case. And like I said, there have not been
many complaints about this.
The 'problem' is that it's always the same cpu that does the most expensive
full machine balance; and always that cpu that is the one that gains the
extra load to redistribute in the group. So it's penalized twice.
Like I said, really minor issue. Just something I thought I'd mention.
On Tue, Jan 07, 2014 at 01:15:23PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> > > My understanding is that should_we_balance() decides which cpu is
> > > eligible for doing the load balancing for a given domain (and the
> > > domains above). That is, only one cpu in a group is allowed to load
> > > balance between the local group and other groups. That cpu would
> > > therefore be reponsible for pulling enough load that the groups are
> > > balanced even if it means temporarily overloading itself. The other cpus
> > > in the group will take care of load balancing the extra load within the
> > > local group later.
> >
> > Correct.
>
> On that; one of the things I wanted to (and previously did attempt but
> failed) is trying to rotate this cpu. Currently its always the first cpu
> (of the group) and that gives a noticeable bias.
>
> If we could slowly rotate the cpu that does this that would alleviate
> both the load and cost bias.
From a load perspective, wouldn't it be better to pick the least loaded
cpu in the group? It is not cheap to implement, but in theory it should
give less balancing within the group later and less unfairness until it
happens.
Rotating the cpu is probably good enough for most cases and certainly
easier to implement.
>
> One thing I was thinking of is keeping a global counter maybe:
> 'x := jiffies >> n'
> might be good enough and using the 'x % nr_cpus_in_group'-th cpu
> instead.
>
> Then again, these are micro issue and not a lot of people complain
> about this.
The bias continues after the first round of load balancing by the other
cpus?
Pulling everything to one cpu is not ideal from a performance point of
view. You lose some available cpu cycles until the balance settles.
However, it is not easy to do better and maintain scalability at the
same time.
On Tue, Jan 07, 2014 at 03:16:32PM +0000, Morten Rasmussen wrote:
> From a load perspective wouldn't it be better to pick the least loaded
> cpu in the group? It is not cheap to implement, but in theory it should
> give less balancing within the group later an less unfairness until it
> happens.
I tried that; see 04f733b4afac5dc93ae9b0a8703c60b87def491e for why it
doesn't work.
> Rotating the cpu is probably good enough for most cases and certainly
> easier to implement.
Indeed.
> The bias continues after they first round of load balance by the other
> cpus?
The cost, yes. Even when perfectly balanced, we still get to iterate the
entire machine computing s[gd]_lb_stats to find out we're good and don't
need to move tasks about.
> Pulling everything to one cpu is not ideal from a performance point of
> view. You loose some available cpu cycles until the balance settles.
> However, it is not easy to do better and maintain scalability at the
> same time.
Right, it's part of the cost we pay for scaling better. And rotating this
cost around a bit would alleviate the obvious bias.
On 01/07/2014 08:59 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
>> My understanding is that should_we_balance() decides which cpu is
>> eligible for doing the load balancing for a given domain (and the
>> domains above). That is, only one cpu in a group is allowed to load
>> balance between the local group and other groups. That cpu would
>> therefore be reponsible for pulling enough load that the groups are
>> balanced even if it means temporarily overloading itself. The other cpus
>> in the group will take care of load balancing the extra load within the
>> local group later.
Thanks to both of you for the comments and explanations! :)

I know this patch's change is arguable and my attempt isn't tuned well
yet. But I believe I am heading in the right direction. :) Let me explain
this patch a bit more.

First, cpu_load already includes history load info, so repeatedly
decaying and reusing that history is kind of nonsense. And the old
source/target_load picks either the history load or the current load
just according to min/max, which also lacks a good explanation.

Second, we already apply the bias in source/target_load, but then still
use imbalance_pct as a last check when finding the idlest/busiest group.
That is redundant work. If we account for the bias in source/target_load,
we'd better not apply imbalance_pct again.

And last, imbalance_pct is over-applied as core counts grow quickly.
Take find_busiest_group:
Assume a domain with 2 groups, each group with 8 cores.
The target group will be biased by 8 * (imbalance_pct - 100)
	= 8 * (125 - 100) = 200.
Since each cpu is biased by 0.25 times its load, 8 cpus together bias the
group by 2 times the average cpu load. That is too much. But with only
2 cores per group (the common case when the code was introduced), the
bias is just 2 * 25 / 100 = 0.5 times the average cpu load.

Now this patchset removes the cpu_load array to avoid the repeated
history decay, reorganizes the imbalance_pct usage to avoid redundant
balance bias, and reduces the bias value between cpu groups -- maybe it
isn't tuned well yet. :)
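As a tiny standalone illustration of those numbers (invented helper,
not part of the patch):

/* extra load added to a group, in percent of the average per-cpu load,
 * when every cpu in it is scaled by imbalance_pct */
static unsigned int group_bias_pct(unsigned int nr_cpus,
				   unsigned int imbalance_pct)
{
	return nr_cpus * (imbalance_pct - 100);
}

/* group_bias_pct(8, 125) == 200 -> 2.0x the average cpu load
 * group_bias_pct(2, 125) ==  50 -> 0.5x the average cpu load
 * group_bias_pct(1, 125) ==  25 -> bias one cpu only, as this patch does */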
>
> Correct.
>
>> I may have missed something, but I don't understand the reason for the
>> performance improvements that you are reporting. I see better numbers
>> for a few benchmarks, but I still don't understand why the code makes
>> sense after the cleanup. If we don't understand why it works, we cannot
>> be sure that it doesn't harm other benchmarks. There is always a chance
>> that we miss something but, IMHO, not having any idea to begin with
>> increases the chances for problems later significantly. So why not get
>> to the bottom of the problem of cleaning up cpu_load?
>>
>> Have you done more extensive benchmarking? Have you seen any regressions
>> in other benchmarks?
>
> I only remember hackbench numbers and that generally fares well with a
> more aggressive balancer since it has no actual work to speak of the
> migration penalty is very low and because there's a metric ton of tasks
> the aggressive leveling makes for more coherent 'throughput'.
I just tested hackbench on arm, and with more test runs plus a rebase to
3.13-rc6 the variation increased, so the benefit becomes unclear. Anyway,
still no regression found in either perf-stat cpu-migration counts or real
execution time.

The 0day performance testing covered kbuild, hackbench, aim7, dbench,
tbench, sysbench, netperf, etc. No regression found.

The 0day performance testing also caught a cpu-migration reduction on a
kvm guest:
https://lkml.org/lkml/2013/12/21/135

And another benchmark benefits from the old patchset, as the 0day
performance testing results show:
https://lkml.org/lkml/2013/12/4/102
Hi Alex,
We observed a 150% performance gain with the vm-scalability/300s-mmap-pread-seq
test case with this patch applied. Here is a list of changes we got so far:
testbox : brickland
testcase: vm-scalability/300s-mmap-pread-seq
f1b6442c7dd12802e622 d70495ef86f397816d73
(parent commit) (this commit)
------------------------ ------------------------
26393249.80 +150.9% 66223933.60 vm-scalability.throughput
225.12 -49.9% 112.75 time.elapsed_time
36333.40 -90.7% 3392.20 vmstat.system.cs
2.40 +375.0% 11.40 vmstat.cpu.id
3770081.60 -97.7% 87673.40 time.major_page_faults
3975276.20 -97.0% 117409.60 time.voluntary_context_switches
3.05 +301.7% 12.24 iostat.cpu.idle
21118.41 -70.3% 6277.19 time.system_time
18.40 +130.4% 42.40 vmstat.cpu.us
77.00 -41.3% 45.20 vmstat.cpu.sy
47459.60 -31.3% 32592.20 vmstat.system.in
82435.40 -12.1% 72443.60 time.involuntary_context_switches
5128.13 +14.0% 5848.30 time.user_time
11656.20 -7.8% 10745.60 time.percent_of_cpu_this_job_got
1069997484.80 +0.3% 1073679919.00 time.minor_page_faults
Btw, the latest patchset includes more cleanup:
[email protected]:alexshi/power-scheduling.git noload
I guess fengguang's 0day performance testing is running on it.
--
Thanks
Alex