2021-10-09 18:12:44

by Tao Zhou

[permalink] [raw]
Subject: [PATCH] sched/fair: Use this_sd->weight to calculate span_avg

avg_idle, avg_cost got from this_rq and this_sd. I think
use this_sd->weight to calculate and estimate the number
of loop cpus in the target domain.
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6a05d9b5443..7fab7b70814c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6300,7 +6300,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
avg_idle = this_rq->wake_avg_idle;
avg_cost = this_sd->avg_scan_cost + 1;

- span_avg = sd->span_weight * avg_idle;
+ span_avg = this_sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
--
2.32.0


2021-10-11 16:40:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Use this_sd->weight to calculate span_avg

On Sun, 10 Oct 2021 02:10:55 +0800
Tao Zhou <[email protected]> wrote:

> avg_idle, avg_cost got from this_rq and this_sd. I think
> use this_sd->weight to calculate and estimate the number
> of loop cpus in the target domain.

If that's the case, then shouldn't the CPUs to be checked come from this_sd
as well? I mean, at the beginning of the function we have:

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
return -1;

cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

Where "cpus" comes from sd, and not this_sd.


> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f6a05d9b5443..7fab7b70814c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6300,7 +6300,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> avg_idle = this_rq->wake_avg_idle;
> avg_cost = this_sd->avg_scan_cost + 1;
>
> - span_avg = sd->span_weight * avg_idle;
> + span_avg = this_sd->span_weight * avg_idle;
> if (span_avg > 4*avg_cost)
> nr = div_u64(span_avg, avg_cost);
> else


And after this code, the nr that is determined from the above, is for
limiting the looping over those CPUs from sd, not this_sd:

for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned int)i < nr_cpumask_bits)
return i;

} else {
if (!--nr)
return -1;
idle_cpu = __select_idle_cpu(cpu, p);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
}
}

I'm guessing there's nothing wrong here. But, I don't fully understand it
myself.

-- Steve

2021-10-11 16:49:11

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Use this_sd->weight to calculate span_avg

Hi Steven,

On Mon, Oct 11, 2021 at 10:58:02AM -0400, Steven Rostedt wrote:
> On Sun, 10 Oct 2021 02:10:55 +0800
> Tao Zhou <[email protected]> wrote:
>
> > avg_idle, avg_cost got from this_rq and this_sd. I think
> > use this_sd->weight to calculate and estimate the number
> > of loop cpus in the target domain.
>
> If that's the case, then shouldn't the CPUs to be checked come from this_sd
> as well? I mean, at the beginning of the function we have:
>
> this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> if (!this_sd)
> return -1;
>
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> Where "cpus" comes from sd, and not this_sd.

Thank you for reply.


Cycles are spent on this CPU(and in this Domain) if I am not wrong.
If T1(on this CPU) want to try to wake up T2 on another CPU, Kernel
(on this CPU) should evaluate that not spend much time in finding
idle siblings. And the avg_idle and span_avg are two fators to compare
with. avg_idle is the CPU average idle time of this rq and avg_cost
is the last time this Domain was being looped and the recorded avg
cost time. Two values are history averaged. So, use the history to
evaluate future that we do *now* the another Domain cpu search to not
let this CPU(and in this Domain) too busy. This is what I want to say.
Not sure yet.

> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f6a05d9b5443..7fab7b70814c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6300,7 +6300,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > avg_idle = this_rq->wake_avg_idle;
> > avg_cost = this_sd->avg_scan_cost + 1;
> >
> > - span_avg = sd->span_weight * avg_idle;
> > + span_avg = this_sd->span_weight * avg_idle;
> > if (span_avg > 4*avg_cost)
> > nr = div_u64(span_avg, avg_cost);
> > else
>
>
> And after this code, the nr that is determined from the above, is for
> limiting the looping over those CPUs from sd, not this_sd:
>
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> if ((unsigned int)i < nr_cpumask_bits)
> return i;
>
> } else {
> if (!--nr)
> return -1;
> idle_cpu = __select_idle_cpu(cpu, p);
> if ((unsigned int)idle_cpu < nr_cpumask_bits)
> break;
> }
> }
>
> I'm guessing there's nothing wrong here. But, I don't fully understand it
> myself.

I replied to Barry and not say that I missed that AND operation.

Here is another go.

@cpus is per-cpu and irq disable. I am not sure irq disable can
be preempt in in RT. If that is possiable, cpu_ptr is not safe.
Based on the code and comments in __do_set_cpus_allowed(), he will
not use ->pi_lock to change affinity(eg. SCA_MIGRATE_DISABLE).

If the cpu_ptr not get changed(I must be right and wrong more 5 time)
@core is not in @cpu_ptr mask, BUT, the cpu in smt mask of @core is
the one that we use to call select_idle_core() and @core is in this
smt mask but is not allowed and we get @*idle_cpu and @idle is true
we return the @core. Another not sure.

If the @cpu_ptr can changed in eg. just before the @core loop.
It is possible that @*idle_cpu == -1 and all cpu in smt mask
of @core is seemed to be not allowed. And, we return @core.
This case is what I just look at select_idle_core() semantics
and missed that AND operation.

> -- Steve




Thanks,
Tao

2021-10-12 07:14:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Use this_sd->weight to calculate span_avg

On Mon, 11 Oct 2021 at 18:45, Tao Zhou <[email protected]> wrote:
>
> Hi Steven,
>
> On Mon, Oct 11, 2021 at 10:58:02AM -0400, Steven Rostedt wrote:
> > On Sun, 10 Oct 2021 02:10:55 +0800
> > Tao Zhou <[email protected]> wrote:
> >
> > > avg_idle, avg_cost got from this_rq and this_sd. I think
> > > use this_sd->weight to calculate and estimate the number
> > > of loop cpus in the target domain.
> >
> > If that's the case, then shouldn't the CPUs to be checked come from this_sd
> > as well? I mean, at the beginning of the function we have:
> >
> > this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> > if (!this_sd)
> > return -1;
> >
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > Where "cpus" comes from sd, and not this_sd.
>
> Thank you for reply.
>
>
> Cycles are spent on this CPU(and in this Domain) if I am not wrong.
> If T1(on this CPU) want to try to wake up T2 on another CPU, Kernel
> (on this CPU) should evaluate that not spend much time in finding
> idle siblings. And the avg_idle and span_avg are two fators to compare
> with. avg_idle is the CPU average idle time of this rq and avg_cost
> is the last time this Domain was being looped and the recorded avg
> cost time. Two values are history averaged. So, use the history to
> evaluate future that we do *now* the another Domain cpu search to not
> let this CPU(and in this Domain) too busy. This is what I want to say.
> Not sure yet.

this_rq->wake_avg_idle is used to guest estimate how much busy is this_cpu
this_sd->avg_scan_cost is used to estimate how much time was spent in
average looking at an idle cpu at this domain level on this cpu
and sd->span_weight is is used to take into account the size of the
target domain on which it is going to loop. But we don't really care
about the size of the domain of the this_cpu

>
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index f6a05d9b5443..7fab7b70814c 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6300,7 +6300,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > avg_idle = this_rq->wake_avg_idle;
> > > avg_cost = this_sd->avg_scan_cost + 1;
> > >
> > > - span_avg = sd->span_weight * avg_idle;
> > > + span_avg = this_sd->span_weight * avg_idle;
> > > if (span_avg > 4*avg_cost)
> > > nr = div_u64(span_avg, avg_cost);
> > > else
> >
> >
> > And after this code, the nr that is determined from the above, is for
> > limiting the looping over those CPUs from sd, not this_sd:
> >
> > for_each_cpu_wrap(cpu, cpus, target + 1) {
> > if (has_idle_core) {
> > i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > if ((unsigned int)i < nr_cpumask_bits)
> > return i;
> >
> > } else {
> > if (!--nr)
> > return -1;
> > idle_cpu = __select_idle_cpu(cpu, p);
> > if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > break;
> > }
> > }
> >
> > I'm guessing there's nothing wrong here. But, I don't fully understand it
> > myself.
>
> I replied to Barry and not say that I missed that AND operation.
>
> Here is another go.
>
> @cpus is per-cpu and irq disable. I am not sure irq disable can
> be preempt in in RT. If that is possiable, cpu_ptr is not safe.
> Based on the code and comments in __do_set_cpus_allowed(), he will
> not use ->pi_lock to change affinity(eg. SCA_MIGRATE_DISABLE).
>
> If the cpu_ptr not get changed(I must be right and wrong more 5 time)
> @core is not in @cpu_ptr mask, BUT, the cpu in smt mask of @core is
> the one that we use to call select_idle_core() and @core is in this
> smt mask but is not allowed and we get @*idle_cpu and @idle is true
> we return the @core. Another not sure.
>
> If the @cpu_ptr can changed in eg. just before the @core loop.
> It is possible that @*idle_cpu == -1 and all cpu in smt mask
> of @core is seemed to be not allowed. And, we return @core.
> This case is what I just look at select_idle_core() semantics
> and missed that AND operation.
>
> > -- Steve
>
>
>
>
> Thanks,
> Tao

2021-10-12 12:47:01

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Use this_sd->weight to calculate span_avg

Hi Vincent,

On Tue, Oct 12, 2021 at 09:11:23AM +0200, Vincent Guittot wrote:
> On Mon, 11 Oct 2021 at 18:45, Tao Zhou <[email protected]> wrote:
> >
> > Hi Steven,
> >
> > On Mon, Oct 11, 2021 at 10:58:02AM -0400, Steven Rostedt wrote:
> > > On Sun, 10 Oct 2021 02:10:55 +0800
> > > Tao Zhou <[email protected]> wrote:
> > >
> > > > avg_idle, avg_cost got from this_rq and this_sd. I think
> > > > use this_sd->weight to calculate and estimate the number
> > > > of loop cpus in the target domain.
> > >
> > > If that's the case, then shouldn't the CPUs to be checked come from this_sd
> > > as well? I mean, at the beginning of the function we have:
> > >
> > > this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> > > if (!this_sd)
> > > return -1;
> > >
> > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > >
> > > Where "cpus" comes from sd, and not this_sd.
> >
> > Thank you for reply.
> >
> >
> > Cycles are spent on this CPU(and in this Domain) if I am not wrong.
> > If T1(on this CPU) want to try to wake up T2 on another CPU, Kernel
> > (on this CPU) should evaluate that not spend much time in finding
> > idle siblings. And the avg_idle and span_avg are two fators to compare
> > with. avg_idle is the CPU average idle time of this rq and avg_cost
> > is the last time this Domain was being looped and the recorded avg
> > cost time. Two values are history averaged. So, use the history to
> > evaluate future that we do *now* the another Domain cpu search to not
> > let this CPU(and in this Domain) too busy. This is what I want to say.
> > Not sure yet.
>
> this_rq->wake_avg_idle is used to guest estimate how much busy is this_cpu
> this_sd->avg_scan_cost is used to estimate how much time was spent in
> average looking at an idle cpu at this domain level on this cpu
> and sd->span_weight is is used to take into account the size of the
> target domain on which it is going to loop. But we don't really care
> about the size of the domain of the this_cpu

So slow for me to get to this, we need to know the loop number base on
the target domain.

Thank you for reply.

> >
> > > > ---
> > > > kernel/sched/fair.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index f6a05d9b5443..7fab7b70814c 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6300,7 +6300,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > > avg_idle = this_rq->wake_avg_idle;
> > > > avg_cost = this_sd->avg_scan_cost + 1;
> > > >
> > > > - span_avg = sd->span_weight * avg_idle;
> > > > + span_avg = this_sd->span_weight * avg_idle;
> > > > if (span_avg > 4*avg_cost)
> > > > nr = div_u64(span_avg, avg_cost);
> > > > else
> > >
> > >
> > > And after this code, the nr that is determined from the above, is for
> > > limiting the looping over those CPUs from sd, not this_sd:
> > >
> > > for_each_cpu_wrap(cpu, cpus, target + 1) {
> > > if (has_idle_core) {
> > > i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > if ((unsigned int)i < nr_cpumask_bits)
> > > return i;
> > >
> > > } else {
> > > if (!--nr)
> > > return -1;
> > > idle_cpu = __select_idle_cpu(cpu, p);
> > > if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > > break;
> > > }
> > > }
> > >
> > > I'm guessing there's nothing wrong here. But, I don't fully understand it
> > > myself.
> >
> > I replied to Barry and not say that I missed that AND operation.
> >
> > Here is another go.
> >
> > @cpus is per-cpu and irq disable. I am not sure irq disable can
> > be preempt in in RT. If that is possiable, cpu_ptr is not safe.
> > Based on the code and comments in __do_set_cpus_allowed(), he will
> > not use ->pi_lock to change affinity(eg. SCA_MIGRATE_DISABLE).
> >
> > If the cpu_ptr not get changed(I must be right and wrong more 5 time)
> > @core is not in @cpu_ptr mask, BUT, the cpu in smt mask of @core is
> > the one that we use to call select_idle_core() and @core is in this
> > smt mask but is not allowed and we get @*idle_cpu and @idle is true
> > we return the @core. Another not sure.

Crap and wrong. Please ignore this.

> > If the @cpu_ptr can changed in eg. just before the @core loop.
> > It is possible that @*idle_cpu == -1 and all cpu in smt mask
> > of @core is seemed to be not allowed. And, we return @core.
> > This case is what I just look at select_idle_core() semantics
> > and missed that AND operation.
> >
> > > -- Steve
> >
> >
> >
> >
> > Thanks,
> > Tao




Thanks,
Tao