commit bf475ce0a3dd ("sched/fair: Add per-CPU min capacity to
sched_group_capacity") introduced per-cpu min_capacity.
commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity")
introduced per-cpu max_capacity.
sgc->capacity is the *SUM* of all CPU's capacity in the group.
sgc->{min,max}_capacity are the sg per-cpu variables. Compare with
sgc->capacity to get sgc->{min,max}_capacity makes no sense. Instead,
we should compare one by one in each iteration to get
sgc->{min,max}_capacity of the group.
Signed-off-by: Peng Liu <[email protected]>
---
kernel/sched/fair.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d170b5da0e3..97b164fcda93 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7795,6 +7795,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
for_each_cpu(cpu, sched_group_span(sdg)) {
struct sched_group_capacity *sgc;
struct rq *rq = cpu_rq(cpu);
+ unsigned long cap;
/*
* build_sched_domains() -> init_sched_groups_capacity()
@@ -7808,14 +7809,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* causing divide-by-zero issues on boot.
*/
if (unlikely(!rq->sd)) {
- capacity += capacity_of(cpu);
+ cap = capacity_of(cpu);
+ capacity += cap;
+ min_capacity = min(cap, min_capacity);
+ max_capacity = max(cap, max_capacity);
} else {
sgc = rq->sd->groups->sgc;
capacity += sgc->capacity;
+ min_capacity = min(sgc->min_capacity, min_capacity);
+ max_capacity = max(sgc->max_capacity, max_capacity);
}
-
- min_capacity = min(capacity, min_capacity);
- max_capacity = max(capacity, max_capacity);
}
} else {
/*
--
2.17.1
Hi Peng,
On 31/12/2019 03:51, Peng Liu wrote:
> commit bf475ce0a3dd ("sched/fair: Add per-CPU min capacity to
> sched_group_capacity") introduced per-cpu min_capacity.
>
> commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity")
> introduced per-cpu max_capacity.
>
> sgc->capacity is the *SUM* of all CPU's capacity in the group.
> sgc->{min,max}_capacity are the sg per-cpu variables. Compare with
> sgc->capacity to get sgc->{min,max}_capacity makes no sense. Instead,
> we should compare one by one in each iteration to get
> sgc->{min,max}_capacity of the group.
>
Worth noting this only affects the SD_OVERLAP case, the other case is fine
(I checked again just to be sure).
Now, on the bright side of things I don't think this currently causes any
harm. The {min,max}_capacity values are used in bits of code that only
gets run on topologies with asymmetric CPU µarchs (SD_ASYM_CPUCAPACITY), and
I know of no such system that is also NUMA, i.e. end up with SD_OVERLAP
(here's hoping nobody gets any funny idea).
Still, nice find!
> Signed-off-by: Peng Liu <[email protected]>
> ---
> kernel/sched/fair.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d170b5da0e3..97b164fcda93 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7795,6 +7795,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> for_each_cpu(cpu, sched_group_span(sdg)) {
> struct sched_group_capacity *sgc;
> struct rq *rq = cpu_rq(cpu);
> + unsigned long cap;
>
> /*
> * build_sched_domains() -> init_sched_groups_capacity()
> @@ -7808,14 +7809,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> * causing divide-by-zero issues on boot.
> */
> if (unlikely(!rq->sd)) {
> - capacity += capacity_of(cpu);
> + cap = capacity_of(cpu);
> + capacity += cap;
> + min_capacity = min(cap, min_capacity);
> + max_capacity = max(cap, max_capacity);
> } else {
> sgc = rq->sd->groups->sgc;
> capacity += sgc->capacity;
> + min_capacity = min(sgc->min_capacity, min_capacity);
> + max_capacity = max(sgc->max_capacity, max_capacity);
> }
> -
> - min_capacity = min(capacity, min_capacity);
> - max_capacity = max(capacity, max_capacity);
> }
> } else {
> /*
>
All that could be shortened like the below, no?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..9f6c015639ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7773,8 +7773,8 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
*/
for_each_cpu(cpu, sched_group_span(sdg)) {
- struct sched_group_capacity *sgc;
struct rq *rq = cpu_rq(cpu);
+ unsigned long cpu_cap;
/*
* build_sched_domains() -> init_sched_groups_capacity()
@@ -7787,15 +7787,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* This avoids capacity from being 0 and
* causing divide-by-zero issues on boot.
*/
- if (unlikely(!rq->sd)) {
- capacity += capacity_of(cpu);
- } else {
- sgc = rq->sd->groups->sgc;
- capacity += sgc->capacity;
- }
+ if (unlikely(!rq->sd))
+ cpu_cap = capacity_of(cpu);
+ else
+ cpu_cap = rq->sd->groups->sgc->capacity;
+
+ min_capacity = min(cpu_cap, min_capacity);
+ max_capacity = max(cpu_cap, max_capacity);
- min_capacity = min(capacity, min_capacity);
- max_capacity = max(capacity, max_capacity);
+ capacity += cpu_cap;
}
} else {
/*
On Wed, Jan 01, 2020 at 05:56:49AM +0000, Valentin Schneider wrote:
> Hi Peng,
>
> On 31/12/2019 03:51, Peng Liu wrote:
> > commit bf475ce0a3dd ("sched/fair: Add per-CPU min capacity to
> > sched_group_capacity") introduced per-cpu min_capacity.
> >
> > commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity")
> > introduced per-cpu max_capacity.
> >
> > sgc->capacity is the *SUM* of all CPU's capacity in the group.
> > sgc->{min,max}_capacity are the sg per-cpu variables. Compare with
> > sgc->capacity to get sgc->{min,max}_capacity makes no sense. Instead,
> > we should compare one by one in each iteration to get
> > sgc->{min,max}_capacity of the group.
> >
>
> Worth noting this only affects the SD_OVERLAP case, the other case is fine
> (I checked again just to be sure).
>
> Now, on the bright side of things I don't think this currently causes any
> harm. The {min,max}_capacity values are used in bits of code that only
> gets run on topologies with asymmetric CPU ?archs (SD_ASYM_CPUCAPACITY), and
> I know of no such system that is also NUMA, i.e. end up with SD_OVERLAP
> (here's hoping nobody gets any funny idea).
>
> Still, nice find!
Valentin, thanks for your time!
>
> > Signed-off-by: Peng Liu <[email protected]>
> > ---
> > kernel/sched/fair.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2d170b5da0e3..97b164fcda93 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7795,6 +7795,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> > for_each_cpu(cpu, sched_group_span(sdg)) {
> > struct sched_group_capacity *sgc;
> > struct rq *rq = cpu_rq(cpu);
> > + unsigned long cap;
> >
> > /*
> > * build_sched_domains() -> init_sched_groups_capacity()
> > @@ -7808,14 +7809,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> > * causing divide-by-zero issues on boot.
> > */
> > if (unlikely(!rq->sd)) {
> > - capacity += capacity_of(cpu);
> > + cap = capacity_of(cpu);
> > + capacity += cap;
> > + min_capacity = min(cap, min_capacity);
> > + max_capacity = max(cap, max_capacity);
> > } else {
> > sgc = rq->sd->groups->sgc;
> > capacity += sgc->capacity;
> > + min_capacity = min(sgc->min_capacity, min_capacity);
> > + max_capacity = max(sgc->max_capacity, max_capacity);
> > }
> > -
> > - min_capacity = min(capacity, min_capacity);
> > - max_capacity = max(capacity, max_capacity);
> > }
> > } else {
> > /*
> >
>
> All that could be shortened like the below, no?
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..9f6c015639ef 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7773,8 +7773,8 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> */
>
> for_each_cpu(cpu, sched_group_span(sdg)) {
> - struct sched_group_capacity *sgc;
> struct rq *rq = cpu_rq(cpu);
> + unsigned long cpu_cap;
>
> /*
> * build_sched_domains() -> init_sched_groups_capacity()
> @@ -7787,15 +7787,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> * This avoids capacity from being 0 and
> * causing divide-by-zero issues on boot.
> */
> - if (unlikely(!rq->sd)) {
> - capacity += capacity_of(cpu);
> - } else {
> - sgc = rq->sd->groups->sgc;
> - capacity += sgc->capacity;
> - }
> + if (unlikely(!rq->sd))
> + cpu_cap = capacity_of(cpu);
--------------------------------------------------------------
> + else
> + cpu_cap = rq->sd->groups->sgc->capacity;
sgc->capacity is the *sum* of all CPU's capacity in that group, right?
{min,max}_capacity are per CPU variables(*part* of a group). So we can't
compare *part* to *sum*. Am I overlooking something? Thanks.
> +
> + min_capacity = min(cpu_cap, min_capacity);
> + max_capacity = max(cpu_cap, max_capacity);
>
> - min_capacity = min(capacity, min_capacity);
> - max_capacity = max(capacity, max_capacity);
> + capacity += cpu_cap;
> }
> } else {
> /*
On 01/01/2020 14:13, Peng Liu wrote:
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 08a233e97a01..9f6c015639ef 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7773,8 +7773,8 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
>> */
>>
>> for_each_cpu(cpu, sched_group_span(sdg)) {
>> - struct sched_group_capacity *sgc;
>> struct rq *rq = cpu_rq(cpu);
>> + unsigned long cpu_cap;
>>
>> /*
>> * build_sched_domains() -> init_sched_groups_capacity()
>> @@ -7787,15 +7787,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
>> * This avoids capacity from being 0 and
>> * causing divide-by-zero issues on boot.
>> */
>> - if (unlikely(!rq->sd)) {
>> - capacity += capacity_of(cpu);
>> - } else {
>> - sgc = rq->sd->groups->sgc;
>> - capacity += sgc->capacity;
>> - }
>> + if (unlikely(!rq->sd))
>> + cpu_cap = capacity_of(cpu);
>
> --------------------------------------------------------------
>> + else
>> + cpu_cap = rq->sd->groups->sgc->capacity;
>
> sgc->capacity is the *sum* of all CPU's capacity in that group, right?
Right
> {min,max}_capacity are per CPU variables(*part* of a group). So we can't
> compare *part* to *sum*. Am I overlooking something? Thanks.
>
AIUI rq->sd->groups->sgc->capacity should be the capacity of the rq's CPU
(IOW the groups here should be made of single CPUs).
This should be true regardless of overlapping domains, since they sit on top
of the regular domains. Let me paint an example with a simple 2-core SMT2
system:
MC [ ]
SMT [ ][ ]
0 1 2 3
cpu_rq(0)->sd will point to the sched_domain of CPU0 at SMT level (it is the
"base domain", IOW the lowest domain in the topology hierarchy). Its groups
will be:
{0} ----> {1}
^ /
`-----'
and sd->groups will point at the group spanning the "local" CPU, in our case
CPU0, and thus here will be a group containing only CPU0.
I do not know why sched_group_capacity is used here however. As I understand
things, we could use cpu_capacity() unconditionally.
>> +
>> + min_capacity = min(cpu_cap, min_capacity);
>> + max_capacity = max(cpu_cap, max_capacity);
>>
>> - min_capacity = min(capacity, min_capacity);
>> - max_capacity = max(capacity, max_capacity);
>> + capacity += cpu_cap;
>> }
>> } else {
>> /*
On Wed, Jan 01, 2020 at 06:55:48PM +0000, Valentin Schneider wrote:
> On 01/01/2020 14:13, Peng Liu wrote:
> >> ---
> > --------------------------------------------------------------
> >> + else
> >> + cpu_cap = rq->sd->groups->sgc->capacity;
> >
> > sgc->capacity is the *sum* of all CPU's capacity in that group, right?
>
> Right
>
> > {min,max}_capacity are per CPU variables(*part* of a group). So we can't
> > compare *part* to *sum*. Am I overlooking something? Thanks.
> >
>
> AIUI rq->sd->groups->sgc->capacity should be the capacity of the rq's CPU
> (IOW the groups here should be made of single CPUs).
>
> This should be true regardless of overlapping domains, since they sit on top
> of the regular domains. Let me paint an example with a simple 2-core SMT2
> system:
>
> MC [ ]
> SMT [ ][ ]
> 0 1 2 3
>
> cpu_rq(0)->sd will point to the sched_domain of CPU0 at SMT level (it is the
> "base domain", IOW the lowest domain in the topology hierarchy). Its groups
> will be:
>
> {0} ----> {1}
> ^ /
> `-----'
>
> and sd->groups will point at the group spanning the "local" CPU, in our case
> CPU0, and thus here will be a group containing only CPU0.
>
>
> I do not know why sched_group_capacity is used here however. As I understand
> things, we could use cpu_capacity() unconditionally.
Valentin, sorry for my tardy reply, but there are always so many things
to do.
Thanks for your patient explanation, and the picture is intuitive
and clear. Indeed, the group in lowest domain only contains one CPU, and
the only CPU in the first group should be the rq's CPU. So, I wonder if
we can do like this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d170b5da0e3..c9d7613c74d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7793,9 +7793,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
*/
for_each_cpu(cpu, sched_group_span(sdg)) {
- struct sched_group_capacity *sgc;
- struct rq *rq = cpu_rq(cpu);
-
/*
* build_sched_domains() -> init_sched_groups_capacity()
* gets here before we've attached the domains to the
@@ -7807,15 +7804,11 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* This avoids capacity from being 0 and
* causing divide-by-zero issues on boot.
*/
- if (unlikely(!rq->sd)) {
- capacity += capacity_of(cpu);
- } else {
- sgc = rq->sd->groups->sgc;
- capacity += sgc->capacity;
- }
+ unsigned long cpu_cap = capacity_of(cpu);
- min_capacity = min(capacity, min_capacity);
- max_capacity = max(capacity, max_capacity);
+ min_capacity = min(cpu_cap, min_capacity);
+ max_capacity = max(cpu_cap, max_capacity);
+ capacity += cpu_cap;
}
} else {
/*
On 03/01/2020 14:21, Peng Liu wrote:
> Thanks for your patient explanation, and the picture is intuitive
> and clear. Indeed, the group in lowest domain only contains one CPU, and
> the only CPU in the first group should be the rq's CPU. So, I wonder if
> we can do like this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d170b5da0e3..c9d7613c74d2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7793,9 +7793,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> */
>
> for_each_cpu(cpu, sched_group_span(sdg)) {
> - struct sched_group_capacity *sgc;
> - struct rq *rq = cpu_rq(cpu);
> -
> /*
> * build_sched_domains() -> init_sched_groups_capacity()
> * gets here before we've attached the domains to the
> @@ -7807,15 +7804,11 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
> * This avoids capacity from being 0 and
> * causing divide-by-zero issues on boot.
> */
> - if (unlikely(!rq->sd)) {
> - capacity += capacity_of(cpu);
> - } else {
> - sgc = rq->sd->groups->sgc;
> - capacity += sgc->capacity;
> - }
> + unsigned long cpu_cap = capacity_of(cpu);
>
> - min_capacity = min(capacity, min_capacity);
> - max_capacity = max(capacity, max_capacity);
> + min_capacity = min(cpu_cap, min_capacity);
> + max_capacity = max(cpu_cap, max_capacity);
> + capacity += cpu_cap;
> }
> } else {
> /*
>
Yep, if we refactor it to always use capacity_of() we'd end up with
something like this. The comment block could be updated (or removed) as
well.
There must be (or have been) a reason to use the sched_group_capacity
structure, but I'm not aware of it and I don't have time right now to dig
through the git history to figure it out. I didn't see anyone suggesting
or talking about this simplification in the discussion thread of
9abf24d46518 ("sched: Check sched_domain before computing group power")
What you can try is sending that as the v2, and see if anyone screams. FWIW
you can add this to it too:
Reviewed-by: Valentin Schneider <[email protected]>