2023-05-15 12:05:49

by Tobias Huschle

Subject: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

The current load balancer implementation assumes that scheduler groups
within the same scheduler domain all host the same number of CPUs.

This appears to be valid for non-s390 architectures. Nevertheless, s390
can actually have scheduler groups of unequal size.
The current scheduler behavior causes some s390 configs to use SMT
while some cores are still idle, leading to a performance degradation
under certain levels of workload.

Please refer to the patch's commit message for more details and an
example. This patch is a proposal on how to integrate the size of
scheduler groups into the decision process.

This patch is the most basic approach to address this issue and does
not claim to be perfect as-is.

Other ideas that also proved to address the problem, but are more
complex and potentially more precise:
1. On scheduler group building, count the number of CPUs within each
group that are first in their sibling mask. This represents the
number of CPUs that can be used before running into SMT. This
should be slightly more accurate than using the full group weight
if the number of available SMT threads per core varies (see the
sketch after this list).
2. Introduce a new scheduler group classification (smt_busy) in
between fully_busy and has_spare. This classification would
indicate that a group still has spare capacity, but will run
into SMT when using that capacity. This would make the load
balancer prefer groups with fully idle CPUs over ones that are
about to run into SMT.
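
A rough sketch of idea 1 (untested, the helper name is made up; it
would be called while the scheduler groups are built and assumes
CONFIG_SCHED_SMT):

/*
 * Count the CPUs of a group that are the first CPU in their SMT
 * sibling mask, i.e. the number of tasks the group can accept
 * before any core has to run a second SMT thread.
 */
static unsigned int group_weight_before_smt(const struct cpumask *group_mask)
{
	unsigned int cpu, cores = 0;

	for_each_cpu(cpu, group_mask) {
		/* Count each core only once, via its first sibling. */
		if (cpu == cpumask_first(cpu_smt_mask(cpu)))
			cores++;
	}
	return cores;
}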

Feedback would be greatly appreciated.

Tobias Huschle (1):
sched/fair: Consider asymmetric scheduler groups in load balancer

kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--
2.34.1



2023-05-15 12:08:07

by Tobias Huschle

Subject: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

The current load balancer implementation assumes that scheduler groups
within the same domain all host the same number of CPUs. This is
reflected in the condition that a scheduler group which is load
balancing and classified as having spare capacity should pull work
from the busiest group if the local group runs fewer processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degradation in the following scenario:

Consider a system with 8 CPUs, where 6 CPUs are located on one CPU socket
and the remaining 2 are located on another socket:

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8
         x x x          x x

Adding a 6th task yields the following distribution:

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8
SMT1     x x x          x x
SMT2                    x

The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT,
which comes with a performance penalty. This means that in the window
of 6-8 tasks, load balancing is done suboptimally, because SMT is used
although there is no reason to do so as fully idle CPUs are still
available.

Taking the weight of the scheduler groups into account ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.
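
Put differently, the condition now roughly compares the tasks per CPU of
the two groups, busiest->sum_nr_running / busiest->group_weight versus
local->sum_nr_running / local->group_weight, cross-multiplied with the
group weights to avoid divisions.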

Signed-off-by: Tobias Huschle <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* group's child domain.
*/
if (sds.prefer_sibling && local->group_type == group_has_spare &&
- busiest->sum_nr_running > local->sum_nr_running + 1)
+ busiest->sum_nr_running * local->group_weight >
+ local->sum_nr_running * busiest->group_weight + 1)
goto force_balance;

if (busiest->group_type != group_overloaded) {
--
2.34.1


2023-05-16 14:19:29

by Vincent Guittot

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]> wrote:
>
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degredation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> x x x x x
>
> Adding a 6th task yields the following distribution:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> SMT1 x x x x x
> SMT2 x

Your description is a bit confusing for me. What you name CPU above
should be named Core, shouldn't it?

Could you share your scheduler topology with us?

>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle <[email protected]>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * group's child domain.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> - busiest->sum_nr_running > local->sum_nr_running + 1)
> + busiest->sum_nr_running * local->group_weight >
> + local->sum_nr_running * busiest->group_weight + 1)

This is the prefer_sibling path. Could it be that you should disable
prefer_sibling between your sockets for such a topology? The default
path compares the number of idle CPUs when groups have spare capacity.


> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
>

2023-05-16 16:38:59

by Dietmar Eggemann

Subject: Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 15/05/2023 13:46, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same scheduler domain, all host the same number of CPUs.
>
> This appears to be valid for non-s390 architectures. Nevertheless, s390
> can actually have scheduler groups of unequal size.

Arm (classical) big.LITTLE had this for years before we switched to flat
scheduling (only MC sched domain) over CPU capacity boundaries for Arm
DynamIQ.

Arm64 Juno platform in mainline:

root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
0,3-5
1-2
1-2
0,3-5
0,3-5
0,3-5

root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'

domain0 39 <--
domain1 3f
domain0 06 <--
domain1 3f
domain0 06
domain1 3f
domain0 39
domain1 3f
domain0 39
domain1 3f
domain0 39
domain1 3f

root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
DIE

But we don't have SMT on the mobile processors.

It looks like you are only interested in getting the group_weight dependency
into this 'prefer_sibling' condition of find_busiest_group()?

We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove
SD_PREFER_SIBLING from sd->child so we don't run this condition.

> The current scheduler behavior causes some s390 configs to use SMT
> while some cores are still idle, leading to a performance degredation
> under certain levels of workload.
>
> Please refer to the patch's commit message for more details and an
> example. This patch is a proposal on how to integrate the size of
> scheduler groups into the decision process.
>
> This patch is the most basic approach to address this issue and does
> not claim to be perfect as-is.
>
> Other ideas that also proved to address the problem but are more
> complex but also potentially more precise:
> 1. On scheduler group building, count the number of CPUs within each
> group that are first in their sibling mask. This represents the
> number of CPUs that can be used before running into SMT. This
> should be slightly more accurate than using the full group weight
> if the number of available SMT threads per core varies.
> 2. Introduce a new scheduler group classification (smt_busy) in
> between of fully_busy and has_spare. This classification would
> indicate that a group still has spare capacity, but will run
> into SMT when using that capacity. This would make the load
> balancer prefer groups with fully idle CPUs over ones that are
> about to run into SMT.
>
> Feedback would be greatly appreciated.
>
> Tobias Huschle (1):
> sched/fair: Consider asymmetric scheduler groups in load balancer
>
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>


2023-06-05 08:22:39

by Tobias Huschle

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 2023-05-16 15:36, Vincent Guittot wrote:
> On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]>
> wrote:
>>
>> The current load balancer implementation implies that scheduler
>> groups,
>> within the same domain, all host the same number of CPUs. This is
>> reflected in the condition, that a scheduler group, which is load
>> balancing and classified as having spare capacity, should pull work
>> from the busiest group, if the local group runs less processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>>
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>>
>> This introduces a performance degredation in the following scenario:
>>
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>>
>> Socket -----1----- -2-
>> CPU 1 2 3 4 5 6 7 8
>>
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>>
>> The first 5 tasks are distributed evenly across the two groups.
>>
>> Socket -----1----- -2-
>> CPU 1 2 3 4 5 6 7 8
>> x x x x x
>>
>> Adding a 6th task yields the following distribution:
>>
>> Socket -----1----- -2-
>> CPU 1 2 3 4 5 6 7 8
>> SMT1 x x x x x
>> SMT2 x
>
> Your description is a bit confusing for me. What you name CPU above
> should be named Core, doesn' it ?
>
> Could you share with us your scheduler topology ?
>

You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
  0    0      0    0      0    0 0:0:0      yes    yes        horizontal         0
  1    0      0    0      0    0 1:1:0      yes    yes        horizontal         1
  2    0      0    0      0    1 2:2:0      yes    yes        horizontal         2
  3    0      0    0      0    1 3:3:0      yes    yes        horizontal         3
  4    0      0    0      0    2 4:4:0      yes    yes        horizontal         4
  5    0      0    0      0    2 5:5:0      yes    yes        horizontal         5
  6    0      0    0      0    3 6:6:0      yes    yes        horizontal         6
  7    0      0    0      0    3 7:7:0      yes    yes        horizontal         7
  8    0      0    0      0    4 8:8:0      yes    yes        horizontal         8
  9    0      0    0      0    4 9:9:0      yes    yes        horizontal         9
 10    0      0    0      0    5 10:10:0    yes    yes        horizontal        10
 11    0      0    0      0    5 11:11:0    yes    yes        horizontal        11
 12    0      0    0      1    6 12:12:0    yes    yes        horizontal        12
 13    0      0    0      1    6 13:13:0    yes    yes        horizontal        13
 14    0      0    0      1    7 14:14:0    yes    yes        horizontal        14
 15    0      0    0      1    7 15:15:0    yes    yes        horizontal        15

So, 6 cores / 12 CPUs in one group, 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get a
distribution like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
                x     x     x     x      x  x  x  x

Which means that the two cores in the smaller group are running into
SMT while two cores in the larger group are still idle. This is caused
by the prefer_sibling path which really wants to see both groups run
the same number of tasks.
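
Plugging the numbers into the check illustrates this: with the desired
6/2 split (6 tasks in the 12-CPU group, 2 tasks in the 4-CPU group) and
the smaller group load balancing, the current condition 6 > 2 + 1 is
true, so the smaller group keeps force-pulling until both groups run 4
tasks each. With the patched condition, 6 * 4 > 2 * 12 + 1, i.e.
24 > 25, is false, so the prefer_sibling path no longer forces a pull
and the 6/2 split can survive.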

>>
>> The task is added to the 2nd scheduler group, as the scheduler has the
>> assumption that scheduler groups are of the same size, so they should
>> also host the same number of tasks. This makes CPU 7 run into SMT
>> thread, which comes with a performance penalty. This means, that in
>> the window of 6-8 tasks, load balancing is done suboptimally, because
>> SMT is used although there is no reason to do so as fully idle CPUs
>> are still available.
>>
>> Taking the weight of the scheduler groups into account, ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>>
>> Signed-off-by: Tobias Huschle <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group
>> *find_busiest_group(struct lb_env *env)
>> * group's child domain.
>> */
>> if (sds.prefer_sibling && local->group_type == group_has_spare
>> &&
>> - busiest->sum_nr_running > local->sum_nr_running + 1)
>> + busiest->sum_nr_running * local->group_weight >
>> + local->sum_nr_running * busiest->group_weight
>> + 1)
>
> This is the prefer_sibling path. Could it be that you should disable
> prefer_siling between your sockets for such topology ? the default
> path compares the number of idle CPUs when groups has spare capacity
>
>

If I disable prefer_sibling (I played around with it a bit), I run into
the problem that the tasks are distributed s.t. each group has the same
amount of idle CPUs, which yields distributions similar to this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
    x  x  x     x  x     x     x  x

Now both groups have 4 idle CPUs which fulfills the criteria imposed by
the load balancer, but the larger group is now running SMT while the
smaller one is just idle.

So, in this asymmetric setup, both criteria seem to not work in an
optimal way. Going for the same number of idle CPUs or alternatively
for the same number of running processes both cause a sub-optimal
distribution of tasks, leading to unnecessary SMT.

It also seems to be possible to address the regular load balancing path
by aiming to have the same unused capacity between groups instead of
the same number of idle CPUs. This seems to have been considered in the
past, but the choice went in favor of the number of idle CPUs. Since
this decision was actively taken already, I focused on the
prefer_sibling path.

The question now would be how to address this properly (or if I'm
missing something here). As mentioned in the cover letter, this was the
most simplistic and least invasive approach I could find, others might
be more sophisticated but also have some side-effects.

I have a bit of a hard time leaving this one as-is, as it just
introduces two additional multiplications with no effect for most
architectures. Maybe an architecture-specific inline function that the
compiler can optimize away if not needed?
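
Something like this, perhaps (untested sketch, the config symbol and
helper name are just placeholders):

#ifdef CONFIG_ARCH_ASYM_SCHED_GROUPS	/* hypothetical arch opt-in */
static inline unsigned int lb_scale_nr(unsigned int nr, unsigned int weight)
{
	return nr * weight;	/* scale by the other group's weight */
}
#else
static inline unsigned int lb_scale_nr(unsigned int nr, unsigned int weight)
{
	return nr;		/* multiplication compiled away */
}
#endif

The condition would then read the same for everyone and reduce to the
current check where the scaling is not needed:

	if (sds.prefer_sibling && local->group_type == group_has_spare &&
	    lb_scale_nr(busiest->sum_nr_running, local->group_weight) >
	    lb_scale_nr(local->sum_nr_running, busiest->group_weight) + 1)
		goto force_balance;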

>> goto force_balance;
>>
>> if (busiest->group_type != group_overloaded) {
>> --
>> 2.34.1
>>


2023-07-04 09:34:16

by Tobias Huschle

Subject: Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 2023-05-16 18:35, Dietmar Eggemann wrote:
> On 15/05/2023 13:46, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler
>> groups,
>> within the same scheduler domain, all host the same number of CPUs.
>>
>> This appears to be valid for non-s390 architectures. Nevertheless,
>> s390
>> can actually have scheduler groups of unequal size.
>
> Arm (classical) big.Little had this for years before we switched to
> flat
> scheduling (only MC sched domain) over CPU capacity boundaries for Arm
> DynamIQ.
>
> Arm64 Juno platform in mainline:
>
> root@juno:~# cat
> /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
> 0,3-5
> 1-2
> 1-2
> 0,3-5
> 0,3-5
> 0,3-5
>
> root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'
>
> domain0 39 <--
> domain1 3f
> domain0 06 <--
> domain1 3f
> domain0 06
> domain1 3f
> domain0 39
> domain1 3f
> domain0 39
> domain1 3f
> domain0 39
> domain1 3f
>
> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> MC
> DIE
>
> But we don't have SMT on the mobile processors.
>
> It looks like you are only interested to get group_weight dependency
> into this 'prefer_sibling' condition of find_busiest_group()?
>
Sorry, looks like your reply hit some bad filter of my mail program.
Let me answer, although it's a bit late.

Yes, I would like to get the group_weight into the prefer_sibling path.
Unfortunately, we cannot go for a flat hierarchy as the s390 hardware
allows CPUs to be pretty far apart (cache-wise), which means the
load balancer should avoid moving tasks back and forth between
those CPUs if possible.

We can't remove SD_PREFER_SIBLING either, as this would cause the load
balancer to aim for having the same number of idle CPUs in all groups,
which is a problem as well in asymmetric groups, for example:

With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
                x     x     x     x      x  x  x  x

Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
    x  x  x     x  x     x     x  x


Hence the idea to add the group_weight to the prefer_sibling path.

I was wondering if this would be the right place to address this issue
or if I should go down another route.

> We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove
> SD_PREFER_SIBLING from sd->child so we don't run this condition.
>
>> The current scheduler behavior causes some s390 configs to use SMT
>> while some cores are still idle, leading to a performance degredation
>> under certain levels of workload.
>>
>> Please refer to the patch's commit message for more details and an
>> example. This patch is a proposal on how to integrate the size of
>> scheduler groups into the decision process.
>>
>> This patch is the most basic approach to address this issue and does
>> not claim to be perfect as-is.
>>
>> Other ideas that also proved to address the problem but are more
>> complex but also potentially more precise:
>> 1. On scheduler group building, count the number of CPUs within each
>> group that are first in their sibling mask. This represents the
>> number of CPUs that can be used before running into SMT. This
>> should be slightly more accurate than using the full group weight
>> if the number of available SMT threads per core varies.
>> 2. Introduce a new scheduler group classification (smt_busy) in
>> between of fully_busy and has_spare. This classification would
>> indicate that a group still has spare capacity, but will run
>> into SMT when using that capacity. This would make the load
>> balancer prefer groups with fully idle CPUs over ones that are
>> about to run into SMT.
>>
>> Feedback would be greatly appreciated.
>>
>> Tobias Huschle (1):
>> sched/fair: Consider asymmetric scheduler groups in load balancer
>>
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>


2023-07-04 13:46:34

by Peter Zijlstra

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures.

Mostly; there's historically the cpuset case where we can create
lopsided groups like that. And today we're growing all these hybrid
things that will also tickle this, except they're looking towards
different balancer extensions to deal with the IPC difference so might
not be immediately caring about this here issue.


> Signed-off-by: Tobias Huschle <[email protected]>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * group's child domain.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> - busiest->sum_nr_running > local->sum_nr_running + 1)
> + busiest->sum_nr_running * local->group_weight >
> + local->sum_nr_running * busiest->group_weight + 1)

Should that not be: busiest->group_weight * (local->sum_nr_running + 1) ?
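
That is, scaling the original check busiest > local + 1 consistently,
busiest / busiest->group_weight > (local + 1) / local->group_weight,
cross-multiplies to busiest->sum_nr_running * local->group_weight >
(local->sum_nr_running + 1) * busiest->group_weight, so the '+ 1' gets
scaled by the busiest group's weight as well, while the patch only adds
1 after scaling.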

I'm not opposed to this -- it seems fairly straightforward.

> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
>

2023-07-05 08:18:20

by Vincent Guittot

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On Monday, 5 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]>
> > wrote:
> > >
> > > The current load balancer implementation implies that scheduler
> > > groups,
> > > within the same domain, all host the same number of CPUs. This is
> > > reflected in the condition, that a scheduler group, which is load
> > > balancing and classified as having spare capacity, should pull work
> > > from the busiest group, if the local group runs less processes than
> > > the busiest one. This implies that these two groups should run the
> > > same number of processes, which is problematic if the groups are not
> > > of the same size.
> > >
> > > The assumption that scheduler groups within the same scheduler domain
> > > host the same number of CPUs appears to be true for non-s390
> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > > size.
> > >
> > > This introduces a performance degredation in the following scenario:
> > >
> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > > the remaining 2 are located on another socket:
> > >
> > > Socket -----1----- -2-
> > > CPU 1 2 3 4 5 6 7 8
> > >
> > > Placing some workload ( x = one task ) yields the following
> > > scenarios:
> > >
> > > The first 5 tasks are distributed evenly across the two groups.
> > >
> > > Socket -----1----- -2-
> > > CPU 1 2 3 4 5 6 7 8
> > > x x x x x
> > >
> > > Adding a 6th task yields the following distribution:
> > >
> > > Socket -----1----- -2-
> > > CPU 1 2 3 4 5 6 7 8
> > > SMT1 x x x x x
> > > SMT2 x
> >
> > Your description is a bit confusing for me. What you name CPU above
> > should be named Core, doesn' it ?
> >
> > Could you share with us your scheduler topology ?
> >
>
> You are correct, it should say core instead of CPU.
>
> One actual configuration from one of my test machines (lscpu -e):
>

[...]

>
> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

Thanks for the details

>
> If I run stress-ng with 8 cpu stressors on the original code I get a
> distribution
> like this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
> x x x x x x x x
>
> Which means that the two cores in the smaller group are running into SMT
> while two
> cores in the larger group are still idle. This is caused by the
> prefer_sibling path
> which really wants to see both groups run the same number of tasks.

yes, and it considers that there is the same number of CPUs per group

>
> > >
> > > The task is added to the 2nd scheduler group, as the scheduler has the
> > > assumption that scheduler groups are of the same size, so they should
> > > also host the same number of tasks. This makes CPU 7 run into SMT
> > > thread, which comes with a performance penalty. This means, that in
> > > the window of 6-8 tasks, load balancing is done suboptimally, because
> > > SMT is used although there is no reason to do so as fully idle CPUs
> > > are still available.
> > >
> > > Taking the weight of the scheduler groups into account, ensures that
> > > a load balancing CPU within a smaller group will not try to pull tasks
> > > from a bigger group while the bigger group still has idle CPUs
> > > available.
> > >
> > > Signed-off-by: Tobias Huschle <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0ca13ac..b1307d7e4065 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10426,7 +10426,8 @@ static struct sched_group
> > > *find_busiest_group(struct lb_env *env)
> > > * group's child domain.
> > > */
> > > if (sds.prefer_sibling && local->group_type ==
> > > group_has_spare &&
> > > - busiest->sum_nr_running > local->sum_nr_running + 1)
> > > + busiest->sum_nr_running * local->group_weight >
> > > + local->sum_nr_running *
> > > busiest->group_weight + 1)


what you want to test here is that moving 1 task from busiest to local group
would help and balance the ratio of tasks per cpu

(busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight

which can be develop into

busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight

and you also have to change how we calculate the imbalance, which currently just provides half of the diff of nr_running

by something like

((busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight)) / (busiest->group_weight + local->group_weight)
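
Written out against the fields used in find_busiest_group() and
calculate_imbalance(), that would roughly be (untested):

	/* find_busiest_group(): pull only if it improves the tasks-per-CPU ratio */
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
	    busiest->sum_nr_running * local->group_weight >=
	    local->sum_nr_running * busiest->group_weight +
	    busiest->group_weight + local->group_weight)
		goto force_balance;

	/* calculate_imbalance(): number of tasks that evens out the ratio */
	env->migration_type = migrate_task;
	env->imbalance = (busiest->sum_nr_running * local->group_weight -
			  local->sum_nr_running * busiest->group_weight) /
			 (busiest->group_weight + local->group_weight);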

> >
> > This is the prefer_sibling path. Could it be that you should disable
> > prefer_siling between your sockets for such topology ? the default
> > path compares the number of idle CPUs when groups has spare capacity
> >
> >
>
> If I disable prefer_sibling (I played around with it a bit), I run into the
> problem,
> that the tasks are distributed s.t. each group has the same amount of idle
> CPUs, which
> yields distributions similar to this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
> x x x x x x x x
>
> Now both groups have 4 idle CPUs which fulfills the criteria imposed by the
> load balancer,
> but the larger group is now running SMT while the smaller one is just idle.
>
> So, in this asymmetric setup, both criteria seem to not work in an optimal
> way. Going for
> the same number of idle CPUs or alternatively for the same number of running
> processes
> both cause a sub-optimal distribution of tasks, leading to unnecessary SMT.

there is the same behavior and assumption here too


>
> It seems also to be possible to address the regular load balancing path by
> aiming to have the
> same unused capacity between groups instead of the same number of idle CPUs.
> This seems to
> have been considered in the past, but the choice went in favor of the number
> of idle CPUs.

unused capacity doesn't give the instantaneous state so a group can be idle but without
unused capacity

> Since this decision was actively taken already, I focused on the
> prefer_sibling path.
>
> The question now would be how to address this properly (or if I'm missing
> something here).
> As mentioned in the cover letter, this was the most simplistic and least
> invasive approach
> I could find, others might be more sophisticated but also have some
> side-effects.
>
> I have a bit of a hard time leaving this one as-is, as it just introduces
> two additional
> multiplications with no effect for most architectures. Maybe an
> architectures specific
> inline function that the compiler can optimize away if not needed?
>
> > > goto force_balance;
> > >
> > > if (busiest->group_type != group_overloaded) {
> > > --
> > > 2.34.1
> > >
>

2023-07-06 12:09:33

by Dietmar Eggemann

Subject: Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 04/07/2023 11:11, Tobias Huschle wrote:
> On 2023-05-16 18:35, Dietmar Eggemann wrote:
>> On 15/05/2023 13:46, Tobias Huschle wrote:
>>> The current load balancer implementation implies that scheduler groups,
>>> within the same scheduler domain, all host the same number of CPUs.
>>>
>>> This appears to be valid for non-s390 architectures. Nevertheless, s390
>>> can actually have scheduler groups of unequal size.
>>
>> Arm (classical) big.Little had this for years before we switched to flat
>> scheduling (only MC sched domain) over CPU capacity boundaries for Arm
>> DynamIQ.
>>
>> Arm64 Juno platform in mainline:
>>
>> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
>> 0,3-5
>> 1-2
>> 1-2
>> 0,3-5
>> 0,3-5
>> 0,3-5
>>
>> root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'
>>
>> domain0 39 <--
>> domain1 3f
>> domain0 06 <--
>> domain1 3f
>> domain0 06
>> domain1 3f
>> domain0 39
>> domain1 3f
>> domain0 39
>> domain1 3f
>> domain0 39
>> domain1 3f
>>
>> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> MC
>> DIE
>>
>> But we don't have SMT on the mobile processors.
>>
>> It looks like you are only interested to get group_weight dependency
>> into this 'prefer_sibling' condition of find_busiest_group()?
>>
> Sorry, looks like your reply hit some bad filter of my mail program.
> Let me answer, although it's a bit late.
>
> Yes, I would like to get the group_weight into the prefer_sibling path.
> Unfortunately, we cannot go for a flat hierarchy as the s390 hardware
> allows to have CPUs to be pretty far apart (cache-wise), which means,
> the load balancer should avoid to move tasks back and forth between
> those CPUs if possible.
>
> We can't remove SD_PREFER_SIBLING either, as this would cause the load
> balancer to aim for having the same number of idle CPUs in all groups,
> which is a problem as well in asymmetric groups, for example:
>
> With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>                 x     x     x     x      x  x  x  x
>
> Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>     x  x  x     x  x     x     x  x
>
>
> Hence the idea to add the group_weight to the prefer_sibling path.
>
> I was wondering if this would be the right place to address this issue
> or if I should go down another route.

Yes, it's the right place to fix it for you. IMHO, there is still some
discussion needed about the correct condition and changes in
calculate_imbalance() for your case if I read the comments on this
thread correctly.

Arm64 big.LITTLE wouldn't be affected since we explicitly remove
SD_PREFER_SIBLING on MC for our legacy MC,DIE setups to avoid spreading
tasks across DIE sched groups holding CPUs with different capacities.

[...]

2023-07-06 17:31:13

by Shrikanth Hegde

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer



On 5/15/23 5:16 PM, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degredation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> x x x x x
>
> Adding a 6th task yields the following distribution:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> SMT1 x x x x x
> SMT2 x
>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle <[email protected]>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * group's child domain.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> - busiest->sum_nr_running > local->sum_nr_running + 1)
> + busiest->sum_nr_running * local->group_weight >
> + local->sum_nr_running * busiest->group_weight + 1)
> goto force_balance;
>


I assume it's SMT2 here. A sched group is enclosed in [busy_cpus/idle_cpus/weight].

Let's take an example: we will begin with the said imbalance.
[3/9/12] -- local group
[3/1/4] -- busy group.
We will evaluate 3*12 > (4*(3+1)), which is true, and proceed further to balance.
But calculate_imbalance will lead to zero as the imbalance, no? In the prefer_sibling
case it finds the difference of sum_nr_running of the two groups. That will be 3-3 = 0.

This would need modifications to calculate_imbalance.
Maybe something like this will help? NOT TESTED AT ALL.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e027d4086edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return;
}

- if (busiest->group_weight == 1 || sds->prefer_sibling) {
+ if (busiest->group_weight == 1 ||
+ (sds->prefer_sibling &&
+ busiest->group_weight == local->group_weight)) {
unsigned int nr_diff = busiest->sum_nr_running;
/*
* When prefer sibling, evenly spread running tasks on
@@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
env->migration_type = migrate_task;
lsub_positive(&nr_diff, local->sum_nr_running);
env->imbalance = nr_diff;
+ }
+ if (sds->prefer_sibling &&
+ busiest->group_weight != local->group_weight) {
+ env->migration_type = migrate_task;
+ env->imbalance = 1;
} else {


---------------------------------------------------------------------------------------------------
On a tangential dimension.


Since prefer_sibling makes the inherent assumption that all groups have equal weight,
it will cause complications when group_weights are different. I think that becomes very
tricky when there are more than two groups. Depending on which cpu is doing
load_balance there can be unnecessary migrations.

Currently even in NUMA this can be a similar case, right? There will be an unequal
number of CPUs, right? If we fix that case and remove prefer_sibling in your arch,
will that work?

Maybe something like this? NOT TESTED AT ALL.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..fc6377f48ced 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto out_balanced;

if (busiest->group_weight > 1 &&
+ busiest->group_weight == local->group_weight &&
local->idle_cpus <= (busiest->idle_cpus + 1))
/*
* If the busiest group is not overloaded
@@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
goto out_balanced;

+ if ((busiest->group_weight != local->group_weight) &&
+ (local->idle_cpus * busiest->group_weight <=
+ local->group_weight * (busiest->idle_cpus + 1)))
+ goto out_balanced;
+
if (busiest->sum_h_nr_running == 1)
/*
* busiest doesn't have any tasks waiting to run





> if (busiest->group_type != group_overloaded) {

2023-07-07 08:20:37

by Tobias Huschle

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 2023-07-06 19:19, Shrikanth Hegde wrote:
> On 5/15/23 5:16 PM, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler
>> groups,
>> within the same domain, all host the same number of CPUs. This is
>> reflected in the condition, that a scheduler group, which is load
>> balancing and classified as having spare capacity, should pull work
>> from the busiest group, if the local group runs less processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>>
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>>
>> This introduces a performance degredation in the following scenario:
>>
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>>
>> Socket -----1----- -2-
>> CPU 1 2 3 4 5 6 7 8
>>
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>>
>> The first 5 tasks are distributed evenly across the two groups.
>>
>> Socket -----1----- -2-
>> CPU 1 2 3 4 5 6 7 8
>> x x x x x
>>
>> Adding a 6th task yields the following distribution:
>>
>> Socket -----1----- -2-
>> CPU 1 2 3 4 5 6 7 8
>> SMT1 x x x x x
>> SMT2 x
>>
>> The task is added to the 2nd scheduler group, as the scheduler has the
>> assumption that scheduler groups are of the same size, so they should
>> also host the same number of tasks. This makes CPU 7 run into SMT
>> thread, which comes with a performance penalty. This means, that in
>> the window of 6-8 tasks, load balancing is done suboptimally, because
>> SMT is used although there is no reason to do so as fully idle CPUs
>> are still available.
>>
>> Taking the weight of the scheduler groups into account, ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>>
>> Signed-off-by: Tobias Huschle <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group
>> *find_busiest_group(struct lb_env *env)
>> * group's child domain.
>> */
>> if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> - busiest->sum_nr_running > local->sum_nr_running + 1)
>> + busiest->sum_nr_running * local->group_weight >
>> + local->sum_nr_running * busiest->group_weight + 1)
>> goto force_balance;
>>
>
>
> I assume its SMT2 here. sched group is enclosed in
> [busy_cpus/idle_cpus/weight]
>
> Lets take an example: we will begin the with the said imbalance.
> [3/9/12] -- local group
> [3/1/4] -- busy group.
> we will evaluate 3*12 > (4*(3+1)) is true and proceeds further to
> balance.
> but calculate_imbalance will lead to zero as the imbalance no? in case
> of prefer sibling
> case it find the difference of sum_nr_running of the two groups. It
> will be 3-3 = 0
>
> this would need modifications to calculate_imbalance.
> Maybe something like this will help? NOT TESTED AT ALL.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e027d4086edc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> return;
> }
>
> - if (busiest->group_weight == 1 || sds->prefer_sibling)
> {
> + if (busiest->group_weight == 1 ||
> + (sds->prefer_sibling &&
> + busiest->group_weight == local->group_weight))
> {
> unsigned int nr_diff = busiest->sum_nr_running;
> /*
> * When prefer sibling, evenly spread running
> tasks on
> @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> env->migration_type = migrate_task;
> lsub_positive(&nr_diff, local->sum_nr_running);
> env->imbalance = nr_diff;
> + }
> + if (sds->prefer_sibling &&
> + busiest->group_weight != local->group_weight) {
> + env->migration_type = migrate_task;
> + env->imbalance = 1;
> } else {
>

I also had a look at this when Vincent pointed out that this part is
missing. The formula proposed by Vincent works pretty well, it is not
even necessary to add additional if-statements here.
>
> ---------------------------------------------------------------------------------------------------
> On a tangential dimension.
>
>
> Since prefer_siblings make inherent assumption that all groups have
> equal weight,
> it will cause complications when group_weights are different. I think
> that becomes very
> tricky when there are more than two groups. Depending on which cpu is
> doing
> load_balance there can be unnecessary migrations.
>
>
> Currently even in NUMA this can be similar case right? There will be
> unequal number of CPU's right?
> If we fix that case and remove prefer siblings in your arch, will that
> work?
>
> Maybe something like this? NOT TESTED AT ALL.
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..fc6377f48ced 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10514,6 +10514,7 @@ static struct sched_group
> *find_busiest_group(struct lb_env *env)
> goto out_balanced;
>
> if (busiest->group_weight > 1 &&
> + busiest->group_weight == local->group_weight &&
> local->idle_cpus <= (busiest->idle_cpus + 1))
> /*
> * If the busiest group is not overloaded
> @@ -10526,6 +10527,11 @@ static struct sched_group
> *find_busiest_group(struct lb_env *env)
> */
> goto out_balanced;
>
> + if ((busiest->group_weight != local->group_weight) &&
> + (local->idle_cpus * busiest->group_weight <=
> + local->group_weight *
> (busiest->idle_cpus + 1)))
> + goto out_balanced;
> +
> if (busiest->sum_h_nr_running == 1)
> /*
> * busiest doesn't have any tasks waiting to
> run
>
>
>
>
>
>> if (busiest->group_type != group_overloaded) {

I played around with alternate solutions as well, yours could be
interesting in order to essentially declare the problematic state as
balanced. I wouldn't be opposed to removing prefer_sibling. The
question would be if it is necessary to address both scenarios anyway.



2023-07-07 08:21:26

by Tobias Huschle

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 2023-07-04 15:40, Peter Zijlstra wrote:
> On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler
>> groups,
>> within the same domain, all host the same number of CPUs. This is
>> reflected in the condition, that a scheduler group, which is load
>> balancing and classified as having spare capacity, should pull work
>> from the busiest group, if the local group runs less processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>>
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures.
>
> Mostly; there's historically the cpuset case where we can create
> lopsided groups like that. And today we're growing all these hybrid
> things that will also tickle this, except they're looking towards
> different balancer extentions to deal with the IPC difference so might
> not be immediately caring about this here issue.
>
>
>> Signed-off-by: Tobias Huschle <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group
>> *find_busiest_group(struct lb_env *env)
>> * group's child domain.
>> */
>> if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> - busiest->sum_nr_running > local->sum_nr_running + 1)
>> + busiest->sum_nr_running * local->group_weight >
>> + local->sum_nr_running * busiest->group_weight + 1)
>
> Should that not be: busiest->group_weight * (local->sum_nr_running + 1)
> ?

I agree, adding the brackets makes more sense and is clearer on what's
intended by this check while also preserving the original behavior for
local->group_weight == busiest->group_weight
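
For that case, busiest->sum_nr_running * local->group_weight >
busiest->group_weight * (local->sum_nr_running + 1) divides down to
busiest->sum_nr_running > local->sum_nr_running + 1, which is exactly
the current check.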

>
> I'm not opposed to this -- it seems fairly straight forward.

Appreciated, I will go ahead and send a patch once I have incorporated
the other feedback I got.
Thanks.

>
>> goto force_balance;
>>
>> if (busiest->group_type != group_overloaded) {
>> --
>> 2.34.1
>>

2023-07-07 08:22:20

by Tobias Huschle

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 2023-07-05 09:52, Vincent Guittot wrote:
> On Monday, 5 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>> On 2023-05-16 15:36, Vincent Guittot wrote:
>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]>
>> > wrote:
>> > >
>> > > The current load balancer implementation implies that scheduler
>> > > groups,
>> > > within the same domain, all host the same number of CPUs. This is
>> > > reflected in the condition, that a scheduler group, which is load
>> > > balancing and classified as having spare capacity, should pull work
>> > > from the busiest group, if the local group runs less processes than
>> > > the busiest one. This implies that these two groups should run the
>> > > same number of processes, which is problematic if the groups are not
>> > > of the same size.
>> > >
>> > > The assumption that scheduler groups within the same scheduler domain
>> > > host the same number of CPUs appears to be true for non-s390
>> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
>> > > size.
>> > >
>> > > This introduces a performance degredation in the following scenario:
>> > >
>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> > > the remaining 2 are located on another socket:
>> > >
>> > > Socket -----1----- -2-
>> > > CPU 1 2 3 4 5 6 7 8
>> > >
>> > > Placing some workload ( x = one task ) yields the following
>> > > scenarios:
>> > >
>> > > The first 5 tasks are distributed evenly across the two groups.
>> > >
>> > > Socket -----1----- -2-
>> > > CPU 1 2 3 4 5 6 7 8
>> > > x x x x x
>> > >
>> > > Adding a 6th task yields the following distribution:
>> > >
>> > > Socket -----1----- -2-
>> > > CPU 1 2 3 4 5 6 7 8
>> > > SMT1 x x x x x
>> > > SMT2 x
>> >
>> > Your description is a bit confusing for me. What you name CPU above
>> > should be named Core, doesn' it ?
>> >
>> > Could you share with us your scheduler topology ?
>> >
>>
>> You are correct, it should say core instead of CPU.
>>
>> One actual configuration from one of my test machines (lscpu -e):
>>
>
> [...]
>
>>
>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>
> Thaks for the details
>
>>
>> If I run stress-ng with 8 cpu stressors on the original code I get a
>> distribution
>> like this:
>>
>> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
>> x x x x x x x x
>>
>> Which means that the two cores in the smaller group are running into
>> SMT
>> while two
>> cores in the larger group are still idle. This is caused by the
>> prefer_sibling path
>> which really wants to see both groups run the same number of tasks.
>
> yes and it considers that there are the same number of CPUs per group
>
>>
>> > >
>> > > The task is added to the 2nd scheduler group, as the scheduler has the
>> > > assumption that scheduler groups are of the same size, so they should
>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>> > > thread, which comes with a performance penalty. This means, that in
>> > > the window of 6-8 tasks, load balancing is done suboptimally, because
>> > > SMT is used although there is no reason to do so as fully idle CPUs
>> > > are still available.
>> > >
>> > > Taking the weight of the scheduler groups into account, ensures that
>> > > a load balancing CPU within a smaller group will not try to pull tasks
>> > > from a bigger group while the bigger group still has idle CPUs
>> > > available.
>> > >
>> > > Signed-off-by: Tobias Huschle <[email protected]>
>> > > ---
>> > > kernel/sched/fair.c | 3 ++-
>> > > 1 file changed, 2 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>> > > --- a/kernel/sched/fair.c
>> > > +++ b/kernel/sched/fair.c
>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>> > > *find_busiest_group(struct lb_env *env)
>> > > * group's child domain.
>> > > */
>> > > if (sds.prefer_sibling && local->group_type ==
>> > > group_has_spare &&
>> > > - busiest->sum_nr_running > local->sum_nr_running + 1)
>> > > + busiest->sum_nr_running * local->group_weight >
>> > > + local->sum_nr_running *
>> > > busiest->group_weight + 1)
>
>
> what you want to test here is that moving 1 task from busiest to local
> group
> would help and balance the ratio of tasks per cpu
>
> (busiest->sum_nr_running - 1) / busiest->group_weight >
> (local->sum_nr_running + 1) / local->group_weight
>
> which can be develop into
>
> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
> * busiest->group_weight + busiest->group_weight + local->group_weight
>
> and you also have to change how we calculate the imbalance which just
> provide the half of the diff of nr_running
>
> by something like
>
> (busiest->sum_nr_running * local->group_weight) -
> (local->sum_nr_running * busiest->group_weight) /
> (busiest->group_weight + local->group_weight)
>

Ahh right, I had a look at the imbalance part now and your suggestion
works pretty well. Just had to make some minor adjustments so far.
A nice side effect is that this allows the load balancer to behave
exactly the same as before in the cases where local->group_weight ==
busiest->group_weight. The behavior only changes for the case where
the groups are not of equal size.
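
To spell it out: for local->group_weight == busiest->group_weight == W,
the proposed condition busiest * W >= local * W + 2 * W reduces to
busiest >= local + 2, i.e. the current busiest > local + 1, and the
imbalance (busiest - local) * W / (2 * W) becomes the usual half of the
difference in nr_running.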

I will figure out a solution and send a patch soon which incorporates
these adjustments plus a more detailed description.

Thanks for the feedback.

>> >
>> > This is the prefer_sibling path. Could it be that you should disable
>> > prefer_siling between your sockets for such topology ? the default
>> > path compares the number of idle CPUs when groups has spare capacity
>> >
>> >
>>
>> If I disable prefer_sibling (I played around with it a bit), I run
>> into the
>> problem,
>> that the tasks are distributed s.t. each group has the same amount of
>> idle
>> CPUs, which
>> yields distributions similar to this:
>>
>> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
>> x x x x x x x x
>>
>> Now both groups have 4 idle CPUs which fulfills the criteria imposed
>> by the
>> load balancer,
>> but the larger group is now running SMT while the smaller one is just
>> idle.
>>
>> So, in this asymmetric setup, both criteria seem to not work in an
>> optimal
>> way. Going for
>> the same number of idle CPUs or alternatively for the same number of
>> running
>> processes
>> both cause a sub-optimal distribution of tasks, leading to unnecessary
>> SMT.
>
> there is the same behavior and assumption here too
>
>
>>
>> It seems also to be possible to address the regular load balancing
>> path by
>> aiming to have the
>> same unused capacity between groups instead of the same number of idle
>> CPUs.
>> This seems to
>> have been considered in the past, but the choice went in favor of the
>> number
>> of idle CPUs.
>
> unused capacity doesn't give the instantaneous state so a group can be
> idle but without
> unused capacity
>
>> Since this decision was actively taken already, I focused on the
>> prefer_sibling path.
>>
>> The question now would be how to address this properly (or if I'm
>> missing
>> something here).
>> As mentioned in the cover letter, this was the most simplistic and
>> least
>> invasive approach
>> I could find, others might be more sophisticated but also have some
>> side-effects.
>>
>> I have a bit of a hard time leaving this one as-is, as it just
>> introduces
>> two additional
>> multiplications with no effect for most architectures. Maybe an
>> architectures specific
>> inline function that the compiler can optimize away if not needed?
>>
>> > > goto force_balance;
>> > >
>> > > if (busiest->group_type != group_overloaded) {
>> > > --
>> > > 2.34.1
>> > >
>>

2023-07-07 15:07:49

by Shrikanth Hegde

Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer



On 7/7/23 1:14 PM, Tobias Huschle wrote:
> On 2023-07-05 09:52, Vincent Guittot wrote:
>> On Monday, 5 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]>
>>> > wrote:
>>> > >
>>> > > The current load balancer implementation implies that scheduler
>>> > > groups,
>>> > > within the same domain, all host the same number of CPUs. This is
>>> > > reflected in the condition, that a scheduler group, which is load
>>> > > balancing and classified as having spare capacity, should pull work
>>> > > from the busiest group, if the local group runs less processes than
>>> > > the busiest one. This implies that these two groups should run the
>>> > > same number of processes, which is problematic if the groups are not
>>> > > of the same size.
>>> > >
>>> > > The assumption that scheduler groups within the same scheduler
>>> domain
>>> > > host the same number of CPUs appears to be true for non-s390
>>> > > architectures. Nevertheless, s390 can have scheduler groups of
>>> unequal
>>> > > size.
>>> > >
>>> > > This introduces a performance degredation in the following scenario:
>>> > >
>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>>> > > the remaining 2 are located on another socket:
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > >
>>> > > Placing some workload ( x = one task ) yields the following
>>> > > scenarios:
>>> > >
>>> > > The first 5 tasks are distributed evenly across the two groups.
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > >          x x x          x x
>>> > >
>>> > > Adding a 6th task yields the following distribution:
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > > SMT1     x x x          x x
>>> > > SMT2                    x
>>> >
>>> > Your description is a bit confusing for me. What you name CPU above
>>> > should be named Core, doesn' it ?
>>> >
>>> > Could you share with us your scheduler topology ?
>>> >
>>>
>>> You are correct, it should say core instead of CPU.
>>>
>>> One actual configuration from one of my test machines (lscpu -e):
>>>
>>
>> [...]
>>
>>>
>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>
>> Thaks for the details
>>
>>>
>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>> distribution
>>> like this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>                 x     x     x     x      x  x  x  x
>>>
>>> Which means that the two cores in the smaller group are running into SMT
>>> while two
>>> cores in the larger group are still idle. This is caused by the
>>> prefer_sibling path
>>> which really wants to see both groups run the same number of tasks.
>>
>> yes and it considers that there are the same number of CPUs per group
>>
>>>
>>> > >
>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>> has the
>>> > > assumption that scheduler groups are of the same size, so they
>>> should
>>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>>> > > thread, which comes with a performance penalty. This means, that in
>>> > > the window of 6-8 tasks, load balancing is done suboptimally,
>>> because
>>> > > SMT is used although there is no reason to do so as fully idle CPUs
>>> > > are still available.
>>> > >
>>> > > Taking the weight of the scheduler groups into account, ensures that
>>> > > a load balancing CPU within a smaller group will not try to pull
>>> tasks
>>> > > from a bigger group while the bigger group still has idle CPUs
>>> > > available.
>>> > >
>>> > > Signed-off-by: Tobias Huschle <[email protected]>
>>> > > ---
>>> > >  kernel/sched/fair.c | 3 ++-
>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>> > >
>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>> > > --- a/kernel/sched/fair.c
>>> > > +++ b/kernel/sched/fair.c
>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>> > > *find_busiest_group(struct lb_env *env)
>>> > >          * group's child domain.
>>> > >          */
>>> > >         if (sds.prefer_sibling && local->group_type ==
>>> > > group_has_spare &&
>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>> > > +           busiest->sum_nr_running * local->group_weight >
>>> > > +                       local->sum_nr_running *
>>> > > busiest->group_weight + 1)
>>
>>
>> what you want to test here is that moving 1 task from busiest to local
>> group
>> would help and balance the ratio of tasks per cpu
>>
>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>> (local->sum_nr_running + 1) / local->group_weight
>>
>> which can be develop into
>>
>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>
>> and you also have to change how we calculate the imbalance which just
>> provide the half of the diff of nr_running
>>
>> by something like
>>
>> (busiest->sum_nr_running * local->group_weight) -
>> (local->sum_nr_running * busiest->group_weight) /
>> (busiest->group_weight + local->group_weight)
>>
>
> Ahh right, I had a look at the imbalance part now and your suggestion works
> pretty well. Just had to make some minor adjustments so far.
> Nice side effect is, that this allows the load balancer to behave
> exactly the
> same as before in the cases where local->group_weight ==
> busiest->group_weight.
> The behavior only changes for the case where the groups are not of equal
> size.


Not sure if this has already been figured out or discussed; pointing out one possible scenario.

Taking the formulas:
busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight
and calculate_imbalance:
((busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight)) / (busiest->group_weight + local->group_weight)


First, let's say the imbalance was like this, same example as before; each sched_group given as [busy_cpus/idle_cpus/group_weight]:
[3/9/12] - local group.
[3/1/4] - busy group.

3*12 >= 3*4+12+4 --> true, and the imbalance would be (3*12-3*4)/(12+4) = 24/16, which is > 1 -- let's say 1.
We will balance, good.

[4/8/12]
[2/2/4]
There will be no further load balancing, good.

A new process comes in; it would be scheduled on the [4/8/12] sched group as that would be the idlest.
[5/7/12]
[2/2/4]

A process running on [2/2/4] exits. In this scenario, do you expect the balance to happen again, since balancing would result in optimal performance?
[5/7/12] - busy_group
[1/3/4] - local group

5*4 >= 1*12+12+4 --> will not balance.

[5/7/12] - local group
[1/3/4] - busy group
1*12 >= 5*4 + 12 + 4 --> will not balance.

Does this scenario need to be handled as well?
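
Just to make the arithmetic above easy to re-check, a small standalone sketch
(struct and names are made up for this example, this is not kernel code):

	#include <stdbool.h>
	#include <stdio.h>

	struct grp { unsigned int nr_running, weight; };

	/*
	 * Vincent's developed condition: after moving one task, busiest would
	 * still have at least as many tasks per CPU as local.
	 */
	static bool should_balance(struct grp busiest, struct grp local)
	{
		return busiest.nr_running * local.weight >=
		       local.nr_running * busiest.weight +
		       busiest.weight + local.weight;
	}

	int main(void)
	{
		struct grp big   = { .nr_running = 5, .weight = 12 }; /* [5/7/12] */
		struct grp small = { .nr_running = 1, .weight =  4 }; /* [1/3/4]  */

		/* small group is local: 5*4 >= 1*12 + 12 + 4 -> 0, no balance */
		printf("%d\n", should_balance(big, small));
		/* big group is local:   1*12 >= 5*4 + 12 + 4 -> 0, no balance */
		printf("%d\n", should_balance(small, big));
		return 0;
	}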

>
> I will figure out a solution and send a patch soon which incorporates these
> adjustments plus a more detailed description.
>
> Thanks for the feedback.
>
>>> >
>>> > This is the prefer_sibling path. Could it be that you should disable
>>> > prefer_siling between your sockets for such topology ? the default
>>> > path compares the number of idle CPUs when groups has spare capacity
>>> >
>>> >
>>>
>>> If I disable prefer_sibling (I played around with it a bit), I run
>>> into the
>>> problem,
>>> that the tasks are distributed s.t. each group has the same amount of
>>> idle
>>> CPUs, which
>>> yields distributions similar to this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>     x  x  x     x  x     x     x  x
>>>
>>> Now both groups have 4 idle CPUs which fulfills the criteria imposed
>>> by the
>>> load balancer,
>>> but the larger group is now running SMT while the smaller one is just
>>> idle.
>>>
>>> So, in this asymmetric setup, both criteria seem to not work in an
>>> optimal
>>> way. Going for
>>> the same number of idle CPUs or alternatively for the same number of
>>> running
>>> processes
>>> both cause a sub-optimal distribution of tasks, leading to
>>> unnecessary SMT.
>>
>> there is the same behavior and assumption here too
>>
>>
>>>
>>> It seems also to be possible to address the regular load balancing
>>> path by
>>> aiming to have the
>>> same unused capacity between groups instead of the same number of
>>> idle CPUs.
>>> This seems to
>>> have been considered in the past, but the choice went in favor of the
>>> number
>>> of idle CPUs.
>>
>> unused capacity doesn't give the instantaneous state so a group can be
>> idle but without
>> unused capacity
>>
>>> Since this decision was actively taken already, I focused on the
>>> prefer_sibling path.
>>>
>>> The question now would be how to address this properly (or if I'm
>>> missing
>>> something here).
>>> As mentioned in the cover letter, this was the most simplistic and least
>>> invasive approach
>>> I could find, others might be more sophisticated but also have some
>>> side-effects.
>>>
>>> I have a bit of a hard time leaving this one as-is, as it just
>>> introduces
>>> two additional
>>> multiplications with no effect for most architectures. Maybe an
>>> architectures specific
>>> inline function that the compiler can optimize away if not needed?
>>>
>>> > >                 goto force_balance;
>>> > >
>>> > >         if (busiest->group_type != group_overloaded) {
>>> > > --
>>> > > 2.34.1
>>> > >
>>>

2023-07-07 16:18:13

by Tobias Huschle

[permalink] [raw]
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

On 2023-07-07 16:33, Shrikanth Hegde wrote:
> On 7/7/23 1:14 PM, Tobias Huschle wrote:
>> On 2023-07-05 09:52, Vincent Guittot wrote:
>>> Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
>>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]>
>>>> > wrote:
>>>> > >
>>>> > > The current load balancer implementation implies that scheduler
>>>> > > groups,
>>>> > > within the same domain, all host the same number of CPUs. This is
>>>> > > reflected in the condition, that a scheduler group, which is load
>>>> > > balancing and classified as having spare capacity, should pull work
>>>> > > from the busiest group, if the local group runs less processes than
>>>> > > the busiest one. This implies that these two groups should run the
>>>> > > same number of processes, which is problematic if the groups are not
>>>> > > of the same size.
>>>> > >
>>>> > > The assumption that scheduler groups within the same scheduler
>>>> domain
>>>> > > host the same number of CPUs appears to be true for non-s390
>>>> > > architectures. Nevertheless, s390 can have scheduler groups of
>>>> unequal
>>>> > > size.
>>>> > >
>>>> > > This introduces a performance degredation in the following scenario:
>>>> > >
>>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>>>> > > the remaining 2 are located on another socket:
>>>> > >
>>>> > > Socket   -----1-----    -2-
>>>> > > CPU      1 2 3 4 5 6    7 8
>>>> > >
>>>> > > Placing some workload ( x = one task ) yields the following
>>>> > > scenarios:
>>>> > >
>>>> > > The first 5 tasks are distributed evenly across the two groups.
>>>> > >
>>>> > > Socket   -----1-----    -2-
>>>> > > CPU      1 2 3 4 5 6    7 8
>>>> > >          x x x          x x
>>>> > >
>>>> > > Adding a 6th task yields the following distribution:
>>>> > >
>>>> > > Socket   -----1-----    -2-
>>>> > > CPU      1 2 3 4 5 6    7 8
>>>> > > SMT1     x x x          x x
>>>> > > SMT2                    x
>>>> >
>>>> > Your description is a bit confusing for me. What you name CPU above
>>>> > should be named Core, doesn' it ?
>>>> >
>>>> > Could you share with us your scheduler topology ?
>>>> >
>>>>
>>>> You are correct, it should say core instead of CPU.
>>>>
>>>> One actual configuration from one of my test machines (lscpu -e):
>>>>
>>>
>>> [...]
>>>
>>>>
>>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>>
>>> Thaks for the details
>>>
>>>>
>>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>>> distribution
>>>> like this:
>>>>
>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>                 x     x     x     x      x  x  x  x
>>>>
>>>> Which means that the two cores in the smaller group are running into
>>>> SMT
>>>> while two
>>>> cores in the larger group are still idle. This is caused by the
>>>> prefer_sibling path
>>>> which really wants to see both groups run the same number of tasks.
>>>
>>> yes and it considers that there are the same number of CPUs per group
>>>
>>>>
>>>> > >
>>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>>> has the
>>>> > > assumption that scheduler groups are of the same size, so they
>>>> should
>>>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>>>> > > thread, which comes with a performance penalty. This means, that in
>>>> > > the window of 6-8 tasks, load balancing is done suboptimally,
>>>> because
>>>> > > SMT is used although there is no reason to do so as fully idle CPUs
>>>> > > are still available.
>>>> > >
>>>> > > Taking the weight of the scheduler groups into account, ensures that
>>>> > > a load balancing CPU within a smaller group will not try to pull
>>>> tasks
>>>> > > from a bigger group while the bigger group still has idle CPUs
>>>> > > available.
>>>> > >
>>>> > > Signed-off-by: Tobias Huschle <[email protected]>
>>>> > > ---
>>>> > >  kernel/sched/fair.c | 3 ++-
>>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>>> > >
>>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>>> > > --- a/kernel/sched/fair.c
>>>> > > +++ b/kernel/sched/fair.c
>>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>>> > > *find_busiest_group(struct lb_env *env)
>>>> > >          * group's child domain.
>>>> > >          */
>>>> > >         if (sds.prefer_sibling && local->group_type ==
>>>> > > group_has_spare &&
>>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>>> > > +           busiest->sum_nr_running * local->group_weight >
>>>> > > +                       local->sum_nr_running *
>>>> > > busiest->group_weight + 1)
>>>
>>>
>>> what you want to test here is that moving 1 task from busiest to
>>> local
>>> group
>>> would help and balance the ratio of tasks per cpu
>>>
>>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>>> (local->sum_nr_running + 1) / local->group_weight
>>>
>>> which can be develop into
>>>
>>> busiest->sum_nr_running * local->group_weight >=
>>> local->sum_nr_running
>>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>>
>>> and you also have to change how we calculate the imbalance which just
>>> provide the half of the diff of nr_running
>>>
>>> by something like
>>>
>>> (busiest->sum_nr_running * local->group_weight) -
>>> (local->sum_nr_running * busiest->group_weight) /
>>> (busiest->group_weight + local->group_weight)
>>>
>>
>> Ahh right, I had a look at the imbalance part now and your suggestion
>> works
>> pretty well. Just had to make some minor adjustments so far.
>> Nice side effect is, that this allows the load balancer to behave
>> exactly the
>> same as before in the cases where local->group_weight ==
>> busiest->group_weight.
>> The behavior only changes for the case where the groups are not of
>> equal
>> size.
>
>
> Not sure if it has been figured/discussed out already, pointing one
> possible scenario.
>
> Taking the formulas:
> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
> * busiest->group_weight + busiest->group_weight + local->group_weight
> and calulate_imbalance:
> (busiest->sum_nr_running * local->group_weight) -
> (local->sum_nr_running * busiest->group_weight) /
> (busiest->group_weight + local->group_weight)
>

I was considering just using the imbalance as an indicator of whether we should
balance or not, i.e. checking if the second formula yields a value greater than 0.
Will have to play around with that a bit though.
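
Roughly something like this (untested sketch, just to illustrate the idea):

	/*
	 * Sketch only: balance if the computed imbalance would be > 0, i.e. if
	 * the busiest group currently runs more tasks per CPU than the local
	 * group. The denominator of the imbalance formula is always positive,
	 * so only the numerator needs to be checked here.
	 */
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
	    busiest->sum_nr_running * local->group_weight >
			local->sum_nr_running * busiest->group_weight)
		goto force_balance;

Whether this looser check behaves well in corner cases like the one you
describe would be part of the playing around mentioned above.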

>
> First lets say imbalance was like this. same example as before.
> sched_group in [busy_cpus/idle_cpus/group_weight]
> [3/9/12] - local group.
> [3/1/4] - busy group.
>
> 3*12 >= 3*4+12+4 --> true and imbalance would be (3*12-3*4)/(12+4) --
> 24/16 >> 1 -- lets say 1.
> we will balance, good.
>
> [4/8/12]
> [2/2/4]
> There will not be further load balances. good.
>
> a new process comes, it would be scheduled on [4/8/120 sched group as
> that would be idlest.
> [5/7/12]
> [2/2/4]
>
> Process running on [2/2/4] exits. then in this scenario do you expect
> the balance to happen again? Since balancing would result into optimal
> performance.
> [5/7/12] - busy_group
> [1/3/4] - local group
>
> 5*4 >= 1*12+12+4 --> will not balance.
>
> [5/7/12] - local group
> [1/3/4] - busy group
> 1*12 >= 5*4 + 12 + 4 --> will not balance.
>
> Is this scenario needs to be handled as well?

So, from an SMT standpoint, we would not need to balance here; neither group
should run into SMT. Now, would it be beneficial to balance anyway?
Now we have:
[5/7/12] -> 42% busy
[1/3/4] -> 25% busy

If we were to balance now and move one task around, we would get either
[6/6/12] -> 50% busy
[0/4/4] -> 0% busy
or
[4/8/12] -> 33% busy
[2/2/4] -> 50% busy

The first case probably does not make that much sense (unless we have a
workload which would benefit from, say, cache locality and we want everything
to run in one group).
The second case brings the smaller group right onto the edge of using SMT,
while also creating the possibility (depending on the algorithm we would use)
that the larger group will then attempt to pull work from the smaller group
again, ending up in a back and forth between the two. This is obviously also
true for the first variant.

Could you maybe elaborate on what you meant by optimal performance?

>
>>
>> I will figure out a solution and send a patch soon which incorporates
>> these
>> adjustments plus a more detailed description.
>>
>> Thanks for the feedback.
>>
>>>> >
>>>> > This is the prefer_sibling path. Could it be that you should disable
>>>> > prefer_siling between your sockets for such topology ? the default
>>>> > path compares the number of idle CPUs when groups has spare capacity
>>>> >
>>>> >
>>>>
>>>> If I disable prefer_sibling (I played around with it a bit), I run
>>>> into the
>>>> problem,
>>>> that the tasks are distributed s.t. each group has the same amount
>>>> of
>>>> idle
>>>> CPUs, which
>>>> yields distributions similar to this:
>>>>
>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>     x  x  x     x  x     x     x  x
>>>>
>>>> Now both groups have 4 idle CPUs which fulfills the criteria imposed
>>>> by the
>>>> load balancer,
>>>> but the larger group is now running SMT while the smaller one is
>>>> just
>>>> idle.
>>>>
>>>> So, in this asymmetric setup, both criteria seem to not work in an
>>>> optimal
>>>> way. Going for
>>>> the same number of idle CPUs or alternatively for the same number of
>>>> running
>>>> processes
>>>> both cause a sub-optimal distribution of tasks, leading to
>>>> unnecessary SMT.
>>>
>>> there is the same behavior and assumption here too
>>>
>>>
>>>>
>>>> It seems also to be possible to address the regular load balancing
>>>> path by
>>>> aiming to have the
>>>> same unused capacity between groups instead of the same number of
>>>> idle CPUs.
>>>> This seems to
>>>> have been considered in the past, but the choice went in favor of
>>>> the
>>>> number
>>>> of idle CPUs.
>>>
>>> unused capacity doesn't give the instantaneous state so a group can
>>> be
>>> idle but without
>>> unused capacity
>>>
>>>> Since this decision was actively taken already, I focused on the
>>>> prefer_sibling path.
>>>>
>>>> The question now would be how to address this properly (or if I'm
>>>> missing
>>>> something here).
>>>> As mentioned in the cover letter, this was the most simplistic and
>>>> least
>>>> invasive approach
>>>> I could find, others might be more sophisticated but also have some
>>>> side-effects.
>>>>
>>>> I have a bit of a hard time leaving this one as-is, as it just
>>>> introduces
>>>> two additional
>>>> multiplications with no effect for most architectures. Maybe an
>>>> architectures specific
>>>> inline function that the compiler can optimize away if not needed?
>>>>
>>>> > >                 goto force_balance;
>>>> > >
>>>> > >         if (busiest->group_type != group_overloaded) {
>>>> > > --
>>>> > > 2.34.1
>>>> > >
>>>>

2023-07-07 16:38:42

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer



On 7/7/23 9:29 PM, Tobias Huschle wrote:
> On 2023-07-07 16:33, Shrikanth Hegde wrote:
>> On 7/7/23 1:14 PM, Tobias Huschle wrote:
>>> On 2023-07-05 09:52, Vincent Guittot wrote:
>>>> Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
>>>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <[email protected]>
>>>>> > wrote:
>>>>> > >
>>>>> > > The current load balancer implementation implies that scheduler
>>>>> > > groups,
>>>>> > > within the same domain, all host the same number of CPUs. This is
>>>>> > > reflected in the condition, that a scheduler group, which is load
>>>>> > > balancing and classified as having spare capacity, should pull
>>>>> work
>>>>> > > from the busiest group, if the local group runs less processes
>>>>> than
>>>>> > > the busiest one. This implies that these two groups should run the
>>>>> > > same number of processes, which is problematic if the groups
>>>>> are not
>>>>> > > of the same size.
>>>>> > >
>>>>> > > The assumption that scheduler groups within the same scheduler
>>>>> domain
>>>>> > > host the same number of CPUs appears to be true for non-s390
>>>>> > > architectures. Nevertheless, s390 can have scheduler groups of
>>>>> unequal
>>>>> > > size.
>>>>> > >
>>>>> > > This introduces a performance degredation in the following
>>>>> scenario:
>>>>> > >
>>>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU
>>>>> socket,
>>>>> > > the remaining 2 are located on another socket:
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > >
>>>>> > > Placing some workload ( x = one task ) yields the following
>>>>> > > scenarios:
>>>>> > >
>>>>> > > The first 5 tasks are distributed evenly across the two groups.
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > >          x x x          x x
>>>>> > >
>>>>> > > Adding a 6th task yields the following distribution:
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > > SMT1     x x x          x x
>>>>> > > SMT2                    x
>>>>> >
>>>>> > Your description is a bit confusing for me. What you name CPU above
>>>>> > should be named Core, doesn' it ?
>>>>> >
>>>>> > Could you share with us your scheduler topology ?
>>>>> >
>>>>>
>>>>> You are correct, it should say core instead of CPU.
>>>>>
>>>>> One actual configuration from one of my test machines (lscpu -e):
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>>>
>>>> Thaks for the details
>>>>
>>>>>
>>>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>>>> distribution
>>>>> like this:
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>>                 x     x     x     x      x  x  x  x
>>>>>
>>>>> Which means that the two cores in the smaller group are running
>>>>> into SMT
>>>>> while two
>>>>> cores in the larger group are still idle. This is caused by the
>>>>> prefer_sibling path
>>>>> which really wants to see both groups run the same number of tasks.
>>>>
>>>> yes and it considers that there are the same number of CPUs per group
>>>>
>>>>>
>>>>> > >
>>>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>>>> has the
>>>>> > > assumption that scheduler groups are of the same size, so they
>>>>> should
>>>>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>>>>> > > thread, which comes with a performance penalty. This means,
>>>>> that in
>>>>> > > the window of 6-8 tasks, load balancing is done suboptimally,
>>>>> because
>>>>> > > SMT is used although there is no reason to do so as fully idle
>>>>> CPUs
>>>>> > > are still available.
>>>>> > >
>>>>> > > Taking the weight of the scheduler groups into account, ensures
>>>>> that
>>>>> > > a load balancing CPU within a smaller group will not try to pull
>>>>> tasks
>>>>> > > from a bigger group while the bigger group still has idle CPUs
>>>>> > > available.
>>>>> > >
>>>>> > > Signed-off-by: Tobias Huschle <[email protected]>
>>>>> > > ---
>>>>> > >  kernel/sched/fair.c | 3 ++-
>>>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>> > >
>>>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>>>> > > --- a/kernel/sched/fair.c
>>>>> > > +++ b/kernel/sched/fair.c
>>>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>>>> > > *find_busiest_group(struct lb_env *env)
>>>>> > >          * group's child domain.
>>>>> > >          */
>>>>> > >         if (sds.prefer_sibling && local->group_type ==
>>>>> > > group_has_spare &&
>>>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>>>> > > +           busiest->sum_nr_running * local->group_weight >
>>>>> > > +                       local->sum_nr_running *
>>>>> > > busiest->group_weight + 1)
>>>>
>>>>
>>>> what you want to test here is that moving 1 task from busiest to local
>>>> group
>>>> would help and balance the ratio of tasks per cpu
>>>>
>>>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>>>> (local->sum_nr_running + 1) / local->group_weight
>>>>
>>>> which can be develop into
>>>>
>>>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>>>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>>>
>>>> and you also have to change how we calculate the imbalance which just
>>>> provide the half of the diff of nr_running
>>>>
>>>> by something like
>>>>
>>>> (busiest->sum_nr_running * local->group_weight) -
>>>> (local->sum_nr_running * busiest->group_weight) /
>>>> (busiest->group_weight + local->group_weight)
>>>>
>>>
>>> Ahh right, I had a look at the imbalance part now and your suggestion
>>> works
>>> pretty well. Just had to make some minor adjustments so far.
>>> Nice side effect is, that this allows the load balancer to behave
>>> exactly the
>>> same as before in the cases where local->group_weight ==
>>> busiest->group_weight.
>>> The behavior only changes for the case where the groups are not of equal
>>> size.
>>
>>
>> Not sure if it has been figured/discussed out already, pointing one
>> possible scenario.
>>
>> Taking the formulas:
>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>> * busiest->group_weight + busiest->group_weight + local->group_weight
>> and calulate_imbalance:
>> (busiest->sum_nr_running * local->group_weight) -
>> (local->sum_nr_running * busiest->group_weight) /
>> (busiest->group_weight + local->group_weight)
>>
>
> I was considering to just use the imbalance as an indicator whether we
> should
> balance or not, i.e. check if the second formula yields a value greater
> than 0.
> Will have to play around with that a bit though.
>
>>
>> First lets say imbalance was like this. same example as before.
>> sched_group in [busy_cpus/idle_cpus/group_weight]
>> [3/9/12] - local group.
>> [3/1/4] - busy group.
>>
>> 3*12 >= 3*4+12+4 --> true and imbalance would be (3*12-3*4)/(12+4) --
>> 24/16 >> 1 -- lets say 1.
>> we will balance, good.
>>
>> [4/8/12]
>> [2/2/4]
>> There will not be further load balances. good.
>>
>> a new process comes, it would be scheduled on [4/8/120 sched group as
>> that would be idlest.
>> [5/7/12]
>> [2/2/4]
>>
>> Process running on [2/2/4] exits. then in this scenario do you expect
>> the balance to happen again? Since balancing would result into optimal
>> performance.
>> [5/7/12] - busy_group
>> [1/3/4] - local group
>>
>> 5*4 >= 1*12+12+4 --> will not balance.
>>
>> [5/7/12] - local group
>> [1/3/4] - busy group
>> 1*12 >= 5*4 + 12 + 4 --> will not balance.
>>
>> Is this scenario needs to be handled as well?
>
> So, from an SMT standpoint, we would not need to balance here, both groups
> should not run into SMT. Now, would it be beneficial to balance anyway?
> Now we have:
> [5/7/12] -> 42% busy
> [1/3/4]  -> 25% busy
>
> If we would now balance and move one task around we would get either
> [6/6/12] -> 50% busy
> [0/4/4]  ->  0% busy
> or
> [4/8/12] -> 33% busy
> [2/2/4]  -> 50% busy
>
> The first case does probably not make that much sense (unless we have
> workload
> which would benefit from maybe cache locality) and we want everything to
> run
> in one group.
> The second case brings the smaller group right onto the edge of using
> SMT, while
> also creating the possibility (depending on the algorithm we would use),
> that
> now the larger group will attempt to pull work from the smaller group
> again,
> ending up in a back and forth between the two. This is obviously also
> true for
> the first variant.
>
> Could you maybe elaborate on what you meant by optimal performance?
>

I assumed it might be optimal to have both groups run at a more or less
similar utilization point.

Now that I read your description, it makes sense; load balancing may not be
needed in this case. I did a few combinations for the balance-check condition
and I think it holds good (haven't done all the cases). Sorry for the noise.


>>
>>>
>>> I will figure out a solution and send a patch soon which incorporates
>>> these
>>> adjustments plus a more detailed description.
>>>
>>> Thanks for the feedback.
>>>
>>>>> >
>>>>> > This is the prefer_sibling path. Could it be that you should disable
>>>>> > prefer_siling between your sockets for such topology ? the default
>>>>> > path compares the number of idle CPUs when groups has spare capacity
>>>>> >
>>>>> >
>>>>>
>>>>> If I disable prefer_sibling (I played around with it a bit), I run
>>>>> into the
>>>>> problem,
>>>>> that the tasks are distributed s.t. each group has the same amount of
>>>>> idle
>>>>> CPUs, which
>>>>> yields distributions similar to this:
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>>     x  x  x     x  x     x     x  x
>>>>>
>>>>> Now both groups have 4 idle CPUs which fulfills the criteria imposed
>>>>> by the
>>>>> load balancer,
>>>>> but the larger group is now running SMT while the smaller one is just
>>>>> idle.
>>>>>
>>>>> So, in this asymmetric setup, both criteria seem to not work in an
>>>>> optimal
>>>>> way. Going for
>>>>> the same number of idle CPUs or alternatively for the same number of
>>>>> running
>>>>> processes
>>>>> both cause a sub-optimal distribution of tasks, leading to
>>>>> unnecessary SMT.
>>>>
>>>> there is the same behavior and assumption here too
>>>>
>>>>
>>>>>
>>>>> It seems also to be possible to address the regular load balancing
>>>>> path by
>>>>> aiming to have the
>>>>> same unused capacity between groups instead of the same number of
>>>>> idle CPUs.
>>>>> This seems to
>>>>> have been considered in the past, but the choice went in favor of the
>>>>> number
>>>>> of idle CPUs.
>>>>
>>>> unused capacity doesn't give the instantaneous state so a group can be
>>>> idle but without
>>>> unused capacity
>>>>
>>>>> Since this decision was actively taken already, I focused on the
>>>>> prefer_sibling path.
>>>>>
>>>>> The question now would be how to address this properly (or if I'm
>>>>> missing
>>>>> something here).
>>>>> As mentioned in the cover letter, this was the most simplistic and
>>>>> least
>>>>> invasive approach
>>>>> I could find, others might be more sophisticated but also have some
>>>>> side-effects.
>>>>>
>>>>> I have a bit of a hard time leaving this one as-is, as it just
>>>>> introduces
>>>>> two additional
>>>>> multiplications with no effect for most architectures. Maybe an
>>>>> architectures specific
>>>>> inline function that the compiler can optimize away if not needed?
>>>>>
>>>>> > >                 goto force_balance;
>>>>> > >
>>>>> > >         if (busiest->group_type != group_overloaded) {
>>>>> > > --
>>>>> > > 2.34.1
>>>>> > >
>>>>>