Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <27520864-3693-324e-717e-6bd628fb75ed@linux.vnet.ibm.com>
Date:   Fri, 7 Jul 2023 20:03:35 +0530
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.12.0
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in
 load balancer
Content-Language: en-US
To:     Tobias Huschle <huschle@linux.ibm.com>
Cc:     juri.lelli@redhat.com, vschneid@redhat.com,
        srikar@linux.vnet.ibm.com, peterz@infradead.org,
        linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
        rostedt@goodmis.org, bsegall@google.com, mingo@redhat.com,
        mgorman@suse.de, bristot@redhat.com, dietmar.eggemann@arm.com,
        Vincent Guittot <vincent.guittot@linaro.org>
References: <20230515114601.12737-1-huschle@linux.ibm.com>
 <20230515114601.12737-2-huschle@linux.ibm.com>
 <CAKfTPtC9050oY2EikUTAXTL8pAui3L+Sr4DBS0T-TccGNaA2hw@mail.gmail.com>
 <9021d4d99370162a815928cd6467f4a5@linux.ibm.com>
 <ZKUhPO3xcvCfjWfe@vingu-book>
 <12befe210503469beac9e711d9782675@linux.ibm.com>
From:   Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
In-Reply-To: <12befe210503469beac9e711d9782675@linux.ibm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk


On 7/7/23 1:14 PM, Tobias Huschle wrote:
> On 2023-07-05 09:52, Vincent Guittot wrote:
>> Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>>> > wrote:
>>> > >
>>> > > The current load balancer implementation implies that scheduler
>>> > > groups,
>>> > > within the same domain, all host the same number of CPUs. This is
>>> > > reflected in the condition, that a scheduler group, which is load
>>> > > balancing and classified as having spare capacity, should pull work
>>> > > from the busiest group, if the local group runs less processes than
>>> > > the busiest one. This implies that these two groups should run the
>>> > > same number of processes, which is problematic if the groups are not
>>> > > of the same size.
>>> > >
>>> > > The assumption that scheduler groups within the same scheduler
>>> domain
>>> > > host the same number of CPUs appears to be true for non-s390
>>> > > architectures. Nevertheless, s390 can have scheduler groups of
>>> unequal
>>> > > size.
>>> > >
>>> > > This introduces a performance degredation in the following scenario:
>>> > >
>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>>> > > the remaining 2 are located on another socket:
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > >
>>> > > Placing some workload ( x = one task ) yields the following
>>> > > scenarios:
>>> > >
>>> > > The first 5 tasks are distributed evenly across the two groups.
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > >          x x x          x x
>>> > >
>>> > > Adding a 6th task yields the following distribution:
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > > SMT1     x x x          x x
>>> > > SMT2                    x
>>> >
>>> > Your description is a bit confusing for me. What you name CPU above
>>> > should be named Core, doesn' it ?
>>> >
>>> > Could you share with us your scheduler topology ?
>>> >
>>>
>>> You are correct, it should say core instead of CPU.
>>>
>>> One actual configuration from one of my test machines (lscpu -e):
>>>
>>
>> [...]
>>
>>>
>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>
>> Thaks for the details
>>
>>>
>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>> distribution
>>> like this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>                 x     x     x     x      x  x  x  x
>>>
>>> Which means that the two cores in the smaller group are running into SMT
>>> while two
>>> cores in the larger group are still idle. This is caused by the
>>> prefer_sibling path
>>> which really wants to see both groups run the same number of tasks.
>>
>> yes and it considers that there are the same number of CPUs per group
>>
>>>
>>> > >
>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>> has the
>>> > > assumption that scheduler groups are of the same size, so they
>>> should
>>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>>> > > thread, which comes with a performance penalty. This means, that in
>>> > > the window of 6-8 tasks, load balancing is done suboptimally,
>>> because
>>> > > SMT is used although there is no reason to do so as fully idle CPUs
>>> > > are still available.
>>> > >
>>> > > Taking the weight of the scheduler groups into account, ensures that
>>> > > a load balancing CPU within a smaller group will not try to pull
>>> tasks
>>> > > from a bigger group while the bigger group still has idle CPUs
>>> > > available.
>>> > >
>>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>>> > > ---
>>> > >  kernel/sched/fair.c | 3 ++-
>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>> > >
>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>> > > --- a/kernel/sched/fair.c
>>> > > +++ b/kernel/sched/fair.c
>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>> > > *find_busiest_group(struct lb_env *env)
>>> > >          * group's child domain.
>>> > >          */
>>> > >         if (sds.prefer_sibling && local->group_type ==
>>> > > group_has_spare &&
>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>> > > +           busiest->sum_nr_running * local->group_weight >
>>> > > +                       local->sum_nr_running *
>>> > > busiest->group_weight + 1)
>>
>>
>> what you want to test here is that moving 1 task from busiest to local
>> group
>> would help and balance the ratio of tasks per cpu
>>
>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>> (local->sum_nr_running + 1) / local->group_weight
>>
>> which can be develop into
>>
>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>
>> and you also have to change how we calculate the imbalance which just
>> provide the half of the diff of nr_running
>>
>> by something like
>>
>> (busiest->sum_nr_running * local->group_weight) -
>> (local->sum_nr_running * busiest->group_weight) /
>> (busiest->group_weight + local->group_weight)
>>
> 
> Ahh right, I had a look at the imbalance part now and your suggestion works
> pretty well. Just had to make some minor adjustments so far.
> Nice side effect is, that this allows the load balancer to behave
> exactly the
> same as before in the cases where local->group_weight ==
> busiest->group_weight.
> The behavior only changes for the case where the groups are not of equal
> size.


Not sure if it has been figured/discussed out already, pointing one possible scenario. 

Taking the formulas:
busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight
and calulate_imbalance:
(busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight) / (busiest->group_weight + local->group_weight)


First lets say imbalance was like this. same example as before. sched_group in [busy_cpus/idle_cpus/group_weight]
[3/9/12] - local group.
[3/1/4] - busy group.

3*12 >= 3*4+12+4 --> true and imbalance would be (3*12-3*4)/(12+4) -- 24/16 >> 1 -- lets say 1. 
we will balance, good.

[4/8/12] 
[2/2/4] 
There will not be further load balances. good.

a new process comes, it would be scheduled on [4/8/120 sched group as that would be idlest. 
[5/7/12] 
[2/2/4] 

Process running on [2/2/4] exits. then in this scenario do you expect the balance to happen again? Since balancing would result into optimal performance.
[5/7/12] - busy_group
[1/3/4] - local group

5*4 >= 1*12+12+4 --> will not balance. 

[5/7/12] - local group
[1/3/4] - busy group 
1*12 >= 5*4 + 12 + 4 --> will not balance. 

Is this scenario needs to be handled as well?

> 
> I will figure out a solution and send a patch soon which incorporates these
> adjustments plus a more detailed description.
> 
> Thanks for the feedback.
> 
>>> >
>>> > This is the prefer_sibling path. Could it be that you should disable
>>> > prefer_siling between your sockets for such topology ? the default
>>> > path compares the number of idle CPUs when groups has spare capacity
>>> >
>>> >
>>>
>>> If I disable prefer_sibling (I played around with it a bit), I run
>>> into the
>>> problem,
>>> that the tasks are distributed s.t. each group has the same amount of
>>> idle
>>> CPUs, which
>>> yields distributions similar to this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>     x  x  x     x  x     x     x  x
>>>
>>> Now both groups have 4 idle CPUs which fulfills the criteria imposed
>>> by the
>>> load balancer,
>>> but the larger group is now running SMT while the smaller one is just
>>> idle.
>>>
>>> So, in this asymmetric setup, both criteria seem to not work in an
>>> optimal
>>> way. Going for
>>> the same number of idle CPUs or alternatively for the same number of
>>> running
>>> processes
>>> both cause a sub-optimal distribution of tasks, leading to
>>> unnecessary SMT.
>>
>> there is the same behavior and assumption here too
>>
>>
>>>
>>> It seems also to be possible to address the regular load balancing
>>> path by
>>> aiming to have the
>>> same unused capacity between groups instead of the same number of
>>> idle CPUs.
>>> This seems to
>>> have been considered in the past, but the choice went in favor of the
>>> number
>>> of idle CPUs.
>>
>> unused capacity doesn't give the instantaneous state so a group can be
>> idle but without
>> unused capacity
>>
>>> Since this decision was actively taken already, I focused on the
>>> prefer_sibling path.
>>>
>>> The question now would be how to address this properly (or if I'm
>>> missing
>>> something here).
>>> As mentioned in the cover letter, this was the most simplistic and least
>>> invasive approach
>>> I could find, others might be more sophisticated but also have some
>>> side-effects.
>>>
>>> I have a bit of a hard time leaving this one as-is, as it just
>>> introduces
>>> two additional
>>> multiplications with no effect for most architectures. Maybe an
>>> architectures specific
>>> inline function that the compiler can optimize away if not needed?
>>>
>>> > >                 goto force_balance;
>>> > >
>>> > >         if (busiest->group_type != group_overloaded) {
>>> > > --
>>> > > 2.34.1
>>> > >
>>>