Date:   Tue, 18 Jul 2023 18:25:22 +0100
From:   Qais Yousef <qyousef@layalina.io>
To:     Vincent Guittot <vincent.guittot@linaro.org>
Cc:     Ingo Molnar <mingo@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in
 load balance
Message-ID: <20230718172522.s4gcfx3ppljwbks7@airbuntu>
References: <20230716014125.139577-1-qyousef@layalina.io>
 <ZLaKFFjY6NWaJdOq@vingu-book>
 <20230718161829.ws3vn3ufnod6kpxh@airbuntu>
 <CAKfTPtA55NemHq0tZPuiEN=c3DRZWD-7jf7ZrKdHE9y9b_szZg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAKfTPtA55NemHq0tZPuiEN=c3DRZWD-7jf7ZrKdHE9y9b_szZg@mail.gmail.com>
Precedence: bulk

On 07/18/23 18:31, Vincent Guittot wrote:
> On Tue, 18 Jul 2023 at 18:18, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 07/18/23 14:48, Vincent Guittot wrote:
> > > On Sunday 16 Jul 2023 at 02:41:25 (+0100), Qais Yousef wrote:
> > > > We've seen cases while running geekbench that an idle little core never
> > > > pulls a task from a bigger overloaded cluster for 100s of ms and
> > > > sometimes over a second.
> > > >
> > > > It turned out that the load balance identifies this as a migrate_util
> > > > type since the local group (little cluster) has a spare capacity and
> > > > will try to pull a task. But the little cluster capacity is very small
> > > > nowadays (around 200 or less) and if two busy tasks are stuck on a mid
> > > > core which has a capacity of over 700, this means the util of each task
> > > > will be in the 350+ range, which is always bigger than the spare
> > > > capacity of the little group with a single idle core.
> > > >
> > > > When trying to detach_tasks() we bail out then because of the comparison
> > > > of:
> > > >
> > > >     if (util > env->imbalance)
> > > >             goto next;
> > > >
> > > > In calculate_imbalance() we convert a migrate_util into migrate_task
> > > > type if the CPU trying to do the pull is idle. But we only do this if
> > > > env->imbalance is 0; which I can't understand. AFAICT env->imbalance
> > > > contains the local group's spare capacity. If it is 0, this means it's
> > > > fully busy.
> > > >
> > > > Removing this condition fixes the problem, but since I can't fully
> > > > understand why it checks for 0, sending this as RFC. It could be a typo
> > > > and meant to check for
> > > >
> > > >     env->imbalance != 0
> > > >
> > > > instead?
> > > >
> > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index a80a73909dc2..682d9d6a8691 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > >                      * waiting task in this overloaded busiest group. Let's
> > > >                      * try to pull it.
> > > >                      */
> > > > -                   if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> > > > +                   if (env->idle != CPU_NOT_IDLE) {
> > >
> > > With this change you completely skip migrate_util for the idle and newly
> > > idle cases, and this would be too aggressive.
> >
> > Yeah I didn't have great confidence in it to be honest.
> >
> > Could you help me understand the meaning of env->imbalance == 0 though? At this
> > stage its value is
> >
> >         env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
> >
> > which AFAICT is calculating the _spare_ capacity, right? So when we check
> > env->imbalance == 0 we say if this_cpu is (idle OR newly idle) AND the local
> > group is fully utilized? Why must it be fully utilized to do the pull? It's
> > counterintuitive to me. I'm probably misinterpreting something but can't see
> 
> This is a special case. We have some situations where group_util is
> higher than capacity, for example because of tasks newly migrated to
> this group, so the spare capacity is zero but one cpu is idle or newly
> idle. In this case we try to pull a task with the risk that this group
> becomes overloaded. That's why we do not try to pull a task every
> time.
> But that might be a good choice all the time

So on misfit, I do see that a bigger cpu will pull the task quickly as soon as
a bigger cpu gets idle.

This scenario is the opposite. Maybe the exception in my case is that the
little cpu has spare capacity as it's mostly idle all the time. It's just that
unlucky circumstances at wake up ended up putting two tasks on a bigger core.
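
To put rough numbers on my case, here is a sketch only. The struct and helper
names are made up for illustration and are not the kernel's; the logic is my
reading of the quoted spare capacity expression, the current conversion
condition and the detach_tasks() check.

```c
#include <stdbool.h>

/* Simplified stand-ins for the scheduler's types; not the kernel's definitions. */
enum migration_type { migrate_util, migrate_task };

struct lb_sketch {
	unsigned long imbalance;
	enum migration_type type;
};

/* Spare capacity of the local group: max(capacity, util) - util. */
static unsigned long spare_capacity(unsigned long group_capacity,
				    unsigned long group_util)
{
	unsigned long m = group_capacity > group_util ? group_capacity : group_util;

	return m - group_util;
}

/*
 * Mirrors the current condition: convert to migrate_task only when the
 * pulling CPU is idle AND the local group has no spare capacity left.
 */
static struct lb_sketch classify(bool cpu_idle, unsigned long group_capacity,
				 unsigned long group_util)
{
	struct lb_sketch env = {
		.imbalance = spare_capacity(group_capacity, group_util),
		.type = migrate_util,
	};

	if (cpu_idle && env.imbalance == 0) {
		env.type = migrate_task;
		env.imbalance = 1;
	}
	return env;
}

/* The migrate_util check in detach_tasks(): true means the task is skipped. */
static bool task_skipped(const struct lb_sketch *env, unsigned long util)
{
	return env->type == migrate_util && util > env->imbalance;
}
```

With an idle little group of capacity 200 and util 0, classify() keeps
migrate_util with an imbalance of 200, so a task with util 350 is always
skipped. Only when group_util exceeds the capacity (say 260 vs 200) does the
type become migrate_task.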

Specifically, at the start of some of the sub-tests, there's a good chance that
we have simultaneous wake ups and there's a limitation/race in EAS because of
the gap between select_task_rq_fair() and enqueue_task_fair(). If two tasks wake
up simultaneously, select_task_rq_fair() could be called twice before
enqueue_task_fair() and end up selecting the same CPU for both tasks, not
realizing one of them is just waiting to be enqueued. IOW, EAS will not take
into account the updated util of one of the CPUs because of the (short) delay
to enqueue it.
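
A toy model of that race, heavily simplified: select_cpu() and enqueue() are
hypothetical stand-ins for select_task_rq_fair() and enqueue_task_fair(), and
"pick the CPU with the most spare capacity" is an assumption that replaces the
real energy computation. The capacities and utils are invented.

```c
#include <stddef.h>

#define NR_CPUS 3

/* Only util that has already been enqueued is visible here. */
struct cpu_sketch {
	unsigned long capacity;
	unsigned long util;
};

/*
 * Stand-in for select_task_rq_fair(): pick the first CPU with the most
 * spare capacity that still fits the task. Returns -1 if nothing fits.
 */
static int select_cpu(const struct cpu_sketch *cpus, size_t nr,
		      unsigned long task_util)
{
	int best = -1;
	unsigned long best_spare = 0;

	for (size_t i = 0; i < nr; i++) {
		unsigned long spare = cpus[i].capacity > cpus[i].util ?
				      cpus[i].capacity - cpus[i].util : 0;

		if (spare >= task_util && spare > best_spare) {
			best = (int)i;
			best_spare = spare;
		}
	}
	return best;
}

/* Stand-in for enqueue_task_fair(): only now does the CPU's util change. */
static void enqueue(struct cpu_sketch *cpus, int cpu, unsigned long task_util)
{
	cpus[cpu].util += task_util;
}
```

With one little (capacity 200) and two mids (capacity 700 each), calling
select_cpu() twice for two tasks of util 350 before either enqueue() returns
the same mid for both. Interleaving select, enqueue, select, enqueue spreads
them across the two mids.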

This should be fixed (the wake up race), but this is a different story and
a bit trickier.

The risk of pulling always is:

	1. Risk force migrating prev task if it woke up shortly after the pull.
	   Which is no worse IMHO than misfit going almost immediately to
	   bigger core.

	2. Not being too smart about which task to pull. I can envisage
	   other scenarios where one of the two tasks is the better one to
	   pull. In geekbench both tasks are equal. But maybe in other use
	   cases one of them is less impactful. For example if one of them has
	   a low uclamp_max but the other doesn't. But this case is an
	   unsupported feature at the moment. My plan (hope) is to treat these
	   uclamp_max cases as misfit migration. Which I think is the better
	   path in general to treat special cases. So for migrate_util this
	   behavior might be sensible all the time. We are working too hard,
	   let's use all of our resources and make use of all idle cpus. If
	   prev_task wakes up, there's no harm; I doubt cache hotness is
	   a problem given the two tasks are busy all the time thrashing L1
	   anyway.

	3. Not sure what will happen in cases where we have nr_running > nr_cpus
	   and some tasks happen to sleep for brief periods of time. Two tasks
	   stuck on a little core is worse than two tasks stuck on a mid or big
	   core. But maybe migrate_util will pull it back again given how
	   little capacity they have?
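
For reference on the nr_balance_failed observation, a rough sketch of how many
failed balance attempts the shr_bound() relaxation needs before a task passes
the migrate_util check. shr_bound_sketch() is my simplified stand-in for the
kernel macro and failures_needed() is a made-up helper; the numbers are
illustrative.

```c
#include <limits.h>

/*
 * Simplified shr_bound(): right shift with the shift count clamped so it
 * never reaches the width of unsigned long.
 */
static unsigned long shr_bound_sketch(unsigned long val, unsigned int shift)
{
	unsigned int max_shift = sizeof(val) * CHAR_BIT - 1;

	return val >> (shift < max_shift ? shift : max_shift);
}

/*
 * Smallest nr_balance_failed at which a task of 'util' passes the relaxed
 * check against 'imbalance'. The loop is bounded by the maximum shift.
 */
static unsigned int failures_needed(unsigned long util, unsigned long imbalance)
{
	unsigned int max_shift = sizeof(util) * CHAR_BIT - 1;
	unsigned int failed = 0;

	while (failed < max_shift && shr_bound_sketch(util, failed) > imbalance)
		failed++;
	return failed;
}
```

A task with util 350 against a spare capacity of 200 needs 1 failure (350 >> 1
is 175). If something brief drops the little's spare capacity to 80, it needs
3 (350 >> 3 is 43), which matches the delay growing when nr_balance_failed
mostly stays at 0 or 1.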

Cheers

--
Qais Yousef

> 
> > it.
> >
> > >
> > > We can do something similar to migrate_load in detach_tasks():
> > >
> > > ---
> > >  kernel/sched/fair.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d3df5b1642a6..64111ac7e137 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
> > >               case migrate_util:
> > >                       util = task_util_est(p);
> > >
> > > -                     if (util > env->imbalance)
> > > +                     /*
> > > +                      * Make sure that we don't migrate too much utilization.
> > > +                      * Nevertheless, let relax the constraint if
> > > +                      * scheduler fails to find a good waiting task to
> > > +                      * migrate.
> > > +                      */
> > > +                     if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
> > >                               goto next;
> >
> > Thanks! This looks better but I still see a 100 or 200 ms delay sometimes.
> > Still debugging it but I _think_ it's a combination of two things:
> >
> >         1. nr_balance_failed doesn't increment as fast - I see a lot of 0s with
> >            occasional 1s and less frequent 2s
> >         2. something might wake up briefly on that cpu in between load balance,
> >            and given how small the littles are, that pushes the
> >            nr_balance_failed required to tip the scale even higher
> >
> >
> > Thanks
> >
> > --
> > Qais Yousef
> >
> > >
> > >                       env->imbalance -= util;
> > > --
> > >
> > >
> > >
> > > >                             env->migration_type = migrate_task;
> > > >                             env->imbalance = 1;
> > > >                     }
> > > > --
> > > > 2.25.1
> > > >