Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
MIME-Version: 1.0
References: <CALOAHbAS26LP2p9Fe7m6xynZmazYENmx_HfTV4LebwPWr7XLmA@mail.gmail.com>
 <dfbe1030-05bf-3371-bc0a-56f79dcd6f39@arm.com>
In-Reply-To: <dfbe1030-05bf-3371-bc0a-56f79dcd6f39@arm.com>
From:   Yafang Shao <laoar.shao@gmail.com>
Date:   Mon, 19 Jul 2021 20:11:50 +0800
Message-ID: <CALOAHbBLTwjnYyqdSkAqzT=X9v-NSygM0rfK_Bk5JMwZ6vB_fQ@mail.gmail.com>
Subject: Re: [RFC PATCH 1/1] sched: do active load balance in balance callback
To:     Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc:     Vincent Guittot <vincent.guittot@linaro.org>,
        Peter Zijlstra <peterz@infradead.org>,
        Valentin Schneider <valentin.schneider@arm.com>,
        Ingo Molnar <mingo@redhat.com>,
        Juri Lelli <juri.lelli@redhat.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Benjamin Segall <bsegall@google.com>,
        Mel Gorman <mgorman@suse.de>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        LKML <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Wed, Jul 14, 2021 at 10:23 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 11/07/2021 09:40, Yafang Shao wrote:
> > Active load balance, which migrates the CFS task running on the
> > busiest CPU to the newly idle CPU, has a known issue[1][2]: there is
> > a race window between waking up the migration thread on the busiest
> > CPU and that thread preempting the currently running CFS task. This
> > race window may cause unexpected behavior: latency-sensitive RT tasks
> > may be preempted by the migration thread, as it has a higher
> > priority.
> >
> > This RFC patch tries to improve this situation. Instead of waking up the
> > migration thread to do this work, this patch does it in the balance
> > callback as follows:
> >
> >      The New idle CPUm                The target CPUn
> >      find the target task A           CFS task A is running
> >      queue it into the target CPUn    A is scheduling out
> >                                       do balance callback and migrate A to CPUm
> > It avoids two context switches - task A to migration/n and migration/n to
> > task B. And it avoids preempting the RT task if the RT task has already
> > preempted task A before we do the queueing.
> >
> > TODO:
> > - I haven't run any benchmarks yet to measure the impact on performance
> > - To avoid deadlock I have to unlock the busiest_rq->lock before
> >   calling attach_one_task() and lock it again after executing
> >   attach_one_task(). That may re-introduce the issue addressed by
> >   commit 565790d28b1e ("sched: Fix balance_callback()")
> >
> > [1]. https://lore.kernel.org/lkml/CAKfTPtBygNcVewbb0GQOP5xxO96am3YeTZNP5dK9BxKHJJAL-g@mail.gmail.com/
> > [2]. https://lore.kernel.org/lkml/20210615121551.31138-1-laoar.shao@gmail.com/
>
> This didn't apply for me and I guess won't compile on tip/sched/core:
>
> raw_spin_{,un}lock(&busiest_rq->lock) -> raw_spin_rq_{,un}lock(busiest_rq)
>
> p->state == TASK_RUNNING -> p->__state or task_is_running(p)
>

I made this patch against Linus's tree. I will rebase it onto tip/sched/core.

> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  kernel/sched/core.c  |  1 +
> >  kernel/sched/fair.c  | 69 ++++++++++++++------------------------------
> >  kernel/sched/sched.h |  6 +++-
> >  3 files changed, 28 insertions(+), 48 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ca80df205ce..a0a90a37e746 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8208,6 +8208,7 @@ void __init sched_init(void)
> >                 rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
> >                 rq->balance_callback = &balance_push_callback;
> >                 rq->active_balance = 0;
> > +               rq->active_balance_target = NULL;
> >                 rq->next_balance = jiffies;
> >                 rq->push_cpu = 0;
> >                 rq->cpu = i;
>
> [...]
>
> > +DEFINE_PER_CPU(struct callback_head, active_balance_head);
> > +
> >  /*
> >   * Check this_cpu to ensure it is balanced within domain. Attempt to move
> >   * tasks if there is an imbalance.
> > @@ -9845,15 +9817,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >                         if (!busiest->active_balance) {
> >                                 busiest->active_balance = 1;
> >                                 busiest->push_cpu = this_cpu;
> > +                               busiest->active_balance_target = busiest->curr;
> >                                 active_balance = 1;
> >                         }
> > -                       raw_spin_unlock_irqrestore(&busiest->lock, flags);
> >
> > -                       if (active_balance) {
> > -                               stop_one_cpu_nowait(cpu_of(busiest),
> > -                                       active_load_balance_cpu_stop, busiest,
> > -                                       &busiest->active_balance_work);
> > -                       }
> > +                       if (active_balance)
> > +                               queue_balance_callback(busiest, &per_cpu(active_balance_head, busiest->cpu), active_load_balance_cpu_stop);
>
>
> When you defer the active load balance of p into a balance_callback
> (from __schedule()) p has to stop running on busiest, right?

Right. But p doesn't have to stop running immediately.
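
To make the intended ordering concrete, here is a rough userspace sketch. It
is only a toy model, not the patch and not kernel code: every toy_* name is
made up for illustration, tasks and CPUs are plain ints, and no locking is
modelled. It shows the idle CPU recording the target and queueing a callback
on the busiest runqueue, with the migration happening only once the running
task schedules out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; all names here are hypothetical. */
struct toy_rq;

struct toy_callback {
	struct toy_callback *next;
	void (*func)(struct toy_rq *rq);
	int queued;                 /* guards against double queueing */
};

struct toy_rq {
	int curr;                   /* id of the running task */
	int push_cpu;               /* CPU that asked for the migration */
	int active_balance;         /* 1 while a request is pending */
	int active_balance_target;  /* task to migrate, -1 if none */
	int migrated_to;            /* CPU the target went to, -1 if none */
	struct toy_callback *balance_callback;
};

/* Link a callback onto the runqueue; it does not run yet. */
void toy_queue_callback(struct toy_rq *rq, struct toy_callback *head,
			void (*func)(struct toy_rq *rq))
{
	assert(rq && head && func);
	if (head->queued)
		return;
	head->func = func;
	head->next = rq->balance_callback;
	rq->balance_callback = head;
	head->queued = 1;
}

/* Deferred work: "migrate" the recorded target to push_cpu. */
void toy_active_balance(struct toy_rq *rq)
{
	if (rq->active_balance_target >= 0)
		rq->migrated_to = rq->push_cpu;
	rq->active_balance_target = -1;
	rq->active_balance = 0;
}

/* Idle CPU side: record the target and queue the callback. */
int toy_request_active_balance(struct toy_rq *busiest, int this_cpu,
			       struct toy_callback *head)
{
	if (busiest->active_balance)
		return 0;           /* a request is already pending */
	busiest->active_balance = 1;
	busiest->push_cpu = this_cpu;
	busiest->active_balance_target = busiest->curr;
	toy_queue_callback(busiest, head, toy_active_balance);
	return 1;
}

/* Busiest CPU side: switch tasks, then run the queued callbacks. */
void toy_schedule_out(struct toy_rq *rq, int next_task)
{
	struct toy_callback *head = rq->balance_callback;

	rq->curr = next_task;
	rq->balance_callback = NULL;
	while (head) {
		struct toy_callback *next = head->next;

		head->next = NULL;
		head->queued = 0;
		head->func(rq);
		head = next;
	}
}
```

In this model nothing moves between toy_request_active_balance() and
toy_schedule_out(); that gap is the deferral being discussed.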

> Deferring active load balance for too long might defeat the purpose
> of load balance which has to happen now.
>

Maybe we need to run some benchmarks to measure whether it is acceptable to
defer the active load balance.
But I don't yet know which benchmark would be suitable.

> Also, before the balance_callback gets invoked, active balancing might
> try to migrate p again and again but fail because
> `busiest->active_balance` is still 1 (you kept this former
> synchronization meant for active_balance_work). In this case the
> likelihood increases that one of the error conditions in
> active_load_balance_cpu_stop() is hit when it's finally called.
>

That does seem to be a problem. I will think about it.
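
The concern can be sketched with a small userspace toy model. This is only
an illustration under stated assumptions: the toy_* names and the two
validity checks are hypothetical and are not the actual checks in
active_load_balance_cpu_stop(). It shows later requests being refused while
the flag stays set, and the deferred work finding its snapshot stale when it
finally runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot taken when the balance is requested. */
struct toy_request {
	bool pending;      /* plays the role of busiest->active_balance */
	int target_task;
	int push_cpu;
	int rejected;      /* attempts refused while pending */
};

/* Hypothetical system state that keeps changing while we wait. */
struct toy_state {
	int task_cpu[4];   /* CPU each task currently sits on */
	bool cpu_idle[4];  /* whether each CPU is idle */
};

enum toy_result { TOY_MIGRATED, TOY_STALE_TASK_MOVED, TOY_STALE_DST_BUSY };

/* Refuse new requests while one is still pending. */
bool toy_try_request(struct toy_request *req, int task, int push_cpu)
{
	assert(req);
	if (req->pending) {
		req->rejected++;
		return false;
	}
	req->pending = true;
	req->target_task = task;
	req->push_cpu = push_cpu;
	return true;
}

/* Deferred work: revalidate the snapshot before acting on it. */
enum toy_result toy_run_deferred(struct toy_request *req,
				 struct toy_state *st, int busiest_cpu)
{
	enum toy_result res;

	if (st->task_cpu[req->target_task] != busiest_cpu) {
		res = TOY_STALE_TASK_MOVED;   /* task already left */
	} else if (!st->cpu_idle[req->push_cpu]) {
		res = TOY_STALE_DST_BUSY;     /* destination got work */
	} else {
		st->task_cpu[req->target_task] = req->push_cpu;
		st->cpu_idle[req->push_cpu] = false;
		res = TOY_MIGRATED;
	}
	req->pending = false;                 /* allow new requests */
	return res;
}
```

The longer the window between toy_try_request() and toy_run_deferred(), the
more refused attempts pile up and the more likely one of the stale branches
is taken.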

> What's wrong with the FIFO-1 "stopper" for CFS active lb?
>

We would have to introduce another per-CPU kernel thread, and I don't know
whether that is worth doing.


-- 
Thanks
Yafang