From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu, 9 Jun 2022 15:41:59 +0200
Subject: Re: [PATCH] sched: allow newidle balancing to bail out of load_balance
To: Josh Don <joshdon@google.com>
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Valentin Schneider, linux-kernel@vger.kernel.org
In-Reply-To: <20220609025515.2086253-1-joshdon@google.com>
On Thu, 9 Jun 2022 at 04:55, Josh Don <joshdon@google.com> wrote:
>
> While doing newidle load balancing, it is possible for new tasks to
> arrive, such as with pending wakeups. newidle_balance() already accounts
> for this by exiting the sched_domain load_balance() iteration if it
> detects these cases. This is very important for minimizing wakeup
> latency.
>
> However, if we are already in load_balance(), we may stay there for a
> while before returning back to newidle_balance(). This is most
> exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
> very straightforward workaround to this is to adjust should_we_balance()
> to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
> detected.

This new test is close to the other ones, and I wonder whether it would
be better placed before taking the busiest rq lock and detaching tasks
(see the untested sketch at the end of this mail). Besides your use
case, where none of the other threads can move to the local CPU so
load_balance() loops and clears the other CPUs from the mask, most of
the time is probably spent in find_busiest_group() and
find_busiest_queue(), so there is a good chance that a task has woken
up in the meantime, and it becomes pointless to take the lock and pull
tasks from another CPU if the local CPU is no longer newly idle.

Have you tried other places in load_balance(), and does this one give
the lowest wakeup latency?

That being said, the current patch makes sense.

>
> This was tested with the following reproduction:
> - two threads that take turns sleeping and waking each other up are
>   affined to two cores
> - a large number of threads with 100% utilization are pinned to all
>   other cores
>
> Without this patch, wakeup latency was ~120us for the pair of threads,
> almost entirely spent in load_balance(). With this patch, wakeup latency
> is ~6us.
>
> Signed-off-by: Josh Don <joshdon@google.com>
> ---
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8c5b74f66bd3..5abf30117824 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9834,9 +9834,15 @@ static int should_we_balance(struct lb_env *env)
>         /*
>          * In the newly idle case, we will allow all the CPUs
>          * to do the newly idle load balance.
> +        *
> +        * However, we bail out if we already have tasks or a wakeup pending,
> +        * to optimize wakeup latency.
>          */
> -       if (env->idle == CPU_NEWLY_IDLE)
> +       if (env->idle == CPU_NEWLY_IDLE) {
> +               if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> +                       return 0;
>                 return 1;
> +       }
>
>         /* Try to find first idle CPU */
>         for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
> --
> 2.36.1.476.g0c4daa206d-goog
>
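
To make the alternative placement concrete: the check could run once
busiest has been selected, just before we take its lock. Below is an
untested, illustrative sketch against load_balance() in
kernel/sched/fair.c (context lines are approximate and the bail-out
path would need more care; this is not a real patch):

 more_balance:
+       /*
+        * The local CPU is about to pull tasks. If it gained work or a
+        * pending wakeup while we were in find_busiest_group() /
+        * find_busiest_queue(), pulling more tasks only adds wakeup
+        * latency, so give up before touching busiest's lock.
+        */
+       if (env.idle == CPU_NEWLY_IDLE &&
+           (env.dst_rq->nr_running > 0 || env.dst_rq->ttwu_pending))
+               goto out_balanced;
+
        rq_lock_irqsave(busiest, &rf);
        update_rq_clock(busiest);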
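
FWIW, the reproduction described in the changelog can be approximated
in userspace along these lines (hypothetical sketch, not the actual
test; the pipe-based ping-pong and the CPU numbers are assumptions):

/* Two threads ping-pong wakeups through pipes while pinned to CPUs 0
 * and 1; spinners pinned to the remaining CPUs keep them 100% busy.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static int ab[2], ba[2];        /* A->B and B->A wakeup pipes */

static void pin_self(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);  /* 0: calling thread */
}

static void *thread_a(void *arg)
{
        char c = 0;

        pin_self(0);
        for (;;) {
                write(ab[1], &c, 1);    /* wake B */
                read(ba[0], &c, 1);     /* sleep until B wakes us */
        }
        return NULL;
}

static void *thread_b(void *arg)
{
        char c;

        pin_self(1);
        for (;;) {
                read(ab[0], &c, 1);
                write(ba[1], &c, 1);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pipe(ab);
        pipe(ba);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        /*
         * Busy loops on the remaining cores, e.g. from a shell:
         *   for i in $(seq 2 $(($(nproc) - 1))); do
         *           taskset -c $i sh -c 'while :; do :; done' &
         *   done
         */
        pthread_join(a, NULL);
        return 0;
}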