MIME-Version: 1.0
References: <20220915165407.1776363-1-yu.c.chen@intel.com> <CAKfTPtD3L4437htX__mCNBZJ+fv4MEdnNhCG2kBoQYhVESB_fg@mail.gmail.com>
 <YzcfAOLswvY05s0n@chenyu5-mobl1>
In-Reply-To: <YzcfAOLswvY05s0n@chenyu5-mobl1>
From:   Vincent Guittot <vincent.guittot@linaro.org>
Date:   Mon, 3 Oct 2022 14:42:00 +0200
Message-ID: <CAKfTPtBiH-CbRvUJLU38ZQQocQ=pfGK-vStfDLpZmT47-95BKg@mail.gmail.com>
Subject: Re: [RFC PATCH] sched/fair: Choose the CPU where short task is
 running during wake up
To:     Chen Yu <yu.c.chen@intel.com>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        Tim Chen <tim.c.chen@intel.com>,
        Mel Gorman <mgorman@techsingularity.net>,
        Juri Lelli <juri.lelli@redhat.com>,
        Rik van Riel <riel@surriel.com>,
        Aaron Lu <aaron.lu@intel.com>,
        Abel Wu <wuyun.abel@bytedance.com>,
        K Prateek Nayak <kprateek.nayak@amd.com>,
        Yicong Yang <yangyicong@hisilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@amd.com>,
        Ingo Molnar <mingo@redhat.com>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Ben Segall <bsegall@google.com>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        Valentin Schneider <vschneid@redhat.com>,
        linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Fri, 30 Sept 2022 at 18:53, Chen Yu <yu.c.chen@intel.com> wrote:
>
> Hi Vincent,
> On 2022-09-29 at 10:00:40 +0200, Vincent Guittot wrote:
> [cut]
> > >
> > > This idea has been suggested by Rik at LPC 2019 when discussing
> > > the latency nice. He asked the following question: if P1 is a small-time
> > > slice task on CPU, can we put the waking task P2 on the CPU and wait for
> > > P1 to release the CPU, without wasting time to search for an idle CPU?
> > > At LPC 2021 Vincent Guittot has proposed:
> > > 1. If the wakee is a long-running task, should we skip the short idle CPU?
> > > 2. If the wakee is a short-running task, can we put it onto a lightly loaded
> > >    local CPU?
> >
> > When I said that, I had in mind to use the task utilization (util_avg
> > or util_est) which reflects the recent behavior of the task but not to
> > compute an average duration
> >
> Ah I see. However, there is a scenario (the will-it-scale context switch sub-test) where,
> that, if task A is doing frequent ping-pong context switch with task B on one
> CPU, we should avoid cross-CPU wakeup, by placing the wakee on the same CPU
> as the waker. Since util_avg/est might be high for both waker and wakee,
> we use the average duration to detect this scenario.

Yeah, the utilization can be up to 50% in that case.

> > >
> > > Current proposal is a variant of 2:
> > > If the target CPU is running a short-time slice task, and the wakee
> > > is also a short-time slice task, the target CPU could be chosen as the
> > > candidate when the system is busy.
> > >
> > > The definition of a short-time slice task is: The average running time
> > > of the task during each run is no more than sysctl_sched_min_granularity.
> > > If a task switches in and then voluntarily relinquishes the CPU
> > > quickly, it is regarded as a short-running task. Choosing
> > > sysctl_sched_min_granularity because it is the minimal slice if there
> > > are too many runnable tasks.
> > >
> [cut]
> > >
> > > +/*
> > > + * If a task switches in and then voluntarily relinquishes the
> > > + * CPU quickly, it is regarded as a short running task.
> > > + * sysctl_sched_min_granularity is chosen as the threshold,
> > > + * as this value is the minimal slice if there are too many
> > > + * runnable tasks, see __sched_period().
> > > + */
> > > +static int is_short_task(struct task_struct *p)
> > > +{
> > > +       return (p->se.sum_exec_runtime <=
> > > +               (p->nvcsw * sysctl_sched_min_granularity));
> >
> > You assume that the task behavior will never change during its whole lifetime.
> >
> I was thinking that the average running time of a task could slowly catch
> up with the latest task behavior, but yes, there would be delay especially

Because you never forget the oldest activity, it becomes more and more
difficult for the average to catch up with the latest behavior.
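
To illustrate the point, here is a minimal user-space sketch, not kernel
code. struct task_stats, account_run() and runs_until_reclassified() are
hypothetical names made up for this example; only the comparison itself
mirrors is_short_task() from the patch. It counts how many consecutive
long runs a task needs before a lifetime average stops calling it short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields read by is_short_task(). */
struct task_stats {
	uint64_t sum_exec_runtime_ns;	/* total runtime since creation */
	uint64_t nvcsw;			/* voluntary switches since creation */
};

/* Same comparison as the patch: lifetime average run <= threshold. */
static bool is_short_lifetime_avg(const struct task_stats *t,
				  uint64_t threshold_ns)
{
	return t->sum_exec_runtime_ns <= t->nvcsw * threshold_ns;
}

/* Account one run that ends with a voluntary context switch. */
static void account_run(struct task_stats *t, uint64_t run_ns)
{
	t->sum_exec_runtime_ns += run_ns;
	t->nvcsw++;
}

/*
 * After short_runs runs of short_ns each, return how many consecutive
 * runs of long_ns are needed before the task is no longer "short".
 * Returns UINT64_MAX if long_ns never exceeds the threshold, because
 * the classification could then never flip.
 */
static uint64_t runs_until_reclassified(uint64_t short_runs,
					uint64_t short_ns,
					uint64_t long_ns,
					uint64_t threshold_ns)
{
	struct task_stats t = { 0, 0 };
	uint64_t i, n = 0;

	if (long_ns <= threshold_ns)
		return UINT64_MAX;

	for (i = 0; i < short_runs; i++)
		account_run(&t, short_ns);

	while (is_short_lifetime_avg(&t, threshold_ns)) {
		account_run(&t, long_ns);
		n++;
	}
	return n;
}
```

With an assumed threshold of 0.75 ms, a task that did 1000 runs of
0.1 ms and then switches to 1.5 ms runs needs 867 long runs before it
is reclassified, and that number grows linearly with the length of the
history.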

> for rapidly changing tasks (and similar to rq->avg_idle). I wonder if we
> could use something like:
>         return (p->se.avg.util_avg <=
>                 (p->nvcsw * PELT(sysctl_sched_min_granularity)));

What is PELT(sysctl_sched_min_granularity)?

You need at least a runtime and a period to compute something similar
to a PELT value.

As an example, a task running A ms every B ms period will have a
util_avg of:
  At wakeup:       util_avg = (1-y^A)/(1-y^B)*1024*y^(B-A)
  Before sleeping: util_avg = (1-y^A)/(1-y^B)*1024
with y^32 = 1/2.

To be exact, it's running A segments of 1024us every period of B
segments of 1024us
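
The two formulas can be checked with a small user-space sketch. This is
a floating-point simplification, not the kernel's PELT code: PELT_Y is
an approximate constant and the function names are made up for the
example. pelt_simulate() applies the per-segment decay directly, so it
should converge to the closed form.

```c
/* Approximate decay factor per 1024us segment, chosen so y^32 = 1/2. */
#define PELT_Y 0.97857206

/* y^n by repeated multiplication (avoids a libm dependency). */
static double pelt_decay(unsigned int n)
{
	double d = 1.0;

	while (n--)
		d *= PELT_Y;
	return d;
}

/*
 * Steady-state util_avg right before the task goes to sleep, for a
 * task running a segments out of every b segments (0 < a <= b).
 */
static double util_before_sleep(unsigned int a, unsigned int b)
{
	return (1.0 - pelt_decay(a)) / (1.0 - pelt_decay(b)) * 1024.0;
}

/* Steady-state util_avg at wakeup: the value above decayed over b - a. */
static double util_at_wakeup(unsigned int a, unsigned int b)
{
	return util_before_sleep(a, b) * pelt_decay(b - a);
}

/*
 * Segment-by-segment simulation starting from util_avg = 0: a running
 * segment decays the signal and adds 1024 * (1 - y), a sleeping one
 * only decays it. Returns util_avg at the end of the last run phase.
 */
static double pelt_simulate(unsigned int a, unsigned int b,
			    unsigned int periods)
{
	double util = 0.0;
	unsigned int p, s;

	for (p = 0; p < periods; p++) {
		for (s = 0; s < a; s++)
			util = util * PELT_Y + 1024.0 * (1.0 - PELT_Y);
		if (p + 1 == periods)
			break;	/* stop right before the final sleep */
		for (s = a; s < b; s++)
			util *= PELT_Y;
	}
	return util;
}
```

For A = 16 and B = 32, this gives roughly 600 before sleeping and
roughly 424 at wakeup, so the same task reports noticeably different
values depending on when util_avg is sampled.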

> to reflect the recent behavior of the task.
>
> thanks,
> Chenyu