From: Vincent Guittot
Date: Fri, 10 Jul 2020 15:21:48 +0200
Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus
To: Patrick Bellasi
Cc: LKML, Ingo Molnar, "Peter Zijlstra (Intel)", Juri Lelli, Paul Turner,
    Ben Segall, Thomas Gleixner, Jonathan Corbet, Dhaval Giani,
    Dietmar Eggemann, Josef Bacik, Chris Hyser, Parth Shah
In-Reply-To: <87blla2pdt.derkling@matbug.net>
References: <87v9kv2545.derkling@matbug.com> <87h7wd15v2.derkling@matbug.net>
    <87imgrlrqi.derkling@matbug.net> <87mu5sqwkt.derkling@matbug.net>
    <87eer42clt.derkling@matbug.net> <87imfi2qbk.derkling@matbug.net>
    <87blla2pdt.derkling@matbug.net>

Hi Patrick,

On Tue, 23 Jun 2020 at 09:49, Patrick Bellasi wrote:
>
> On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi wrote...
>
> > .:: Scheduler Wakeup Path Requirements Collection Template
> > ==========================================================
> >
> > A) Name
>
> Runtime tunable vruntime wakeup bonus.

Thanks for describing this use case.

> >
> > B) Target behavior
>
> All SCHED_OTHER tasks get the same (max) vruntime wakeup bonus. This
> bonus affects the chance the task has to preempt the currently running
> task. Some tasks, which are (known to be) latency tolerant, should have
> a smaller chance to preempt a (known to be) latency sensitive task. To
> the contrary, latency sensitive tasks should have a higher chance to
> preempt a currently running latency tolerant task.
>
> This task-specific distinction is not provided by the current
> implementation: all SCHED_OTHER tasks are handled according to the same
> simple, system-wide and not run-time tunable policy.
>
> >
> > C) Existing control paths
>
> Assuming:
>
>   C: CFS task currently running on CPUx
>   W: CFS task waking up on the same CPUx
>
> And considering the overall simplified workflow:
>
>   core::try_to_wake_up()
>
>     // 1) Select on which CPU W will run
>     core::select_task_rq()
>     fair::select_task_rq_fair()
>
>     // 2) Enqueue W on the selected CPU
>     core::ttwu_queue()
>     core::ttwu_do_activate()
>     core::activate_task()
>     core::enqueue_task()
>     fair::enqueue_task_fair()
>     fair::enqueue_entity()
>
>       // 3) Set W's vruntime bonus
>       fair::place_entity()
>         se->vruntime = ...
>
>     // 4) Check if C can be preempted by W
>     core::ttwu_do_wakeup()
>     core::check_preempt_curr()
>     fair::check_preempt_curr()
>     fair::check_preempt_wakeup(curr, se)
>     fair::wakeup_preempt_entity(curr, se)
>       vdiff = curr.vruntime - se.vruntime
>       return vdiff > wakeup_gran(se)
>
> We see that W preempts C iff:
>
>   vdiff > wakeup_gran(se)
>
> Since:
>
>   enqueue_entity(cfs_rq, se, flags)
>     place_entity(cfs_rq, se, initial=0)
>       thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>       vruntime = cfs_rq->min_vruntime - thresh
>       se->vruntime = max_vruntime(se->vruntime, vruntime)
>
> a waking task's W.vruntime can get a "vruntime bonus" of up to:
>
>   - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
>   - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
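
[ For readers without a tree at hand, this is a condensed view of the
  mainline kernel/sched/fair.c code that the analysis above paraphrases
  (around v5.7, lightly abridged from memory; the initial-placement
  START_DEBIT branch is omitted). Please consult the actual sources for
  the authoritative version. ]

  /* kernel/sched/fair.c: sleeper placement on enqueue */
  static void
  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
  {
          u64 vruntime = cfs_rq->min_vruntime;

          /* ... initial placement (START_DEBIT) handling omitted ... */

          /* sleeps up to a single latency don't count */
          if (!initial) {
                  unsigned long thresh = sysctl_sched_latency;

                  /* halve the bonus for a gentler effect of sleepers */
                  if (sched_feat(GENTLE_FAIR_SLEEPERS))
                          thresh >>= 1;

                  vruntime -= thresh;
          }

          /* ensure we never gain time by being placed backwards */
          se->vruntime = max_vruntime(se->vruntime, vruntime);
  }

  /* kernel/sched/fair.c: the wakeup preemption check */
  static int
  wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
  {
          s64 gran, vdiff = curr->vruntime - se->vruntime;

          if (vdiff <= 0)
                  return -1;

          gran = wakeup_gran(se);
          if (vdiff > gran)
                  return 1;

          return 0;
  }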
> >
> > D) Desired behavior
>
> The "vruntime bonus" (thresh) computed in place_entity() should have a
> per-task definition, which defaults to the current implementation.
>
> A bigger vruntime bonus can be configured for latency sensitive tasks.
> A smaller vruntime bonus can be configured for latency tolerant tasks.

I'm not sure that adjusting what you called the "vruntime bonus" is the
right way to provide a latency benefit, because it doesn't only provide
a wakeup latency bonus but also a runtime bonus. It means that one can
impact the running time by playing with latency_nice, whereas the goal
is only to impact the wakeup latency. Instead, it should weight the
decision in wakeup_preempt_entity() and wakeup_gran().

> TL;DR
>
> The "vruntime bonus" is meant to give sleepers a compensation for the
> service deficit due to them not having (possibly) fully consumed their
> assigned fair CPU quota within the current sched_latency interval, see:
>
>   commit 51e0304ce6e5 ("sched: Implement a gentler fair-sleepers feature")
>
> The scheduler does that based on a conservative assumption: when a task
> sleeps it gives up a portion (P) of its fair CPU bandwidth share in the
> current sched_latency period.
> Striving to be FAIR, i.e. to give each task a FAIR quota of the CPU in
> each sched_latency period, the scheduler wants to give P back to waking
> tasks.
>
> However, striving to minimize overheads and complexity, the CFS
> scheduler does that using a simple heuristic: each task waking up gets
> a bonus, which is capped at one sched_latency period, independently of
> "its nature".
>
> What the scheduler disregards is that being completely FAIR is not
> always necessary. Depending on the nature of a task, not all tasks
> require a bonus. To the contrary:
>
>   - a GENTLE_FAIR_SLEEPERS bonus given to a background task could
>     result in it preempting a latency sensitive currently running task
>
>   - giving only 1/2 scheduler latency bonus to a latency sensitive task
>     could result in that task being preempted before it completes its
>     current activation.
>
> >
> > E) Existing knobs
>
> The vruntime bonus value defined by SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
> can be considered the current mainline default.
>
> This means that "all" CFS tasks waking up will get a
>
>   0.5 * sysctl_sched_latency
>
> vruntime bonus wrt cfs_rq->min_vruntime.
>
> >
> > F) Existing knobs limitations
>
> GENTLE_FAIR_SLEEPERS is a system-wide knob and it's not run-time
> tunable on production systems (being a SCHED_DEBUG feature).
>
> Thus, the sched_feature should be removed and replaced by a per-task
> knob.
>
> >
> > G) Proportionality Analysis
>
> The value of the vruntime bonus directly affects the chance a task has
> to preempt the currently running task.
>
> Indeed, from the code analysis in [C]:
>
>   thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>
> is the "wakeup bonus", which is used as:
>
>   vruntime = cfs_rq->min_vruntime - thresh
>   se->vruntime = max_vruntime(se->vruntime, vruntime)
>   vdiff = curr.vruntime - se.vruntime
>
>   preempt condition: vdiff > wakeup_gran(se)
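
[ To make the alternative suggested above concrete, here is a minimal,
  purely illustrative sketch of weighting the preemption decision
  through wakeup_gran() with a latency_nice-derived factor, instead of
  touching the placement in place_entity(). The latency_nice field, its
  range and the linear scaling are assumptions for illustration only,
  not part of any posted patch. ]

  /* Hypothetical latency_nice-aware variant of wakeup_gran() */
  static unsigned long wakeup_gran_weighted(struct sched_entity *se)
  {
          /*
           * latency_nice: assumed new per-task field in [-20 .. 19];
           * se is assumed to be a task entity (entity_is_task(se)).
           */
          int latency_nice = task_of(se)->latency_nice;
          long gran = sysctl_sched_wakeup_granularity;

          /*
           * Scale the granularity linearly with latency_nice:
           *   -20 -> ~0    a latency sensitive waker preempts easily
           *     0 ->  1x   today's behaviour
           *   +19 -> ~2x   a latency tolerant waker rarely preempts
           */
          gran += gran * latency_nice / 20;

          return calc_delta_fair((unsigned long)gran, se);
  }

  A similar factor could instead bias the vdiff comparison in
  wakeup_preempt_entity() directly; either way, vruntime accounting (and
  thus the running time) is left untouched and only the wakeup
  preemption decision changes.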
> >
> > H) Range Analysis
>
> The new per-task knob can cover the range [0..sysctl_sched_latency].
>
> Latency sensitive tasks will get sysctl_sched_latency as bonus.
> Latency tolerant tasks will get 0.
>
> Values higher than the default sysctl_sched_latency/2 will require
> special capabilities (e.g. CAP_SYS_NICE). OTOH, a task can relax its
> wakeup latency requirement by asking for a bonus smaller than the
> default.
>
> Mapping Analysis: check if the range can be mapped into a generic one
> =================
>
> The latency_nice proposal [2] offers a [-20, 19] range which can be
> easily mapped into a vruntime bonus range, e.g. using a simple linear
> transformation function.
>
> A more elaborate mapping could be defined, based on recomputed
> constants, to provide a constant relative increment per step.
>
> >
> > I) System-Wide tuning
>
> The knobs provided by latency_nice should be enough to get the desired
> effect.
>
> In the (remote) case that a range wider than the one proposed in [H] is
> required, an additional sysctl_sched_latency multiplier knob could
> perhaps be added.
>
> >
> > J) Per-Task tuning
>
> The knobs provided by latency_nice.
>
> >
> > K) Task-Group tuning
>
> For task groups, similarly to what uclamp does, a pair of
> latency_nice_{min,max} clamps should be enough.
>
> The task-specific latency_nice requested value will be restricted by
> the task group's clamps.
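
[ A minimal sketch of the linear mapping suggested by the mapping
  analysis in [H]: latency_nice in [-20, 19] is turned into a per-task
  wakeup bonus in [0 .. sysctl_sched_latency]. The helper name and the
  exact scaling are made up for illustration; per [K], the task's
  latency_nice value would first be clamped into its group's
  latency_nice_{min,max} window, similarly to uclamp. ]

  /*
   * Hypothetical mapping: latency_nice -20 (most latency sensitive)
   * gets the full sysctl_sched_latency bonus, +19 (most latency
   * tolerant) gets no bonus, and 0 lands close to today's
   * GENTLE_FAIR_SLEEPERS default of sysctl_sched_latency / 2.
   */
  static unsigned long latency_nice_to_wakeup_bonus(int latency_nice)
  {
          unsigned int pos = 19 - latency_nice;           /* 0 .. 39 */

          return div_u64((u64)sysctl_sched_latency * pos, 39);
  }

  place_entity() would then subtract this per-task bonus instead of the
  fixed thresh, while the alternative discussed earlier in this thread
  would feed a similar per-task factor into wakeup_gran() instead.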