From: Patrick Bellasi
To: LKML
Cc: Ingo Molnar, "Peter Zijlstra (Intel)", Vincent Guittot, Juri Lelli,
    Paul Turner, Ben Segall, Thomas Gleixner, Jonathan Corbet,
    Dhaval Giani, Dietmar Eggemann, Josef Bacik, Chris Hyser, Parth Shah
Subject: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus
Date: Tue, 23 Jun 2020 09:49:18 +0200
Message-ID: <87blla2pdt.derkling@matbug.net>
In-reply-to: <87imfi2qbk.derkling@matbug.net>

On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi wrote...

> .:: Scheduler Wakeup Path Requirements Collection Template
> ==========================================================
>
> A) Name

Runtime tunable vruntime wakeup bonus.

> B) Target behavior

All SCHED_OTHER tasks get the same (max) vruntime wakeup bonus. This
bonus affects the chance the task has to preempt the currently running
task.

Some tasks, which are (known to be) latency tolerant, should have a
smaller chance to preempt a (known to be) latency sensitive task.
Conversely, latency sensitive tasks should have a higher chance to
preempt a currently running latency tolerant task.

This task-specific distinction is not provided by the current
implementation: all SCHED_OTHER tasks are handled according to the same
simple, system-wide and not run-time tunable policy.

> C) Existing control paths

Assuming:

 C: CFS task currently running on CPUx
 W: CFS task waking up on the same CPUx

and considering the overall simplified workflow:

  core::try_to_wake_up()

    // 1) Select on which CPU W will run
    core::select_task_rq()
      fair::select_task_rq_fair()

    // 2) Enqueue W on the selected CPU
    core::ttwu_queue()
      core::ttwu_do_activate()
        core::activate_task()
          core::enqueue_task()
            fair::enqueue_task_fair()
              fair::enqueue_entity()

                // 3) Set W's vruntime bonus
                fair::place_entity()
                  se->vruntime = ...

      // 4) Check if C can be preempted by W
      core::ttwu_do_wakeup()
        core::check_preempt_curr()
          fair::check_preempt_curr()
            fair::check_preempt_wakeup(curr, se)
              fair::wakeup_preempt_entity(curr, se)
                vdiff = curr.vruntime - se.vruntime
                return vdiff > wakeup_gran(se)

we see that W preempts C iff:

   vdiff > wakeup_gran(se)

Since:

  enqueue_entity(cfs_rq, se, flags)
    place_entity(cfs_rq, se, initial=0)
      thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
      vruntime = cfs_rq->min_vruntime - thresh
      se->vruntime = max_vruntime(se->vruntime, vruntime)

a waking task's W.vruntime can get a "vruntime bonus" of up to:
 - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
 - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)

> D) Desired behavior

The "vruntime bonus" (thresh) computed in place_entity() should have a
per-task definition, which defaults to the current implementation.

A bigger vruntime bonus can be configured for latency sensitive tasks.
A smaller vruntime bonus can be configured for latency tolerant tasks.
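To make the control path above and the desired per-task extension
concrete, here is a minimal, self-contained C model of the bonus
computation and preemption check. It is not kernel code: the values of
sysctl_sched_latency and wakeup_gran are hard-coded, and
latency_bonus_pct is a purely hypothetical per-task field used only to
illustrate the per-task scaling suggested in D) (mainline applies the
same thresh to every task).

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Plain C model of the wakeup path pieces above; not kernel code.
   * latency_bonus_pct is a *hypothetical* per-task knob (0..100)
   * scaling the bonus; 50 mimics today's GENTLE_FAIR_SLEEPERS value.
   */
  static const uint64_t sysctl_sched_latency = 24000000ULL; /* 24ms, in ns */
  static const uint64_t wakeup_gran          =  4000000ULL; /*  4ms, in ns */

  /* Simplified stand-in for the kernel's struct sched_entity. */
  struct sched_entity {
          uint64_t vruntime;
          unsigned int latency_bonus_pct; /* hypothetical per-task field */
  };

  static uint64_t max_vruntime(uint64_t a, uint64_t b)
  {
          return a > b ? a : b;
  }

  /*
   * Models place_entity(): cap how far behind min_vruntime the waking
   * task can be placed, i.e. the size of its "vruntime bonus".
   */
  static void place_entity(uint64_t min_vruntime, struct sched_entity *se)
  {
          uint64_t thresh = sysctl_sched_latency * se->latency_bonus_pct / 100;

          se->vruntime = max_vruntime(se->vruntime, min_vruntime - thresh);
  }

  /*
   * Models wakeup_preempt_entity(): the waking task preempts iff the
   * running task's vruntime leads by more than wakeup_gran.
   */
  static int wakeup_preempt(const struct sched_entity *curr,
                            const struct sched_entity *se)
  {
          int64_t vdiff = (int64_t)(curr->vruntime - se->vruntime);

          return vdiff > (int64_t)wakeup_gran;
  }

  int main(void)
  {
          uint64_t min_vruntime = 100000000ULL;   /* cfs_rq->min_vruntime */
          struct sched_entity curr = { .vruntime = 100000000ULL };
          struct sched_entity w = { .vruntime = 20000000ULL,
                                    .latency_bonus_pct = 50 };

          place_entity(min_vruntime, &w);         /* w.vruntime -> 88ms */
          printf("W preempts C: %s\n", wakeup_preempt(&curr, &w) ? "yes" : "no");
          return 0;
  }

With latency_bonus_pct = 50 this reproduces the GENTLE_FAIR_SLEEPERS
default; 100 and 0 model the two extremes of the per-task range
discussed above.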
TL;DR The "vruntime bonus" is meant to give sleepers a compensation for the service deficit due to them not having (possibly) fully consumed their assigned fair CPU quota within the current sched_latency interval, see: commit 51e0304ce6e5 ("sched: Implement a gentler fair-sleepers feature") The scheduler does that based on a conservative assumption: when a task sleeps it gives up a portion (P) of its fair CPU bandwidth share in the current sched_latency period. Willing to be FAIR, i.e. each task gets a FAIR quota of the CPU in each sched_latency period, the scheduler wants to give back P to waking tasks. However, striving to minimize overheads and complexity, the CFS scheduler does that using a simple heuristic: each task waking up gets a bonus, which is capped at one sched_latency period, independently from "its nature". What the scheduler completely disregards is that being completely FAIR is not always necessary. Depending on the nature of a task, not all tasks require a bonus. To the contrary: - a GENTLE_FAIR_SLEEPERS bonus given to a background task could result in preempting a latency sensitive currently running task - giving only 1/2 scheduler latency bonus to a latency sensitive task could result in that task being preempted before it completes its current activation. > E) Existing knobs The SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true) defined vruntime bonus value can be considered the current mainline default value. This means that "all" CFS tasks waking up will get a 0.5 * sysctl_sched_latency vruntime bonus wrt the cfs_rq->min_vruntime. > F) Existing knobs limitations GENTLE_FAIR_SLEEPERS is a system-wide knob and it's not run-time tunable on production systems (being a SCHED_DEBUG feature). Thus, the sched_feature should be removed and replaced by a per-task knob. > G) Proportionality Analysis The value of the vruntime bonus directly affects the chance a task has to preempt the currently running task. Indeed, from the code analysis in C: thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1) is the "wakeup bonus", which is used as: vruntime = cfs_rq->min_vruntime - thresh se->vruntime = max_vruntime(se->vruntime, vruntime) vdiff = curr.vruntime - se.vruntime preempt condition: vdiff > wakeup_gran(se) > H) Range Analysis The new per-task knob can cover the range [0..sysctl_sched_latency] Latency sensitive tasks will get sysctl_sched_latency as bonus. Latency tolerant tasks will get 0. Values lower than the default sysctl_sched_latency/2 will require special capabilities (e.g. CAP_SYS_NICE). OTHA, a task can relax its wakeup latency requirement by asking for a bonus smaller than the default. Mapping Analysis: check if the range can be mapped into a generic one ================= The latency_nice proposal [2] offers a [-20, 19] range which can be easily mapped into a vruntime bonus range, e.g. using a simpler linear transformation function. A more elaborated mapping could be defined, based on recomputed constants, to provide a relative constant increment. > I) System-Wide tuning The latency_nice provided knobs should be enough to get the desired effect. In (the remote) case a range wider than the one proposed in [H] should be required, perhaps an additional sysctl_sched_latency's multiplier knob could be required. > J) Per-Task tuning The latency_nice provided knobs. > K) Task-Group tuning For tasks-groups, similarly to what uclamp does, a pair of latency_nice_{min,max} clamps should be enough. 
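A minimal sketch, in plain C, of how the effective value could be
resolved; the struct layout and the latency_nice{,_min,_max} field names
are illustrative assumptions mirroring the uclamp min/max scheme, not
existing kernel fields:

  #include <stdio.h>

  /*
   * Illustrative only: none of these structs or field names exist in
   * mainline. The task asks for a latency_nice in [-20, 19]; the group
   * provides a {min,max} clamp, like uclamp does for utilization.
   */
  struct task_group_knobs {
          int latency_nice_min;
          int latency_nice_max;
  };

  struct task_knobs {
          int latency_nice;               /* task-requested value */
  };

  static int clamp_int(int v, int lo, int hi)
  {
          return v < lo ? lo : (v > hi ? hi : v);
  }

  /* Effective value: the task request restricted by its group's clamps. */
  static int effective_latency_nice(const struct task_knobs *p,
                                    const struct task_group_knobs *tg)
  {
          return clamp_int(p->latency_nice,
                           tg->latency_nice_min, tg->latency_nice_max);
  }

  int main(void)
  {
          struct task_group_knobs tg = { .latency_nice_min = 0,
                                         .latency_nice_max = 19 };
          struct task_knobs p = { .latency_nice = -20 };

          /* The group does not allow latency boosting: -20 becomes 0. */
          printf("effective latency_nice: %d\n",
                 effective_latency_nice(&p, &tg));
          return 0;
  }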
The task-specific latency_nice requested value will thus be restricted
by the task group's clamps.
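Finally, as a concrete example of the simple linear transformation
mentioned in the Mapping Analysis of H), here is one possible, purely
illustrative, mapping from the latency_nice range [-20, 19] to a
vruntime bonus in [0..sysctl_sched_latency]; the function name and the
linear scale are assumptions, not part of the latency_nice proposal [2]:

  #include <stdint.h>
  #include <stdio.h>

  static const uint64_t sysctl_sched_latency = 24000000ULL; /* 24ms, in ns */

  /*
   * One possible (purely illustrative) linear mapping:
   *   latency_nice = -20 -> bonus = sysctl_sched_latency (most sensitive)
   *   latency_nice =  19 -> bonus = 0                    (most tolerant)
   */
  static uint64_t latency_nice_to_bonus(int latency_nice)
  {
          /* Map [-20, 19] onto [39, 0], then scale to the bonus range. */
          return sysctl_sched_latency * (uint64_t)(19 - latency_nice) / 39;
  }

  int main(void)
  {
          int ln;

          for (ln = -20; ln <= 19; ln += 13)
                  printf("latency_nice=%3d -> bonus=%llu ns\n",
                         ln, (unsigned long long)latency_nice_to_bonus(ln));
          return 0;
  }

Note that latency_nice = 0 maps to roughly sysctl_sched_latency / 2,
i.e. close to today's GENTLE_FAIR_SLEEPERS default, which preserves the
current behaviour for tasks that do not express any latency requirement.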