2020-06-23 07:32:45

by Patrick Bellasi

Subject: Scheduler wakeup path tuning surface: Use-Cases and Requirements


Since last year's OSPM Summit we have been developing the idea that the
task wakeup path could be better tuned for certain classes of workloads
and usage scenarios. Various people have shown interest in a possible
tuning interface for the scheduler wakeup path.


.:: The Problem
===============

The discussions we have had so far [1] have not been effective in clearly
identifying whether a common tuning surface is possible. The last
discussion, at this year's OSPM Summit [2,3], was also somewhat
inconclusive and left us with the message: start by collecting the
requirements and then see which interface fits them best.

The general consensus is that a unified interface is challenging and
maybe not feasible. However, generality is also a value and we should
strive for it whenever possible.

One might argue that we have not yet put enough effort into the analysis
of requirements. Perhaps what we have missed so far is a structured and
organised way to collect requirements, one which can also help in
factoring out what they have in common.


.:: The Proposal
================

This thread aims to provide a guided template for the description of
different task wakeup use-cases. It does so by posing a series of
questions aimed at precisely describing what is "currently broken", what
we would like to have instead and how we could achieve it.

What we propose here is that, for each wakeup use-case, we start by
replying to this email to provide the required details/comments for a
predefined list of questions. This will generate independent discussion
threads. Each thread will be free to focus on a specific proposal, while
all the threads will still reason around a common set of fundamental
concepts.

The hope is that, by representing all the use-cases as sufficiently
detailed responses to a common set of questions, once the discussion
settles down we can more easily verify whether common elements surface
which can then be naturally wrapped into a unified user-space API.

A first use-case description, following the template guidelines, will
be posted shortly after this message. It will also serve as an example
of how to use the template.

NOTE: Whenever required, pseudo-code or simplified C can be used.

I hope this can drive a fruitful discussion in preparation for LPC!

Best,
Patrick


---8<--- For templates submissions: reply only to the following ---8<---


.:: Scheduler Wakeup Path Requirements Collection Template
==========================================================

A) Name: unique one-liner name for the proposed use-case

B) Target behaviour: one paragraph to describe the wakeup path issue

C) Existing control paths: reference to code paths to justify B)

Assuming v5.6 as the reference kernel, this section should provide
links to code paths, e.g.

fair.c:3917
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?h=v5.6#n3917

Alternatively code snippets can be added inline, e.g.

/*
 * The 'current' period is already promised to the current tasks,
 * however the extra weight of the new task will slow them down a
 * little, place the new task so that it fits in the slot that
 * stays open at the end.
 */
if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);

NOTE: if the use-case exists only outside the mainline Linux kernel
this section can stay empty

D) Desired behaviour: one paragraph to describe the desired update

NOTE: the current mainline expression is assumed to be correct
for existing use-cases. Thus, here we are looking for run-time
tuning of those existing features.

E) Existing knobs (if any): reference to whatever existing tunable

Some features can already be tuned, but perhaps only via compile-time
knobs, SCHED_FEATs or system-wide tunables.
If that's the case, we should document them here and explain how they
currently work and what (if any) implicit assumptions they make, e.g.
what is the currently implemented scheduling policy/heuristic.

F) Existing knobs limitations: one paragraph description of the limitations

If the existing knobs are not up to the job for this use-case, briefly
explain why. It could be because a tuning surface already exists but is
hardcoded (e.g. a compile-time knob) or too coarse grained (e.g. a
SCHED_FEAT).

G) Proportionality Analysis: check the nature of the target behavior

The goal here is to verify and discuss whether behaviour (B) has a
proportional nature: different values of the control knobs (E) are
expected to produce different effects on (B).

Special care should be taken to check whether the target behaviour has
an intrinsically "binary nature", i.e. only two values really make
sense. In that case it would be very useful to argue why a
generalisation towards non-binary behaviour does NOT make sense.

H) Range Analysis: identify meaningful ranges

If (G) was successful, i.e. there is a proportional correlation
between (E) and (B), discuss here a meaningful range for the knobs
identified in (E) and (F).

I) System-Wide tuning: which knobs are required

If required, list the additional tunables here, how they should be
exposed and (if applicable) which scheduling classes will be affected.

J) Per-Task tuning: which knobs are required

Describe which knobs should be added and via which task-specific API
(e.g. sched_setscheduler(), prctl(), ...) they should be exposed.

K) Task-Group tuning: which knobs are required

If the use-case can benefit from task-group tuning, it should be
**briefly described** here how the expected behaviour can be mapped
onto a cgroup v2 unified hierarchy.

NOTE: implementation details are not required but we should be able
to hint at which cgroup v2 resource distribution model [5]
should be applied.


---8<--- For templates submissions: exclude the following ---8<---


.:: References
==============

[1] [Discussion v2] Usecases for the per-task latency-nice attribute
Message-ID: [email protected]
https://lore.kernel.org/lkml/[email protected]

[2] Latency Nice: Implementation and UseCases for Scheduler Optimizations
https://ospm.lwn.net/playback/presentation/2.0/playback.html?meetingId=380dd88f044f67ee4c94d0a2a4fb7c3f46cb6391-1589459486615&t=42m37s

[3] LWN: The many faces of "latency nice"
https://lwn.net/Articles/820659

[4] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
Message-ID: [email protected]
https://lore.kernel.org/lkml/[email protected]

[5] Control Group v2: Resource Distribution Models
https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst


2020-06-23 07:52:29

by Patrick Bellasi

Subject: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus


On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi <[email protected]> wrote...

> .:: Scheduler Wakeup Path Requirements Collection Template
> ==========================================================
>
> A) Name

Runtime tunable vruntime wakeup bonus.


> B) Target behavior

All SCHED_OTHER tasks get the same (max) vruntime wakeup bonus. This
bonus affects the chance the task has to preempt the currently running
task. Some tasks, which are (known to be) latency tolerant, should have
a smaller chance to preempt a (known to be) latency sensitive task. To
the contrary, latency sensitive tasks should have a higher chance to
preempt a currently running latency tolerant task.

This task specific distinction is not provided by the current
implementation and all SCHED_OTHER tasks are handled according to the
same simple, system-wide and not run-time tunable policy.


> C) Existing control paths

Assuming:

C: CFS task currently running on CPUx
W: CFS task waking up on the same CPUx

And considering the overall simplified workflow:

core::try_to_wake_up()

  // 1) Select on which CPU W will run
  core::select_task_rq()
    fair::select_task_rq_fair()

  // 2) Enqueue W on the selected CPU
  core::ttwu_queue()
    core::ttwu_do_activate()
      core::activate_task()
        core::enqueue_task()
          fair::enqueue_task_fair()
            fair::enqueue_entity()

              // 3) Set W's vruntime bonus
              fair::place_entity()
                se->vruntime = ...

  // 4) Check if C can be preempted by W
  core::ttwu_do_wakeup()
    core::check_preempt_curr()
      fair::check_preempt_curr()
        fair::check_preempt_wakeup(curr, se)
          fair::wakeup_preempt_entity(curr, se)
            vdiff = curr.vruntime - se.vruntime
            return vdiff > wakeup_gran(se)

We see that W preempts C iff:

vdiff > wakeup_gran(se)

Since:

enqueue_entity(cfs_rq, se, flags)
  place_entity(cfs_rq, se, initial=0)
    thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
    vruntime = cfs_rq->min_vruntime - thresh
    se->vruntime = max_vruntime(se->vruntime, vruntime)

a waking task's W.vruntime can get a "vruntime bonus" up to:
- 1/2 scheduler latency (w/ GENTLE_FAIR_SLEEPERS)
- 1 scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
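The placement and preemption logic above can be modelled in plain C.
This is a userspace sketch, not kernel code: the 24ms sched_latency and
4ms wakeup granularity defaults are assumptions, and the helpers only
mirror the simplified workflow shown above.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long u64;

/* Assumed defaults (ns): sysctl_sched_latency = 24ms,
 * sysctl_sched_wakeup_granularity = 4ms. */
#define SYSCTL_SCHED_LATENCY      24000000ULL
#define SYSCTL_SCHED_WAKEUP_GRAN   4000000ULL

static u64 max_vruntime(u64 a, u64 b)
{
        return a > b ? a : b;
}

/* Model of place_entity() for a waking (initial=0) entity: the waker
 * is placed at most 'thresh' ns behind cfs_rq->min_vruntime. */
static u64 place_wakeup(u64 min_vruntime, u64 se_vruntime, bool gentle)
{
        u64 thresh = SYSCTL_SCHED_LATENCY / (gentle ? 2 : 1);

        return max_vruntime(se_vruntime, min_vruntime - thresh);
}

/* Model of wakeup_preempt_entity(): W preempts C iff vdiff exceeds
 * the wakeup granularity. */
static bool preempts(u64 curr_vruntime, u64 se_vruntime)
{
        return (long long)(curr_vruntime - se_vruntime) >
               (long long)SYSCTL_SCHED_WAKEUP_GRAN;
}
```

With the gentle flag a long sleeper is clamped to min_vruntime - 12ms,
without it to min_vruntime - 24ms, which makes the "runtime bonus"
concern discussed later in this thread easy to see.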


> D) Desired behavior

The "vruntime bonus" (thresh) computed in place_entity() should have a
per-task definition, which defaults to the current implementation.

A bigger vruntime bonus can be configured for latency sensitive tasks.
A smaller vruntime bonus can be configured for latency tolerant tasks.

TL;DR

The "vruntime bonus" is meant to give sleepers a compensation for the
service deficit due to them not having (possibly) fully consumed their
assigned fair CPU quota within the current sched_latency interval, see:

commit 51e0304ce6e5 ("sched: Implement a gentler fair-sleepers feature")

The scheduler does that based on a conservative assumption: when a task
sleeps it gives up a portion (P) of its fair CPU bandwidth share in the
current sched_latency period.
Striving to be FAIR, i.e. to give each task a fair quota of the CPU in
each sched_latency period, the scheduler wants to give P back to waking
tasks.

However, to minimize overheads and complexity, the CFS scheduler does
that using a simple heuristic: each task waking up gets a bonus, which
is capped at one sched_latency period, independently of "its nature".

What the scheduler completely disregards is that being completely FAIR
is not always necessary. Depending on the nature of a task, not all
tasks require a bonus. To the contrary:

- a GENTLE_FAIR_SLEEPERS bonus given to a background task could result
in preempting a latency sensitive currently running task

- giving only 1/2 scheduler latency bonus to a latency sensitive task
could result in that task being preempted before it completes its
current activation.


> E) Existing knobs

The vruntime bonus value defined by SCHED_FEAT(GENTLE_FAIR_SLEEPERS,
true) can be considered the current mainline default.

This means that "all" CFS tasks waking up will get a

0.5 * sysctl_sched_latency

vruntime bonus wrt the cfs_rq->min_vruntime.


> F) Existing knobs limitations

GENTLE_FAIR_SLEEPERS is a system-wide knob and it's not run-time
tunable on production systems (being a SCHED_DEBUG feature).

Thus, the sched_feature should be removed and replaced by a per-task
knob.


> G) Proportionality Analysis

The value of the vruntime bonus directly affects the chance a task has
to preempt the currently running task.

Indeed, from the code analysis in (C):

thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)

is the "wakeup bonus", which is used as:

vruntime = cfs_rq->min_vruntime - thresh
se->vruntime = max_vruntime(se->vruntime, vruntime)
vdiff = curr.vruntime - se.vruntime

preempt condition: vdiff > wakeup_gran(se)


> H) Range Analysis

The new per-task knob can cover the range [0..sysctl_sched_latency]

Latency sensitive tasks will get sysctl_sched_latency as bonus.
Latency tolerant tasks will get 0.

Values higher than the default sysctl_sched_latency/2 will require
special capabilities (e.g. CAP_SYS_NICE). OTOH, a task can relax
its wakeup latency requirement by asking for a bonus smaller than the
default.

Mapping Analysis: check if the range can be mapped into a generic one
=================

The latency_nice proposal [2] offers a [-20, 19] range which can be
easily mapped into a vruntime bonus range, e.g. using a simple linear
transformation function.

A more elaborate mapping, based on recomputed constants, could be
defined to provide a relative constant increment.
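As a sketch of the simple linear transformation mentioned above (the
function name and the exact [-20, 19] to [0..sysctl_sched_latency]
mapping are illustrative assumptions, not an agreed API):

```c
#include <assert.h>

typedef unsigned long long u64;

/* Assumed default: sysctl_sched_latency = 24ms (in ns). */
#define SYSCTL_SCHED_LATENCY 24000000ULL

/* Hypothetical linear map: latency_nice = -20 gets the full
 * sched_latency bonus, +19 gets no bonus, and 0 lands close to the
 * current default of 1/2 sched_latency. */
static u64 latency_nice_to_bonus(int latency_nice)
{
        return (u64)(19 - latency_nice) * SYSCTL_SCHED_LATENCY / 39;
}
```

Note that latency_nice = 0 maps to ~0.487 * sysctl_sched_latency rather
than exactly 1/2; a mapping preserving today's default exactly would
need slightly different constants.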


> I) System-Wide tuning

The latency_nice provided knobs should be enough to get the desired
effect.

In the (remote) case that a range wider than the one proposed in (H)
is required, an additional sysctl_sched_latency multiplier knob could
be added.


> J) Per-Task tuning

The latency_nice provided knobs.


> K) Task-Group tuning

For task-groups, similarly to what uclamp does, a pair of
latency_nice_{min,max} clamps should be enough.

The task-specific latency_nice requested value will be restricted by the
task group's clamps.

2020-07-09 12:11:28

by Parth Shah

Subject: [SchedulerTaskPacking] Small background task packing

> A) Name:

Small background task packing

> B) Target behaviour:

All fair task wakeups follow a procedure of finding an idle CPU and
waking the task on that idle CPU. There are two major wakeup paths:
1. Slow-path: wake up the task on an idle CPU which is in the shallowest
idle state, by searching in the highest sched_domain flagged with
SD_BALANCE_FORK or SD_BALANCE_EXEC.
2. Fast-path: wake up the task on an idle CPU in the LLC sched_domain of
the waker CPU. There are optimizations to bias task placement towards the
prev_cpu or wake_cpu of the task. This path is used in most cases, except
a few like fork() and exec().

Picking an idle CPU is a fair assumption in order to consume system
resources uniformly. But not all tasks deserve to wake up an idle core,
as that can hurt power consumption: e.g. a small background task waking
up an idle core may run only for a very small duration. Reducing the
number of active cores allows busier cores to boost frequency, hence
saving power can also result in performance benefits.

There is no mechanism in the existing wakeup path to detect small
background tasks which could be packed on fewer cores.

> C) Existing control paths:

fair:: .select_task_rq = select_task_rq_fair

fair::select_task_rq_fair()
  // 1) Slow-path: find idle CPU with shallowest idle states
  find_idlest_cpu()

  // 2) Fast-path: find idle CPU
  fair::select_idle_sibling()
    // Wake up on idle core if available
    fair::select_idle_core()
    // Else wakeup on idle CPU if available
    fair::select_idle_cpu()
    fair::select_idle_smt()


There are multiple ways to reach fair::select_task_rq_fair():

// 1) try_to_wake_up, which should lead to the fast-path
core::try_to_wake_up()
  cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

// 2) wake_up_new_task, which should lead to the slow-path
core::wake_up_new_task()
  __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));

// 3) sched_exec, which should lead to the slow-path
core::sched_exec()
  dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
                                            SD_BALANCE_EXEC, 0);

> D) Desired behaviour:

Latency tolerant tasks with small utilization should be packed
on a busy core rather than waking up a new core/CPU.

Upon detecting small background tasks, a different code-path can be used
to search for a busy core, as described below:

sched/fair.c:
static inline bool is_small_bg_task(struct task_struct *p)
{
        if (is_latency_tolerant(p) &&
            (task_util(p) < (SCHED_CAPACITY_SCALE >> 3)))
                return true;

        return false;
}

sched/fair.c: in select_task_rq_fair()

if (sd_flag & SD_BALANCE_WAKE) {
        if (is_small_bg_task(p)) {
                // New proposed wakeup path to do task packing
                new_cpu = select_non_idle_core(p, prev_cpu);
                if (new_cpu >= 0)
                        return new_cpu;
        }
}

where select_non_idle_core() searches for a busy core already running some
tasks and selects an idle CPU in that busy core to pack the waking task.

Complete code can be found on TurboSched patches [1].
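As an illustration only (this is a userspace model, not the TurboSched
implementation; the 2-threads-per-core topology and the simple linear
scan are assumptions), the selection logic described above could look
like:

```c
#include <assert.h>

#define CPUS_PER_CORE 2  /* assumed SMT-2 topology */

/* Model of select_non_idle_core(): return an idle CPU belonging to a
 * core that is already running something, or -1 if no such core exists.
 * busy[i] != 0 means CPU i is busy; ncpu is a multiple of
 * CPUS_PER_CORE. */
static int select_non_idle_core_model(const int *busy, int ncpu)
{
        for (int core = 0; core < ncpu / CPUS_PER_CORE; core++) {
                int idle_cpu = -1, nbusy = 0;

                for (int t = 0; t < CPUS_PER_CORE; t++) {
                        int cpu = core * CPUS_PER_CORE + t;

                        if (busy[cpu])
                                nbusy++;
                        else
                                idle_cpu = cpu;
                }
                /* pack on a busy, but not fully busy, core */
                if (nbusy > 0 && idle_cpu >= 0)
                        return idle_cpu;
        }
        return -1;
}
```

A fully idle system returns -1, so the caller falls back to the regular
idle-CPU search rather than packing onto a woken-up core.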

> E) Existing knobs (if any): reference to whatever existing tunable

There is no existing knob which can hint the scheduler about the latency
nature of a task. Detecting the latency nature of a task can help in
classifying it as a small background task to be packed on fewer cores.

There are user-space hacks to do task packing for background tasks with
the use of cpuset.cpus or sched_setaffinity() to manually affine the
process to a few dedicated cores.

> F) Existing knobs (if any): one paragraph description of the limitations

The sysadmin/user has to define a cpumask for each process (aka task
pinning), which is static in nature. There are multiple limitations to
pinning small background tasks:

- In the presence of just one small background task, there is no power
consumption benefit; it would be preferable to pin it to an already
busy CPU.

- If a task changes behaviour during its life-cycle, the sysadmin has
to manually pin/unpin it. The user can only classify a task as a
"background" one and cannot say if and when it will be "small" in
utilization.

- The sysadmin cannot change the task's affinity mask based on the
scheduling pattern to obtain the most optimal performance and energy
consumption.

> G) Proportionality Analysis: check the nature of the target behavior

Task packing has to be noninvasive in nature, meaning only tasks which
are latency tolerant should be packed on fewer cores. Providing this
behaviour needs a run-time tunable knob which can hint the scheduler on
whether the waking task can be packed or not.

Upon detecting the nature of the task, a specific wakeup path can be
followed:
1. For latency-tolerant tasks with low utilization, a newly proposed
scheduling wakeup path is followed to do packing.
2. For latency-sensitive tasks, the existing wakeup approach can be used.

> H) Range Analysis: identify meaningful ranges

The new knob can be a binary input accepting 0/1, where 0 means a
latency-sensitive and 1 means a latency-tolerant task.

Latency-sensitive tasks (value = 0) can search for an idle CPU in only
the LLC sched_domain, while latency-tolerant (value = 1) tasks can go
across the LLC sched_domain (just like in the slow-path), but here in
order to search for a busy core.

Mapping analysis:
================
The latency_nice proposal [2] offers a [-20, 19] range which can be
mapped into a binary switch, e.g. using a threshold based function.

However, it is possible to extend the concept of finding a busy core by
limiting the time spent searching based on the latency_nice value in
the range [-20, 19], where a value of 19 indicates searching the whole
chip for a busy core, whereas a value of 10 could mean searching half
of the cores in the chip.
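A sketch of this proportional search limit (the function name and the
exact scaling are hypothetical, just to make the mapping concrete;
nr_cores is platform dependent):

```c
#include <assert.h>

/* Hypothetical mapping from latency_nice in [-20, 19] to the number of
 * cores scanned for a busy core: values <= 0 disable packing entirely,
 * 19 scans the whole chip, 10 roughly half of it. */
static int busy_core_search_limit(int latency_nice, int nr_cores)
{
        if (latency_nice <= 0)
                return 0;  /* latency sensitive: no packing search */

        return nr_cores * latency_nice / 19;
}
```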

> I) System-Wide tuning: which knobs are required

The latency_nice provided knobs should be enough to get the desired
effect.

> J) Per-Task tuning: which knobs are required

sched_setscheduler() is sufficient.

> K) Task-Group tuning: which knobs are required

A single attribute classifying a task-group as latency_sensitive or
latency_tolerant is sufficient.


> .:: References
> ==============

[1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
for longer durations
https://lkml.org/lkml/2020/1/21/39
[2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
Message-ID: [email protected]
https://lore.kernel.org/lkml/[email protected]
[3] TurboSched: the return of small-task packing
https://lwn.net/Articles/792471/
https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4

2020-07-09 23:11:38

by Chris Hyser

Subject: [SchedulerWakeupLatency] Skipping Idle Cores and CPU Search

> A) Name:

Skipping Idle Cores and CPU Search


> B) Target behavior:

Finding idle CPUs in the CFS scheduler for scheduling awakened tasks increases system throughput at the expense of
additional wake-up latency. For the majority of processes this is a reasonable trade-off. Some communication tasks are
sufficiently sensitive to latency as well as sufficiently short lived (the search time on large systems exceeding the
typical run before sleep time of the comm task itself) to warrant skipping the search. While a real-time priority in a
correctly set up environment might remove the latency, real-time has its own issues, in particular with cgroup
interoperability/configuration.


> C) Existing control paths: reference to code paths to justify B)

--------------------------------------------------------------

select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
        nr = INT_MAX
        avg_idle = this_rq()->avg_idle / 512;
        avg_cost = this_sd->avg_scan_cost + 1;

        if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
                return -1;

        if (sched_feat(SIS_PROP))
                u64 span_avg = sd->span_weight * avg_idle;
                if (span_avg > 4*avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
                        nr = 4;

        // conditionally use a per-task tunable to adjust 'nr' or return 'target'.

        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
        for_each_cpu_wrap(cpu, cpus, target)
                if (!--nr)
                        return -1;
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                        break;

        return cpu;

-------------------------------------------------------------

select_idle_sibling(struct task_struct *p, int prev, int target)
        // if NOT per-task tunable set, i.e. the default for tasks:
        i = select_idle_core(p, sd, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        i = select_idle_cpu(p, sd, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        i = select_idle_smt(p, target);
        if ((unsigned)i < nr_cpumask_bits)
                return i;

        return target;

-------------------------------------------------------------


> D) Desired behavior:

Reduce the maximum wake-up latency of designated CFS tasks by skipping some or all of the idle CPU and core searches by
setting a maximum idle CPU search value (maximum loop iterations).

Searching 'ALL' as the maximum would be the default and implies the current code path, which may or may not search up to
ALL. Searching 0 would result in the least latency (shown with experimental results to be included if/when the patchset
goes up). One consideration is that the maximum length of the search is a function of the size of the LLC scheduling
domain, which is platform dependent. Whether 'some', i.e. a numerical value limiting the search, can be used to
"normalize" this latency across differing scheduling domain sizes is under investigation. Clearly, differing hardware
will have many other significant differences, but in different sized and dynamically sized VMs running on fleets of
common HW this may be interesting.


> E/F) Existing knobs (and limitations):

There are existing sched_feats, SIS_AVG_CPU and SIS_PROP, that attempt to short-circuit the idle CPU search path in
select_idle_cpu() based on estimates of the current cost of searching. Neither provides a means of task-based
selectivity, potentially shortening the search for unimportant tasks and lengthening it for critical ones. This also
highlights the difference between lowering average or tail-end latencies for all tasks versus, as desired here, for
specific targeted ones. Not surprisingly, latency reduction here, as in general, comes at the expense of system
throughput, in this case unbalanced-ness in the LLC scheduling domain.

The proposal here allows maximum reduction of the search time for specific tasks. Keeping the number of such tasks
small has minimal (though not zero) impact on system throughput, as the vast majority of tasks remain subject to the
existing searches.


> G) Proportionality Analysis:

While the immediate use case suggests that one either cares about the latency of a task or not, maximally doing 'ALL'
or 0 searches -- presumably one generally wants maximum latency reduction -- the nature of the problem (see pseudo
code) allows precise specification of the maximum number of CPUs to search, placing an upper limit on search latency.


> H) Range Analysis:

The knob is a positive integer representing "max number of CPUs to search". The default would be 'ALL' which could be
translated as INT_MAX. '0 searches' translates to 0. Other values represent a max limit on the search, in this case
iterations of a for loop.
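The clamp suggested by the comment in the pseudo code above could be as
simple as the following sketch (per_task_max is the proposed,
hypothetical per-task knob; INT_MAX encodes 'ALL'):

```c
#include <assert.h>
#include <limits.h>

/* Clamp the SIS_PROP-derived scan budget 'sis_prop_nr' by a per-task
 * maximum: INT_MAX ('ALL') preserves the current behaviour, 0 means
 * skip the idle CPU search entirely. */
static int effective_search_nr(int sis_prop_nr, int per_task_max)
{
        return sis_prop_nr < per_task_max ? sis_prop_nr : per_task_max;
}
```

In select_idle_cpu() the clamped value would replace 'nr' before the
for_each_cpu_wrap() loop, with 0 translating to an early return of
'target'.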


> I) System-Wide tuning:

No system-wide tuning required.


> J) Per-Task tuning:

The proposal is a per-task single positive integer representing the maximum number of CPUs to search.
sched_setscheduler() would allow both 'nice' and this value to be set together, which is a reasonably expected use case
for latency sensitive tasks.


> K) Task-Group tuning:

As the intention here is to limit this behavior to specific identified tasks, and overall performance can be negatively
affected if too many or the wrong tasks are selected, this use case does not appear to really need task groupings,
hierarchical relationships, or child inheritance.


-chrish

2020-07-10 13:25:03

by Vincent Guittot

Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

Hi Patrick,

On Tue, 23 Jun 2020 at 09:49, Patrick Bellasi
<[email protected]> wrote:
>
>
> On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi <[email protected]> wrote...
>
> > .:: Scheduler Wakeup Path Requirements Collection Template
> > ==========================================================
> >
> > A) Name
>
> Runtime tunable vruntime wakeup bonus.

Thanks for describing this use case.

>
> [...]
>
> > D) Desired behavior
>
> The "vruntime bonus" (thresh) computed in place_entity() should have a
> per-task definition, which defaults to the current implementation.
>
> A bigger vruntime bonus can be configured for latency sensitive tasks.
> A smaller vruntime bonus can be configured for latency tolerant tasks.

I'm not sure that adjusting what you called the "vruntime bonus" is the
right way to provide lower latency, because it doesn't only provide a
wakeup latency bonus but also provides a runtime bonus. It means that
one can impact the running time by playing with latency_nice, whereas
the goal is only to impact the wakeup latency. Instead, it should
weigh the decision in wakeup_preempt_entity() and wakeup_gran().

> [...]

2020-07-10 20:01:22

by Patrick Bellasi

Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus


On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot <[email protected]> wrote...

> Hi Patrick,

Hi Vincent,

[...]

>> > C) Existing control paths
>>
>> Assuming:
>>
>> C: CFS task currently running on CPUx
>> W: CFS task waking up on the same CPUx
>>
>> And considering the overall simplified workflow:
>>
>> core::try_to_wake_up()
>>
>> // 1) Select on which CPU W will run
>> core::select_task_rq()
>> fair::select_task_rq_fair()
>>
>> // 2) Enqueue W on the selected CPU
>> core::ttwu_queue()
>> core::ttwu_do_activate()
>> core::activate_task()
>> core::enqueue_task()
>> fair::enqueue_task_fair()
>> fair::enqueue_entity()
>>
>> // 3) Set W's vruntime bonus
>> fair::place_entity()
>> se->vruntime = ...
>>
>> // 4) Check if C can be preempted by W
>> core::ttwu_do_wakeup()
>> core::check_preempt_curr()
>> fair::check_preempt_curr()
>> fair::check_preempt_wakeup(curr, se)
>> fair::wakeup_preempt_entity(curr, se)
>> vdiff = curr.vruntime - se.vruntime
>> return vdiff > wakeup_gran(se)
>>
>> We see that W preempts C iff:
>>
>> vdiff > wakeup_gran(se)
>>
>> Since:
>>
>> enqueue_entity(cfs_rq, se, flags)
>> place_entity(cfs_rq, se, initial=0)
>> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>> vruntime = cfs_rq->min_vruntime - thresh
>> se->vruntime = max_vruntime(se->vruntime, vruntime)
>>
>> a waking task's W.vruntime can get a "vruntime bonus" up to:
>> - 1/2 scheduler latency (w/ GENTLE_FAIR_SLEEPERS)
>> - 1 scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
>>
>>
>> > D) Desired behavior
>>
>> The "vruntime bonus" (thresh) computed in place_entity() should have a
>> per-task definition, which defaults to the current implementation.
>>
>> A bigger vruntime bonus can be configured for latency sensitive tasks.
>> A smaller vruntime bonus can be configured for latency tolerant tasks.
>
> I'm not sure that adjusting what you called "vruntime bonus" is the
> right way to provide some latency because it doesn't only provide a
> wakeup latency bonus but also provides a runtime bonus.

True, however that's what we already do, but _just_ in a hard-coded way.

A task waking up from sleep gets a 1/2 sched_latency bonus, or a full
one w/o GENTLE_FAIR_SLEEPERS. The point is that not all tasks are the
same: for some this bonus may not really be required, for others it is
too small.

Regarding the 'runtime bonus' I think it's kind of unavoidable,
if we really want a latency sensitive task to be scheduled
before the others.
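As a minimal model of what this changes in place_entity(): today the bonus is the hard-coded thresh, the proposal makes it per-task. The helper below is a simplified sketch (the per-task `bonus_ns` parameter is this sketch's assumption), but it also shows the clamp against the task's own vruntime.

```c
#include <assert.h>

typedef unsigned long long u64;

static u64 max_vruntime(u64 a, u64 b)
{
	return a > b ? a : b;
}

/*
 * Simplified model of the wakeup placement in place_entity(): today
 * "bonus_ns" is hard-coded to sysctl_sched_latency / 2 (with
 * GENTLE_FAIR_SLEEPERS); the proposal takes it from the task.
 * Note the clamp: a task whose vruntime is already at or above
 * (min_vruntime - bonus) keeps its own vruntime and gets no
 * effective bonus.
 */
static u64 place_wakeup(u64 se_vruntime, u64 cfs_min_vruntime, u64 bonus_ns)
{
	u64 vruntime = cfs_min_vruntime - bonus_ns;

	return max_vruntime(se_vruntime, vruntime);
}
```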

> It means that one can impact the running time by playing with
> latency_nice whereas the goal is only to impact the wakeup latency.

Well, I'm not sure how much one can really gain considering that
this bonus is given only at wakeup time: the task would have to keep
suspending itself. It would get a better result by just asking for a
lower nice value.

Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
protected. The same will hold for latency_nice.

Moreover, considering that by default tasks will get what we already
have hard-coded, or less of a bonus, I don't see how it could easily
be abused.

On the contrary, we can introduce a very useful knob to allow certain
tasks to voluntarily demote themselves and avoid annoying a currently
running task.

> Instead, it should weight the decision in wakeup_preempt_entity() and
> wakeup_gran()

In those functions we already take the task prio into consideration
(ref details at the end of this message).

Tasks with a lower nice value have a better chance to preempt current
since they will have a smaller wakeup_gran, indeed:

we preempt IFF vdiff(se, current) > wakeup_gran(se)
\----------------/ \-------------/
A B

While a task's prio affects B, in this proposal latency_nice works on
the A side of the equation above, making it a bit more task specific.

That said, it's true that both latency_nice and prio will ultimately
play a role on how much CPU bandwidth a task gets.

Question is: do we deem it useful to have an additional knob working on
the A side of the equation above?

Best,
Patrick



---8<------8<------8<------8<------8<------8<------8<------8<------8<------8<---

TL;DR: The nice value already affects the wakeup latency

As reported above:

check_preempt_wakeup(rq, p, wake_flags)
wakeup_preempt_entity(curr, se)
(d) vdiff = curr.vruntime - se.vruntime
(e) return vdiff > wakeup_gran(se)

we see that W preempts C iff:

vdiff > wakeup_gran(se)

But:

wakeup_gran(se)
calc_delta_fair(delta=sysctl_sched_wakeup_granularity, se)
__calc_delta(delta_exec=delta, weight=NICE_0_LOAD, lw=&se->load)
(c) wakeup_gran = sched_wakeup_granularity * (NICE_0_LOAD / W.load.weight)

Thus, the wakeup granularity of W depends on:
- the system-wide configured wakeup granularity
sysctl_sched_wakeup_granularity := [0..1e9]ns
- W.load.weight [88761, .., 15]

But since:

set_user_nice()
p->static_prio = NICE_TO_PRIO(nice) = 120 + nice
set_load_weight(p, update_load=true)
reweight_task(p, prio)
(a) prio = p->static_prio - MAX_RT_PRIO = 120 + nice - 100 = nice + 20
(b) weight = scale_load(sched_prio_to_weight[prio]) = [88761, ..., 15]
reweight_entity(cfs_rq, se, weight=weight, runnable=weight)
update_load_set(lw=&se->load, w=weight)
lw->weight = w
p->prio = effective_prio(p)

We see that by tuning a task's nice value we affect its wakeup granularity:

lower the nice
=(a)=> lower the prio
=(b)=> higher the weight
=(c)=> smaller the wakeup_gran

This means that for a given system-wide knob (sched_wakeup_granularity),
we still get different behaviours depending on a task-specific knob.
A smaller nice makes it more likely for W to preempt C.
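The (a)->(b)->(c) chain above can be made concrete with the values of the kernel's sched_prio_to_weight[] table. The function below is a simplified model of the wakeup_gran() scaling, not the actual implementation:

```c
#include <assert.h>

typedef unsigned long long u64;

#define NICE_0_LOAD 1024ULL

/* Values of the kernel's sched_prio_to_weight[] table (nice -20..19). */
static const unsigned int sched_prio_to_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/*
 * Simplified model of step (c): the effective wakeup granularity of a
 * task scales inversely with its load weight, i.e. with its nice value.
 */
static u64 wakeup_gran_ns(int nice, u64 sysctl_wakeup_gran_ns)
{
	u64 weight = sched_prio_to_weight[nice + 20];	/* steps (a)+(b) */

	return sysctl_wakeup_gran_ns * NICE_0_LOAD / weight;
}
```

For example, a nice -20 task's wakeup_gran is roughly 87x smaller than a nice 0 task's (1024/88761), so its wakeups preempt current far more easily.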

2020-07-13 13:01:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi
<[email protected]> wrote:
>
>
> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot <[email protected]> wrote...
>
> > Hi Patrick,
>
> Hi Vincent,
>
> [...]
>
> >> > C) Existing control paths
> >>
> >> Assuming:
> >>
> >> C: CFS task currently running on CPUx
> >> W: CFS task waking up on the same CPUx
> >>
> >> And considering the overall simplified workflow:
> >>
> >> core::try_to_wake_up()
> >>
> >> // 1) Select on which CPU W will run
> >> core::select_task_rq()
> >> fair::select_task_rq_fair()
> >>
> >> // 2) Enqueue W on the selected CPU
> >> core::ttwu_queue()
> >> core::ttwu_do_activate()
> >> core::activate_task()
> >> core::enqueue_task()
> >> fair::enqueue_task_fair()
> >> fair::enqueue_entity()
> >>
> >> // 3) Set W's vruntime bonus
> >> fair::place_entity()
> >> se->vruntime = ...
> >>
> >> // 4) Check if C can be preempted by W
> >> core::ttwu_do_wakeup()
> >> core::check_preempt_curr()
> >> fair::check_preempt_curr()
> >> fair::check_preempt_wakeup(curr, se)
> >> fair::wakeup_preempt_entity(curr, se)
> >> vdiff = curr.vruntime - se.vruntime
> >> return vdiff > wakeup_gran(se)
> >>
> >> We see that W preempts C iff:
> >>
> >> vdiff > wakeup_gran(se)
> >>
> >> Since:
> >>
> >> enqueue_entity(cfs_rq, se, flags)
> >> place_entity(cfs_rq, se, initial=0)
> >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
> >> vruntime = cfs_rq->min_vruntime - thresh
> >> se->vruntime = max_vruntime(se->vruntime, vruntime)
> >>
> >> a waking task's W.vruntime can get a "vruntime bonus" up to:
> >> - 1/2 scheduler latency (w/ GENTLE_FAIR_SLEEPERS)
> >> - 1 scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
> >>
> >>
> >> > D) Desired behavior
> >>
> >> The "vruntime bonus" (thresh) computed in place_entity() should have a
> >> per-task definition, which defaults to the current implementation.
> >>
> >> A bigger vruntime bonus can be configured for latency sensitive tasks.
> >> A smaller vruntime bonus can be configured for latency tolerant tasks.
> >
> > I'm not sure that adjusting what you called "vruntime bonus" is the
> > right way to provide some latency because it doesn't only provide a
> > wakeup latency bonus but also provides a runtime bonus.
>
> True, however that's what we already do but _just_ in an hard-coded way.
>
> A task waking up from sleep gets a 1/2 sched_latency bonus, or a full
> one w/o GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some this

From a nice and fair PoV, it's not a bonus but the opposite. In fact
it's limiting how much credit the task will keep from its sleep time.
Also keep in mind that this value is clamped by the task's own
vruntime, so a task doesn't always get the full bonus.

> bonus can be not really required, for others too small.
>
> Regarding the 'runtime bonus' I think it's kind of unavoidable,
> if we really want a latency sensitive task being scheduled
> before the others.

That's where I disagree.
2 tasks with the same nice priority must get the same running time
whatever their latency_nice priority. The only difference should be in
selecting which one will run first, or making it easier for one task
to preempt the other, but in the end both should be able to get the
same running time.

>
> > It means that one can impact the running time by playing with
> > latency_nice whereas the goal is only to impact the wakeup latency.
>
> Well, but I'm not sure how much you can really gain considering that
> this bonus is given only at wakeup time: the task should keep
> suspending himself. It would get a better result by just asking for a
> lower nice value.
>
> Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
> protected. The same will be for latency_nice.
>
> Moreover, considering that by default tasks will get what we already
> have as hard-coded or less of a bonus, I don't see how easy should be to
> abuse.
>
> To the contrary we can introduce a very useful knob to allow certain
> tasks to voluntarily demote themselves and avoid annoying a currently
> running task.
>
> > Instead, it should weight the decision in wakeup_preempt_entity() and
> > wakeup_gran()
>
> In those functions we already take the task prio into consideration
> (ref details at the end of this message).
>
> Lower nice value tasks have more chances to preempt current since they
> will have a smaller wakeup_gran, indeed:

yes, and this is there to ensure a fair distribution of running time
and prevent a task from significantly increasing its vruntime compared
to others

-1 means that se already got more runtime than current,
0 means that se's vruntime will go above current's vruntime after a
runtime duration of sched_min_granularity,
and 1 means that se got less runtime than current and its vruntime
will still be lower than current's even after a runtime duration of
sched_min_granularity.

IMHO, latency_nice should impact the decision only for case 0, not
cases -1 and 1.
That being said, we can extend the case 0 a bit to include the
situation where current's vruntime will become greater than se's
vruntimes after a runtime duration of sched_min_granularity like
below:

curr->vruntime
|<-- wakeup_gran(se) -->|<--
wakeupgran(curr) -->|
current range: se->vruntime +1 | 0 | -1
new range: se->vruntime +1 | 0
| -1
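The -1/0/+1 classification described above corresponds to wakeup_preempt_entity() in fair.c; a simplified model of its return value:

```c
#include <assert.h>

typedef long long s64;

/*
 * Simplified model of fair.c's wakeup_preempt_entity(): "gran" stands
 * for wakeup_gran(se), already scaled by se's load weight.
 */
static int wakeup_preempt(s64 curr_vruntime, s64 se_vruntime, s64 gran)
{
	s64 vdiff = curr_vruntime - se_vruntime;

	if (vdiff <= 0)
		return -1;	/* se already got more runtime than current */
	if (vdiff > gran)
		return 1;	/* se is clearly behind: preempt current */
	return 0;		/* within wakeup_gran: no preemption */
}
```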


>
> we preempt IFF vdiff(se, current) > wakeup_gran(se)
> \----------------/ \-------------/
> A B
>
> While task's prio affects B, in this proposal, latency_nice works on the
> A side of the equation above by making it a bit more task specific.
>
> That said, it's true that both latency_nice and prio will ultimately
> play a role on how much CPU bandwidth a task gets.

whereas nice is there to split CPU bandwidth between task according to
their prio, latency nice should not

>
> Question is: do we deem it useful to have an additional knob working on
> the A side of the equation above?
>
> Best,
> Patrick
>
>
>
> ---8<------8<------8<------8<------8<------8<------8<------8<------8<------8<---
>
> TL;DR: The nice value already affects the wakeup latency
>
> As reported above:
>
> check_preempt_wakeup(rq, p, wake_flags)
> wakeup_preempt_entity(curr, se)
> (d) vdiff = curr.vruntime - se.vruntime
> (e) return vdiff > wakeup_gran(se)
>
> we see that W preempts C iff:
>
> vdiff > wakeup_gran(se)
>
> But:
>
> wakeup_gran(se)
> calc_delta_fair(delta=sysctl_sched_wakeup_granularity, se)
> __calc_delta(delta_exec=delta, weight=NICE_0_LOAD, lw=&se->load)
> (c) wakeup_gran = sched_wakeup_granularity * (NICE_0_LOAD / W.load.weight)
>
> Thus, the wakeup granularity of W depends on:
> - the system-wide configured wakeup granularity
> sysctl_sched_wakeup_granularity := [0..1e9]ns
> - W.load.weight [88761, .., 15]
>
> But since:
>
> set_user_nice()
> p->static_prio = NICE_TO_PRIO(nice) = 120 + nice
> set_load_weight(p, update_load=true)
> reweight_task(p, prio)
> (a) prio = p->static_prio - MAX_RT_PRIO = 120 + nice - 100 = nice + 20
> (b) weight = scale_load(sched_prio_to_weight[prio]) = [88761, ..., 15]
> reweight_entity(cfs_rq, se, weight=weight, runnable=weight)
> update_load_set(lw=&se->load, w=weight)
> lw->weight = w
> p->prio = effective_prio(p)
>
> We see that by tuning a task's nice value we affect its wakeup granularity:
>
> lower the nice
> =(a)=> lower the prio
> =(b)=> higher the weight
> =(c)=> smaller the wakeup_gran
>
> This means that for a given system-wide knob (sched_wakeup_granularity),
> we still get different behaviours depending on a task specific knob.
> A smaller nice makes more likely W to preempt C.
>

2020-07-16 16:52:08

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

On 13/07/2020 14:59, Vincent Guittot wrote:
> On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi
> <[email protected]> wrote:
>>
>>
>> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot <[email protected]> wrote...

[...]

>>> Instead, it should weight the decision in wakeup_preempt_entity() and
>>> wakeup_gran()
>>
>> In those functions we already take the task prio into consideration
>> (ref details at the end of this message).
>>
>> Lower nice value tasks have more chances to preempt current since they
>> will have a smaller wakeup_gran, indeed:
>
> yes, and this is there to ensure a fair distribution of running time
> and prevent a task to increase significantly its vruntime compare to
> others
>
> -1 min that se already got more runtime than current
> 0 that se's vruntime will go above current's vruntime after a runtime
> duration of sched_min_granularity
> and 1 means that se got less runtime than current and its vruntime
> will still be lower than current ones even after a runtime duration of
> sched_min_granularity
>
> IMHO, latency_nice should impact the decision only for case 0 but not
> the case -1 and 1.
> That being said, we can extend the case 0 a bit to include the
> situation where current's vruntime will become greater than se's
> vruntimes after a runtime duration of sched_min_granularity like
> below:
>
> curr->vruntime
> |<-- wakeup_gran(se) -->|<--
> wakeupgran(curr) -->|
> current range: se->vruntime +1 | 0 | -1
> new range: se->vruntime +1 | 0
> | -1
>

I assume this got messed up by line break somehow:

curr->vruntime
|<-- wakeup_gran(se) -->|<-- wakeup_gran(curr) -->|
current range: se->vruntime +1 | 0 | -1
new range: se->vruntime +1 | 0 | -1

IMHO, with the current use of wakeup_preempt_entity() I don't see what
will change with that.
We check 'wakeup_preempt_entity() == 1' in check_preempt_wakeup() and
'wakeup_preempt_entity() < 1' in pick_next_entity().

How should the mapping from se's latency_nice value to the
consideration of wakeup_gran(curr) look?

[...]

2020-07-16 19:56:00

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus


Hi Vincent,

On Mon, Jul 13, 2020 at 14:59:51 +0200, Vincent Guittot <[email protected]> wrote...

> On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi <[email protected]> wrote:
>> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot <[email protected]> wrote...
>>
>> [...]
>>
>> >> > C) Existing control paths
>> >>
>> >> Assuming:
>> >>
>> >> C: CFS task currently running on CPUx
>> >> W: CFS task waking up on the same CPUx
>> >>
>> >> And considering the overall simplified workflow:
>> >>
>> >> core::try_to_wake_up()
>> >>
>> >> // 1) Select on which CPU W will run
>> >> core::select_task_rq()
>> >> fair::select_task_rq_fair()
>> >>
>> >> // 2) Enqueue W on the selected CPU
>> >> core::ttwu_queue()
>> >> core::ttwu_do_activate()
>> >> core::activate_task()
>> >> core::enqueue_task()
>> >> fair::enqueue_task_fair()
>> >> fair::enqueue_entity()
>> >>
>> >> // 3) Set W's vruntime bonus
>> >> fair::place_entity()
>> >> se->vruntime = ...
>> >>
>> >> // 4) Check if C can be preempted by W
>> >> core::ttwu_do_wakeup()
>> >> core::check_preempt_curr()
>> >> fair::check_preempt_curr()
>> >> fair::check_preempt_wakeup(curr, se)
>> >> fair::wakeup_preempt_entity(curr, se)
>> >> vdiff = curr.vruntime - se.vruntime
>> >> return vdiff > wakeup_gran(se)
>> >>
>> >> We see that W preempts C iff:
>> >>
>> >> vdiff > wakeup_gran(se)
>> >>
>> >> Since:
>> >>
>> >> enqueue_entity(cfs_rq, se, flags)
>> >> place_entity(cfs_rq, se, initial=0)
>> >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>> >> vruntime = cfs_rq->min_vruntime - thresh
>> >> se->vruntime = max_vruntime(se->vruntime, vruntime)
>> >>
>> >> a waking task's W.vruntime can get a "vruntime bonus" up to:
>> >> - 1/2 scheduler latency (w/ GENTLE_FAIR_SLEEPERS)
>> >> - 1 scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
>> >>
>> >>
>> >> > D) Desired behavior
>> >>
>> >> The "vruntime bonus" (thresh) computed in place_entity() should have a
>> >> per-task definition, which defaults to the current implementation.
>> >>
>> >> A bigger vruntime bonus can be configured for latency sensitive tasks.
>> >> A smaller vruntime bonus can be configured for latency tolerant tasks.
>> >
>> > I'm not sure that adjusting what you called "vruntime bonus" is the
>> > right way to provide some latency because it doesn't only provide a
>> > wakeup latency bonus but also provides a runtime bonus.
>>
>> True, however that's what we already do but _just_ in an hard-coded way.
>>
>> A task waking up from sleep gets a 1/2 sched_latency bonus, or a full
>> one w/o GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some this
>
> From a nice and fair PoV, it's not a bonus but the opposite. In fact
> it's limiting how much credit, the task will keep from its sleep time.

I agree about 'limiting a credit'; thus, being a _credit_, IMO it's a
bonus, and the limiting happens only with GENTLE_FAIR_SLEEPERS.

So, in general, tasks _waking up_ get a (limited) credit, i.e. a
vruntime bonus.

From the FAIR PoV it is even more of a bonus since, AFAIU, all the
machinery is designed to give some vruntime bonus to _non_ CPU-bound /
batch tasks.

That's done to compensate for them being suspended and thus not having a
chance to consume all their fair CPU share in the previous activation.

> Also keep in mind that this value is clamped by its vruntime so a task
> can't get bonus

True, at wakeup we clamp it with the SE (normalized) vruntime.

But still since we do:

se->vruntime = max(se->vruntime, cfs_rq->min_vruntime-VRUNTIME_BONUS)
\---- A ----/ \--------------- B ---------------/

The bigger B is, the more likely we are to "penalize" the SE vruntime.


>> bonus can be not really required, for others too small.
>>
>> Regarding the 'runtime bonus' I think it's kind of unavoidable,
>> if we really want a latency sensitive task being scheduled
>> before the others.
>
> That's where I disagree.
> 2 tasks with the same nice priority must get the same running time
> whatever their latency_nice priority.

This is guaranteed only in the very simple case of CPU-bound tasks, and
the mechanism I describe is not impacting those tasks.

For sleeping tasks, if you consider all the approximations we do (gentle
fair sleepers to begin with) we will never be "precise".
... not to speak of tasks migrations or running at different OPPs.

> The only difference should be to select which one will run 1st or make
> it easy for 1 task to preempt the other one but at the end both should
> be able to get the same running time

Agree, but I would say that we should aim at getting this result across
a few sched_latencies.

IOW, I'm not convinced that granting the exact same fair CPU share
within a sched_latency period with sleeping tasks is something we
already have.

>> > It means that one can impact the running time by playing with
>> > latency_nice whereas the goal is only to impact the wakeup latency.
>>
>> Well, but I'm not sure how much you can really gain considering that
>> this bonus is given only at wakeup time: the task should keep
>> suspending himself. It would get a better result by just asking for a
>> lower nice value.
>>
>> Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
>> protected. The same will be for latency_nice.
>>
>> Moreover, considering that by default tasks will get what we already
>> have as hard-coded or less of a bonus, I don't see how easy should be to
>> abuse.
>>
>> To the contrary we can introduce a very useful knob to allow certain
>> tasks to voluntarily demote themselves and avoid annoying a currently
>> running task.
>>
>> > Instead, it should weight the decision in wakeup_preempt_entity() and
>> > wakeup_gran()
>>
>> In those functions we already take the task prio into consideration
>> (ref details at the end of this message).
>>
>> Lower nice value tasks have more chances to preempt current since they
>> will have a smaller wakeup_gran, indeed:
>
> yes, and this is there to ensure a fair distribution of running time
> and prevent a task to increase significantly its vruntime compare to
> others

Don't follow you here :/

> -1 min that se already got more runtime than current
> 0 that se's vruntime will go above current's vruntime after a runtime
> duration of sched_min_granularity
> and 1 means that se got less runtime than current and its vruntime
> will still be lower than current ones even after a runtime duration of
> sched_min_granularity
>
> IMHO, latency_nice should impact the decision only for case 0 but not
> the case -1 and 1.
> That being said, we can extend the case 0 a bit to include the
> situation where current's vruntime will become greater than se's
> vruntimes after a runtime duration of sched_min_granularity like
> below:

(I hope I reformatted the following as you intended it)

> curr->vruntime
> | wakeup_gran(se) | wakeup_gran(curr) |
> current: se->vruntime +1 | 0 | -1
> new: se->vruntime +1 | 0 | -1
>

Does that mean penalizing a waking up task wrt current, such that a
high latency_nice task waking up has less chance to preempt current?

>> we preempt IFF vdiff(se, current) > wakeup_gran(se)
>> \----------------/ \-------------/
>> A B
>>
>> While task's prio affects B, in this proposal, latency_nice works on the
>> A side of the equation above by making it a bit more task specific.
>>
>> That said, it's true that both latency_nice and prio will ultimately
>> play a role on how much CPU bandwidth a task gets.
>
> whereas nice is there to split CPU bandwidth between task according to
> their prio, latency nice should not

Of course that would be something new for which I see your concerns.

However, I still think that making the task-agnostic, hard-coded
heuristics we have for A (above) a bit more tunable could give
benefits for tasks that are very short running (compared to a
sched_latency) while still never really spoiling/abusing the FAIR
repartitioning of the CPU bandwidth.

I'm thinking, for example, of graphics pipeline tasks which run for
<1ms every 16ms.

2020-07-17 12:23:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

On Thu, 16 Jul 2020 at 18:48, Dietmar Eggemann <[email protected]> wrote:
>
> On 13/07/2020 14:59, Vincent Guittot wrote:
> > On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi
> > <[email protected]> wrote:
> >>
> >>
> >> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot <[email protected]> wrote...
>
> [...]
>
> >>> Instead, it should weight the decision in wakeup_preempt_entity() and
> >>> wakeup_gran()
> >>
> >> In those functions we already take the task prio into consideration
> >> (ref details at the end of this message).
> >>
> >> Lower nice value tasks have more chances to preempt current since they
> >> will have a smaller wakeup_gran, indeed:
> >
> > yes, and this is there to ensure a fair distribution of running time
> > and prevent a task to increase significantly its vruntime compare to
> > others
> >
> > -1 min that se already got more runtime than current
> > 0 that se's vruntime will go above current's vruntime after a runtime
> > duration of sched_min_granularity
> > and 1 means that se got less runtime than current and its vruntime
> > will still be lower than current ones even after a runtime duration of
> > sched_min_granularity
> >
> > IMHO, latency_nice should impact the decision only for case 0 but not
> > the case -1 and 1.
> > That being said, we can extend the case 0 a bit to include the
> > situation where current's vruntime will become greater than se's
> > vruntimes after a runtime duration of sched_min_granularity like
> > below:
> >
> > curr->vruntime
> > |<-- wakeup_gran(se) -->|<--
> > wakeupgran(curr) -->|
> > current range: se->vruntime +1 | 0 | -1
> > new range: se->vruntime +1 | 0
> > | -1
> >
>
> I assume this got messed up by line break somehow:

yes

>
> curr->vruntime
> |<-- wakeup_gran(se) -->|<-- wakeup_gran(curr) -->|
> current range: se->vruntime +1 | 0 | -1
> new range: se->vruntime +1 | 0 | -1
>
> IMHO, with the current use of wakeup_preempt_entity() I don't see what
> will change with that.
> We check 'wakeup_preempt_entity() == 1' in check_preempt_wakeup() and
> 'wakeup_preempt_entity() < 1' in pick_next_entity().
>
> How should the mapping between se's latency_nice value to the consideration of
> wakeup_gran(curr) look like?

IMO, latency_nice can be used to move the +1|0 boundary to the right
and the 0|-1 boundary to the left, with a formula that needs to be
defined. And we also need to review where and how
wakeup_preempt_entity() is used.
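One possible shape for that boundary move could look like the sketch below. The scaling formula is purely invented here for illustration; as said above, the real one still needs to be defined:

```c
#include <assert.h>

typedef long long s64;

/*
 * Purely hypothetical sketch: latency_nice in [-20, 19] scales the
 * vdiff threshold needed for a wakeup preemption. latency_nice == 0
 * reproduces today's "vdiff > wakeup_gran(se)" boundary; negative
 * values lower the threshold (easier preemption), positive values
 * raise it (harder preemption).
 */
static int wakeup_preempt_latency(s64 vdiff, s64 gran_se, int latency_nice)
{
	s64 thresh = gran_se + gran_se * latency_nice / 20;

	if (vdiff <= 0)
		return -1;
	if (vdiff > thresh)
		return 1;
	return 0;
}
```

The 0|-1 boundary could be shifted symmetrically using wakeup_gran(curr), as in the extended diagram discussed above; that side is omitted here for brevity.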

>
> [...]

2020-07-17 14:20:02

by Vincent Guittot

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

On Thu, 16 Jul 2020 at 21:55, Patrick Bellasi
<[email protected]> wrote:
>
>
> Hi Vincent,
>
> On Mon, Jul 13, 2020 at 14:59:51 +0200, Vincent Guittot <[email protected]> wrote...
>
> > On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi <[email protected]> wrote:
> >> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot <[email protected]> wrote...
> >>
> >> [...]
> >>
> >> >> > C) Existing control paths
> >> >>
> >> >> Assuming:
> >> >>
> >> >> C: CFS task currently running on CPUx
> >> >> W: CFS task waking up on the same CPUx
> >> >>
> >> >> And considering the overall simplified workflow:
> >> >>
> >> >> core::try_to_wake_up()
> >> >>
> >> >> // 1) Select on which CPU W will run
> >> >> core::select_task_rq()
> >> >> fair::select_task_rq_fair()
> >> >>
> >> >> // 2) Enqueue W on the selected CPU
> >> >> core::ttwu_queue()
> >> >> core::ttwu_do_activate()
> >> >> core::activate_task()
> >> >> core::enqueue_task()
> >> >> fair::enqueue_task_fair()
> >> >> fair::enqueue_entity()
> >> >>
> >> >> // 3) Set W's vruntime bonus
> >> >> fair::place_entity()
> >> >> se->vruntime = ...
> >> >>
> >> >> // 4) Check if C can be preempted by W
> >> >> core::ttwu_do_wakeup()
> >> >> core::check_preempt_curr()
> >> >> fair::check_preempt_curr()
> >> >> fair::check_preempt_wakeup(curr, se)
> >> >> fair::wakeup_preempt_entity(curr, se)
> >> >> vdiff = curr.vruntime - se.vruntime
> >> >> return vdiff > wakeup_gran(se)
> >> >>
> >> >> We see that W preempts C iff:
> >> >>
> >> >> vdiff > wakeup_gran(se)
> >> >>
> >> >> Since:
> >> >>
> >> >> enqueue_entity(cfs_rq, se, flags)
> >> >> place_entity(cfs_rq, se, initial=0)
> >> >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
> >> >> vruntime = cfs_rq->min_vruntime - thresh
> >> >> se->vruntime = max_vruntime(se->vruntime, vruntime)
> >> >>
> >> >> a waking task's W.vruntime can get a "vruntime bonus" up to:
> >> >> - 1/2 scheduler latency (w/ GENTLE_FAIR_SLEEPERS)
> >> >> - 1 scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
> >> >>
> >> >>
> >> >> > D) Desired behavior
> >> >>
> >> >> The "vruntime bonus" (thresh) computed in place_entity() should have a
> >> >> per-task definition, which defaults to the current implementation.
> >> >>
> >> >> A bigger vruntime bonus can be configured for latency sensitive tasks.
> >> >> A smaller vruntime bonus can be configured for latency tolerant tasks.
> >> >
> >> > I'm not sure that adjusting what you called "vruntime bonus" is the
> >> > right way to provide some latency because it doesn't only provide a
> >> > wakeup latency bonus but also provides a runtime bonus.
> >>
> >> True, however that's what we already do but _just_ in an hard-coded way.
> >>
> >> A task waking up from sleep gets a 1/2 sched_latency bonus, or a full
> >> one w/o GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some this
> >
> > From a nice and fair PoV, it's not a bonus but the opposite. In fact
> > it's limiting how much credit, the task will keep from its sleep time.
>
> I agree about 'limiting a credit', thus being a _credit_ IMO it's a bonus
> and the limiting happens only with GENTLE_FAIR_SLEEPER.
>
> So, in general, tasks _waking up_ get a (limited) credit, i.e. a
> vruntime bonus.

This looks like nitpicking about terms, so let's move on.

>
> Form the FAIR PoV it is even more a bonus since all the machinery AFAIU
> it's designed to give some vruntime bonus to _non_ CPU bound / batch
> tasks.
>
> That's done to compensate for them being suspended and thus not having a
> chance to consume all their fair CPU share in the previous activation.
>
> > Also keep in mind that this value is clamped by its vruntime so a task
> > can't get bonus
>
> True, at wakeup we clamped it with the SE (normalized) vruntime.
>
> But still since we do:
>
> se->vruntime = max(se->vruntime, cfs_rq->min_vruntime-VRUNTIME_BONUS)
> \---- A ----/ \--------------- B ---------------/
>
> The bigger B is the more likely we are to "penalize" the SE vruntime.

Maybe for your use case, but we should not consider one to be more
likely than the other.

>
>
> >> bonus can be not really required, for others too small.
> >>
> >> Regarding the 'runtime bonus' I think it's kind of unavoidable,
> >> if we really want a latency sensitive task being scheduled
> >> before the others.
> >
> > That's where I disagree.
> > 2 tasks with the same nice priority must get the same running time
> > whatever their latency_nice priority.
>
> This is granted only in the very simple case of CPU bound tasks, and the
> mechanism I describe is not impacting those tasks.
>
> For sleeping tasks, if you consider all the approximations we do (gentle
> fair sleepers to begin with) we will never be "precise".

The fact that there are already some uncertainties doesn't mean we
should add more.


> ... not to speak of tasks migrations or running at different OPPs.
>
> > The only difference should be to select which one will run 1st or make
> > it easy for 1 task to preempt the other one but at the end both should
> > be able to get the same running time
>
> Agree, but I would say that we should aim at getting this result across
> few sched_latencies.
>
> IOW, granting the exact same fair CPU share within a sched_latency
> period with sleeping tasks I'm not convinced is something we already
> have.
>
> >> > It means that one can impact the running time by playing with
> >> > latency_nice whereas the goal is only to impact the wakeup latency.
> >>
> >> Well, but I'm not sure how much you can really gain considering that
> >> this bonus is given only at wakeup time: the task should keep
> >> suspending himself. It would get a better result by just asking for a
> >> lower nice value.
> >>
> >> Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
> >> protected. The same will be for latency_nice.
> >>
> >> Moreover, considering that by default tasks will get what we already
> >> have as hard-coded or less of a bonus, I don't see how easy should be to
> >> abuse.
> >>
> >> To the contrary we can introduce a very useful knob to allow certain
> >> tasks to voluntarily demote themselves and avoid annoying a currently
> >> running task.
> >>
> >> > Instead, it should weight the decision in wakeup_preempt_entity() and
> >> > wakeup_gran()
> >>
> >> In those functions we already take the task prio into consideration
> >> (ref details at the end of this message).
> >>
> >> Lower nice value tasks have more chances to preempt current since they
> >> will have a smaller wakeup_gran, indeed:
> >
> > yes, and this is there to ensure a fair distribution of running time
> > and prevent a task to increase significantly its vruntime compare to
> > others
>
> Don't follow you here :/

we take into account how fast the se's vruntime will increase during
the next sched period and compare the end result to decide which one
will run first, without breaking the fair distribution
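
The -1/0/+1 classification being discussed (mainline's
wakeup_preempt_entity() in kernel/sched/fair.c) can be modeled as a small
standalone sketch; here wakeup_gran(se) is passed in as a plain parameter
and the names are illustrative:

```c
#include <assert.h>

typedef long long s64;

/*
 * Standalone model of the wakeup preemption check:
 * vdiff = curr->vruntime - se->vruntime.
 *  -1: se already got more runtime than current (vdiff <= 0)
 *   0: se is within wakeup_gran(se) behind current
 *  +1: se lags by more than wakeup_gran(se), so it may preempt
 */
static int preempt_classify(s64 curr_vruntime, s64 se_vruntime, s64 gran_se)
{
	s64 vdiff = curr_vruntime - se_vruntime;

	if (vdiff <= 0)
		return -1;
	if (vdiff > gran_se)
		return 1;
	return 0;
}
```

Lowering a task's nice value shrinks its wakeup_gran(se), widening the +1
region, which is how prio already plays into this decision.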

>
> > -1 means that se already got more runtime than current;
> > 0 means that se's vruntime will go above current's vruntime after a
> > runtime duration of sched_min_granularity;
> > 1 means that se got less runtime than current and its vruntime will
> > still be lower than current's even after a runtime duration of
> > sched_min_granularity.
> >
> > IMHO, latency_nice should impact the decision only for case 0 but not
> > the case -1 and 1.
> > That being said, we can extend the case 0 a bit to include the
> > situation where current's vruntime will become greater than se's
> > vruntimes after a runtime duration of sched_min_granularity like
> > below:
>
> (I hope I reformatted the following as you intended it)

yes you have

>
> >                        curr->vruntime
> >             | wakeup_gran(se) | wakeup_gran(curr) |
> > current: se->vruntime  +1     |    0    |   -1
> > new:     se->vruntime  +1     |    0              |   -1
> >
>
> Does that mean penalizing a waking task wrt current, such that a high
> latency_nice task waking up has fewer chances to preempt current?

yes. If a task doesn't need a short scheduling latency, why should it
preempt, for example, a current task with a lower latency_nice?
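
The proposed extension of "case 0" could be sketched as a standalone
model, with vdiff = curr->vruntime - se->vruntime and values down to
-wakeup_gran(curr) also treated as neutral; the exact boundaries and
names are assumptions for illustration:

```c
#include <assert.h>

typedef long long s64;

/*
 * Sketch of the extended classification: the neutral zone (0) now also
 * covers the case where current's vruntime would pass se's vruntime
 * within a runtime duration modeled by gran_curr.
 */
static int preempt_classify_ext(s64 vdiff, s64 gran_se, s64 gran_curr)
{
	if (vdiff > gran_se)
		return 1;		/* se lags a lot: may preempt */
	if (vdiff <= -gran_curr)
		return -1;		/* se is far ahead: don't preempt */
	return 0;			/* extended neutral zone */
}
```

In the neutral zone, latency_nice could then bias the decision without
touching the clear-cut -1/+1 cases, which keeps the fair distribution of
running time intact.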

>
> >> we preempt IFF vdiff(se, current) > wakeup_gran(se)
> >> \----------------/ \-------------/
> >> A B
> >>
> >> While the task's prio affects B, in this proposal latency_nice works
> >> on the A side of the equation above by making it a bit more task
> >> specific.
> >>
> >> That said, it's true that both latency_nice and prio will ultimately
> >> play a role on how much CPU bandwidth a task gets.
> >
> > whereas nice is there to split CPU bandwidth between task according to
> > their prio, latency nice should not
>
> Of course that would be something new, for which I see your concerns.
> However, I still think that making a bit more tunable the task-agnostic
> and hard-coded heuristics we have for A (above) could give benefits for
> tasks that are very short running (compared to a sched_latency) while

If the task is really short running, it doesn't use all its running
time and should easily preempt other tasks at wakeup, without the need
for an extended vruntime bonus or extra CPU bandwidth

> still never really spoil/abuse the FAIR repartitioning of the CPU
> bandwidth.
>
> I'm thinking for example to graphics pipeline tasks which run for <1ms
> every 16ms.
>
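
For scale: a task running ~1 ms out of every 16 ms has a duty cycle of
about 6%, i.e. roughly 64 in SCHED_CAPACITY_SCALE (1024) units. A
back-of-envelope sketch, deliberately ignoring PELT's exponential decay:

```c
#include <assert.h>

/*
 * Rough utilization estimate for a periodic task, expressed in
 * SCHED_CAPACITY_SCALE (1024) units. Real PELT tracking decays
 * exponentially, so this is only an order-of-magnitude model.
 */
static unsigned long approx_util(unsigned long run_us, unsigned long period_us)
{
	return (1024UL * run_us) / period_us;
}
```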

2020-07-20 09:01:51

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Skipping Idle Cores and CPU Search

On 10/07/2020 01:08, chris hyser wrote:

[...]

>> D) Desired behavior:
>
> Reduce the maximum wake-up latency of designated CFS tasks by skipping
> some or all of the idle CPU and core searches by setting a maximum idle
> CPU search value (maximum loop iterations).
>
> Searching 'ALL' as the maximum would be the default and implies the
> current code path which may or may not search up to ALL. Searching 0
> would result in the least latency (shown with experimental results to be
included if/when the patchset goes up). One consideration is that the
maximum length of the search is a function of the size of the LLC
scheduling domain, and this is platform dependent. Whether 'some', i.e. a
numerical value limiting the search, can be used to "normalize" this
latency across differing scheduling domain sizes is under investigation.
> Clearly differing hardware will have many other significant differences,
> but in different sized and dynamically sized VMs running on fleets of
> common HW this may be interesting.

I assume that this task-specific feature could coexist in
select_idle_core() and select_idle_cpu() with the already existing
runtime heuristics (test_idle_cores() and the two sched features
mentioned under E/F) to reduce the idle CPU search space on a busy system.

>> E/F) Existing knobs (and limitations):
>
> There are existing sched_feat: SIS_AVG_CPU, SIS_PROP that attempt to
> short circuit the idle cpu search path in select_idle_cpu() based on
> estimations of the current costs of searching. Neither provides a means

[...]

>> H) Range Analysis:
>
> The knob is a positive integer representing "max number of CPUs to
> search". The default would be 'ALL' which could be translated as
> INT_MAX. '0 searches' translates to 0. Other values represent a max
> limit on the search, in this case iterations of a for loop.

IMHO the opposite use case for this feature (favouring high throughput
over short wakeup latency, as Facebook does) is already cured by the
changes introduced by commit 10e2f1acd010 ("sched/core: Rewrite and
improve select_idle_siblings()"), i.e. by the current implementation of
sis().

It seems that they don't need an additional per-task feature on top of
the default system-wide runtime heuristics.

[...]

2020-07-20 15:21:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [SchedulerTaskPacking] Small background task packing

Hi Parth,

On Thu, 9 Jul 2020 at 14:09, Parth Shah <[email protected]> wrote:
>
> > A) Name:
>
> Small background task packing
>
> > B) Target behaviour:
>
> All fair task wakeups follow a procedure of finding an idle CPU and
> waking the task on this idle CPU. There are two major wakeup paths:
> 1. Slow-path: wake up the task on an idle CPU which is in the shallowest
> idle state, by searching in the highest sched_domain flagged with
> SD_BALANCE_FORK or SD_BALANCE_EXEC.
> 2. Fast-path: wake up the task on an idle CPU in the LLC sched_domain of
> the waker CPU. There are optimizations to bias task placement on the
> prev_cpu or wake_cpu of the task. This path is used in most cases, except
> a few like during fork() and exec().
>
> Picking an idle CPU is a fair default, since it uniformly consumes
> system resources. But not all tasks deserve to wake up an idle core, as
> doing so can hurt power consumption; e.g. a small background task may
> wake up an idle core and then run only for a very short duration.
> Reducing the number of active cores allows the busier cores to boost
> frequency, so saving power can also result in performance benefits.
>
> There is no mechanism in the existing wakeup path to detect small
> background tasks which could be packed on fewer cores.
>
> > C) Existing control paths:
>
> fair:: .select_task_rq = select_task_rq_fair
>
> fair::select_task_rq_fair()
> // 1) Slow-path: find idle CPU with shallowest idle states
> find_idlest_cpu()
>
> // 2) Fast-path: find idle CPU
> fair:select_idle_sibling()
> // Wake up on idle core if available
> fair:select_idle_core()
> // Else wakeup on idle CPU if available
> fair:select_idle_cpu()
> fair:select_idle_smt()
>
>
> There are multiple ways to call fair:select_task_rq_fair();
>
> // 1) try_to_wake_up which should lead to fast-path
> core::try_to_wake_up()
> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>
> // 2) wake_up_new_task which should lead to slow-path
> core::wake_up_new_task()
> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK,0));
>
> // 3) sched_exec which should lead to slow-path
> core::sched_exec()
> dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
> SD_BALANCE_EXEC, 0);
>
> > D) Desired behaviour:
>
> Latency tolerant tasks with small utilization should be packed
> on a busy core rather than waking up a new core/CPU.
>
> Upon detecting small-background tasks, different code-path can be used to
> search for a busy core as described below;
>
> sched/fair.c:
> static inline bool is_small_bg_task(struct task_struct *p)
> {
> if (is_latency_tolerant(p) &&
> (task_util(p) <= (SCHED_CAPACITY_SCALE >> 3)))

This condition seems to be quite arbitrary and probably works on your
current platform but other platforms might want another threshold. Is
there a way to know how much utilization can be added to a CPU which
would not use more capacity than the extra capacity that is provided
by the turbo mode ?

Also you may want to use task_util_est() instead of task_util() to
make sure that the small background task has low chance to become a
large one and not use more capacity than the one provided by turbo at
the end

> return true;
>
> return false;
> }
>
> sched/fair.c: in select_task_rq_fair()
>
> if (sd_flag & SD_BALANCE_WAKE) {
> if (is_small_bg_task(p)) {
> // New proposed wakeup path to do task packing
> new_cpu = select_non_idle_core(p, prev_cpu);
> if (new_cpu >= 0)
> return new_cpu;
> }
> }
>
> where select_non_idle_core() searches for a busy core already running some
> tasks and selects an idle CPU in that busy core to pack the waking task.
>
> Complete code can be found on TurboSched patches [1].
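
A standalone model of the classifier above, with the (admittedly
arbitrary) 1/8th-of-capacity threshold under discussion; the struct
stands in for the relevant task_struct fields and all names are
illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* stands in for is_latency_tolerant(p) and task_util(p) */
struct task_model {
	bool latency_tolerant;
	unsigned long util;
};

/*
 * A task qualifies for packing when it is latency tolerant AND "small",
 * i.e. its utilization is at most 1/8th of a CPU's capacity.
 */
static bool is_small_bg(const struct task_model *p)
{
	return p->latency_tolerant &&
	       p->util <= (SCHED_CAPACITY_SCALE >> 3);
}
```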
>
> > E) Existing knobs (if any): reference to whatever existing tunable
>
> There are no existing knob which can hint the scheduler about the latency
> nature of task. Detecting latency nature of the task can help in
> classifying task as small background tasks to be packed on fewer number of
> cores.
>
> There are user-space hacks to do task packing for background tasks with the
> use of cpuset.cpus or sched_setaffinity() to manually affine the process to
> fewer dedicated cores.
>
> > F) Existing knobs (if any): one paragraph description of the limitations
>
> A sysadmin/user has to define a cpumask for each process (aka task
> pinning), which is static in nature. Pinning the small background tasks
> has multiple limitations:
>
> - In the presence of just one small background task, there is no power
> consumption benefit; it would be preferable to pin it to a busy CPU.
>
> - If a task changes behavior during its life-cycle, the sysadmin has to
> manually pin/unpin such tasks. The user can only classify a task as a
> "background" one and cannot say if and when it will be "small" in
> utilization.
>
> - The sysadmin cannot change the task's affinity mask based on the
> scheduling pattern to obtain optimal performance and energy consumption.
>
> > G) Proportionality Analysis: check the nature of the target behavior
>
> Task packing has to be noninvasive in nature, meaning only tasks which
> are latency tolerant should be packed on fewer cores. Providing this
> behavior needs a run-time tunable knob which can hint the scheduler on
> whether the waking task can be packed or not.
>
> Upon detecting the nature of the task, a specific wakeup path can be followed:
> 1. For latency-tolerant tasks with low utilization, a new proposed
> scheduling wakeup path will be followed to do packing.
> 2. For latency-sensitive tasks, the existing wakeup approach can be used.
>
> > H) Range Analysis: identify meaningful ranges
>
> The new knob can be a binary input accepting 0/1, where 0 means
> latency-sensitive and 1 means latency-tolerant task.
>
> Latency-sensitive tasks (value = 0) search for an idle CPU in only the
> llc sched_domain, while latency-tolerant tasks (value = 1) can go across
> the llc sched_domain (just like in the slow-path), but here in order to
> search for a busy core.
>
> Mapping analysis:
> ================
> The latency_nice proposal [2] offers a [-20, 19] range which can be
> mapped into a binary switch, e.g. using a threshold based function.
>
> However, it is possible to extend the concept of finding busy core by
> limiting the time spent on searching based on the latency_nice value from
> range [-20, 19] where value of 19 indicates searching in the whole chip for
> a busy core, whereas value of 10 could mean search for half of the cores in
> the chip.
>
> > I) System-Wide tuning: which knobs are required
>
> The latency_nice provided knobs should be enough to get the desired
> effect.
>
> > J) Per-Task tuning: which knobs are required
>
> sched_setscheduler() is sufficient.
>
> > K) Task-Group tuning: which knobs are required
>
> A single attribute classifying task-group as latency_nice or
> latency_tolerant is sufficient.
>
>
> > .:: References
> > ==============
>
> [1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
> for longer durations
> https://lkml.org/lkml/2020/1/21/39
> [2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
> Message-ID: [email protected]
> https://lore.kernel.org/lkml/[email protected]
> [3] TurboSched: the return of small-task packing
> https://lwn.net/Articles/792471/
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4
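
The mapping analysis quoted above (latency_nice 19 -> search the whole
chip for a busy core, ~10 -> about half) could be sketched as follows,
assuming a simple linear scaling and hypothetical names:

```c
#include <assert.h>

/*
 * Hypothetical mapping of latency_nice in [-20, 19] to the number of
 * cores scanned for a busy core when packing:
 *   negative values -> latency sensitive, no packing search at all;
 *   19              -> scan the whole chip;
 *   ~10             -> scan about half of the cores.
 * The linear scaling here is an assumption, not the proposed design.
 */
static int packing_search_cores(int latency_nice, int ncores)
{
	if (latency_nice < 0)
		return 0;
	return ncores * (latency_nice + 1) / 20;
}
```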

2020-07-22 18:59:45

by Chris Hyser

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Skipping Idle Cores and CPU Search

On 7/20/20 4:47 AM, Dietmar Eggemann wrote:
> On 10/07/2020 01:08, chris hyser wrote:
>
> [...]
>
>>> D) Desired behavior:
>>
>> Reduce the maximum wake-up latency of designated CFS tasks by skipping
>> some or all of the idle CPU and core searches by setting a maximum idle
>> CPU search value (maximum loop iterations).
>>
>> Searching 'ALL' as the maximum would be the default and implies the
>> current code path which may or may not search up to ALL. Searching 0
>> would result in the least latency (shown with experimental results to be
>> included if/when patchset goes up). One of the considerations is that
>> the maximum length of the search is a function of the size of the LLC
>> scheduling domain and this is platform dependent. Whether 'some', i.e. a
>> numerical value limiting the search can be used to "normalize" this
>> latency across differing scheduling domain sizes is under investigation.
>> Clearly differing hardware will have many other significant differences,
>> but in different sized and dynamically sized VMs running on fleets of
>> common HW this may be interesting.
>
> I assume that this task-specific feature could coexist in
> select_idle_core() and select_idle_cpu() with the already existing
> runtime heuristics (test_idle_cores() and the two sched features
> mentioned under E/F) to reduce the idle CPU search space on a busy system.

Yes. Perhaps a more generalized summary of the feature is that it simply
places a per-task maximum on the number of iterations of the various
'for_each_cpu' loops (whose maximum is platform dependent) in this path.
Any other technique to short circuit the loop below this maximum would be
fine, including the fact that the very first 'idle' check in a loop may
succeed; that is perfectly ok in terms of minimizing the search latency.
This really only kicks in on busy systems, and while system-wide or
scheduling-domain-wide heuristics can reduce the cost to tasks of not
doing something per-task like this, they can't drive the loop iteration
count to 0, because that is bad policy when applied to the wrong tasks or
to too many tasks.
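
The per-task iteration cap described above could be sketched as follows,
with illustrative names; INT_MAX stands in for the 'ALL' default:

```c
#include <assert.h>

/*
 * Sketch of a select_idle_cpu()-style scan with a per-task cap:
 * iterate over candidate CPUs, but never more than max_search times.
 * max_search = 0 skips the search entirely (least wakeup latency);
 * max_search = INT_MAX models the 'ALL' default. Any earlier hit on an
 * idle candidate short-circuits the loop, which is fine too.
 */
static int scan_for_idle(const int *idle, int ncpus, int max_search)
{
	int i;

	for (i = 0; i < ncpus && i < max_search; i++)
		if (idle[i])		/* first idle candidate wins */
			return i;
	return -1;			/* caller falls back to e.g. prev_cpu */
}
```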


>
>>> E/F) Existing knobs (and limitations):
>>
>> There are existing sched_feat: SIS_AVG_CPU, SIS_PROP that attempt to
>> short circuit the idle cpu search path in select_idle_cpu() based on
>> estimations of the current costs of searching. Neither provides a means
>
> [...]
>
>>> H) Range Analysis:
>>
>> The knob is a positive integer representing "max number of CPUs to
>> search". The default would be 'ALL' which could be translated as
>> INT_MAX. '0 searches' translates to 0. Other values represent a max
>> limit on the search, in this case iterations of a for loop.
>
> IMHO the opposite use case for this feature (favour high throughput over
> short wakeup latency (Facebook) is already cured by the changes
> introduced by commit 10e2f1acd010 ("sched/core: Rewrite and improve
> select_idle_siblings()"), i.e. with the current implementation of sis().
>
> It seems that they don't need an additional per-task feature on top of
> the default system-wide runtime heuristics.

Agreed, and I hope I've clarified how the attribute in question should not
affect that, as the default for the attribute is basically "no short cut
because of this"; other heuristics may still apply.

-chrish

2020-07-23 09:32:52

by Dietmar Eggemann

[permalink] [raw]
Subject: [SchedulerWakeupLatency] Skip energy aware task placement

On 23/06/2020 09:29, Patrick Bellasi wrote:

> .:: Scheduler Wakeup Path Requirements Collection Template
> ==========================================================
>
> A) Name: unique one-liner name for the proposed use-case

[SchedulerWakeupLatency] Skip energy aware task placement

> B) Target behaviour: one paragraph to describe the wakeup path issue

The search for the most energy-efficient CPU over the Performance
Domains (PDs) by consulting the Energy Model (EM), i.e. the estimation
of how much energy a PD would consume if the task migrated to one of its
CPUs, adds a certain amount of latency to task placement.

For some tasks this extra latency could be too high. A good example here
are the Android display pipeline tasks, UIThread and RenderThread. They
have to be placed on idle CPUs with a faster wakeup mechanism than the
energy aware wakeup path (A) to guarantee the smallest amount of dropped
or delayed frames (a.k.a. jank).

In mainline Linux there is currently no mechanism that allows latency
sensitive tasks to skip the energy aware wakeup path (A) and take the
fast path (B) instead.

> C) Existing control paths: reference to code paths to justify B)

select_task_rq_fair()
{
	...

	if (wakeup)
		if (asym_cpucapacity && EM && schedutil)
			new_cpu = find_energy_efficient_cpu(); <- (A)
			if (new_cpu >= 0)
				return new_cpu;

	...

	if (unlikely(sd))
		/* slow path */
	else if (sd_flag & wakeup)
		/* fast path */
		new_cpu = select_idle_sibling() {
			if (asym_cpucapacity)
				new_cpu = select_idle_capacity(); <- (B)
			if (new_cpu >= 0)
				return new_cpu;
		}

	...

	return new_cpu;
}

> D) Desired behaviour: one paragraph to describe the desired update

A mechanism for a task to skip the energy aware wakeup (A) and fallback
into the fast path (B).
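
A minimal model of the desired gate, assuming a per-task latency flag
(the flag name and the boolean EAS condition are hypothetical
stand-ins for the real select_task_rq_fair() checks):

```c
#include <assert.h>

enum wakeup_path { PATH_EAS, PATH_FAST };

/*
 * Model: when EAS is available (asym capacity + EM + schedutil) and the
 * task has NOT flagged itself as latency sensitive, take the energy
 * aware path (A); otherwise fall through to the fast path (B).
 */
static enum wakeup_path pick_path(int eas_available, int latency_sensitive)
{
	if (eas_available && !latency_sensitive)
		return PATH_EAS;	/* (A) find_energy_efficient_cpu() */
	return PATH_FAST;		/* (B) select_idle_sibling() */
}
```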

> E) Existing knobs (if any): reference to whatever existing tunable

There are no existing ways to control this behaviour in Linux kernel
mainline.

There is the concept of 'prefer idle' in Android which is tightly
coupled with the proprietary cgroup controller schedtune.

> F) Existing knobs (if any): one paragraph description of the limitations

Schedtune will be replaced by mainline uclamp in upcoming Android
releases. There is no per-task 'prefer idle' interface.

> G) Proportionality Analysis: check the nature of the target behavior

The use case only requires distinguishing whether a task cares about
latency or not.

> H) Range Analysis: identify meaningful ranges

The knob can be defined as latency sensitive (i.e. prefer an idle CPU)
or as not latency sensitive.

Mapping Analysis:

If required by other use-cases, the binary requirement can easily be
covered by a wider, more fine-grained latency-sensitivity range.

> I) System-Wide tuning: which knobs are required

No system-wide tuning required.

> J) Per-Task tuning: which knobs are required

The proposal is a per-task flag, indicating whether the task is latency
sensitive or not.

> K) Task-Group tuning: which knobs are required

Currently Android uses the 'prefer idle' mechanism only on task-groups
and not on individual tasks.

Therefore a per task-group implementation would be required. The
implementation should respect the cgroup resource distribution models
[1], [2].

> .:: References
> ==============

[1] LWN: The many faces of "latency nice"
https://lwn.net/Articles/820659

[2] Control Group v2: Resource Distribution Models
https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst

2020-07-23 09:34:54

by Parth Shah

[permalink] [raw]
Subject: Re: [SchedulerTaskPacking] Small background task packing

Hi Vincent,

On 7/20/20 8:50 PM, Vincent Guittot wrote:
> Hi Parth,
>
> On Thu, 9 Jul 2020 at 14:09, Parth Shah <[email protected]> wrote:
>>
>>> A) Name:
>>
>> Small background task packing
>>
>>> B) Target behaviour:
>>
>> All fair task wakeup follows a procedure of finding an idle CPU and
>> waking the task on this idle CPU. There are two major wakeup paths:
>> 1. Slow-path: Wake up the task on an idle CPU which is in the shallowest
>> idle states by searching in the highest sched_domain flagged with
>> SD_BALANCE_FORK or SD_BALANCE_EXEC.
>> 2. Fast-path: Wake up the task on an idle CPU in the LLC sched_domain of
>> the waker CPU. There are optimization to bias task placement on prev_cpu or
>> wake_cpu of the task. This path is majorly used except in few cases like
>> during fork() and exec().
>>
>> This assumption of picking an idle CPU is fair in-order to uniformly
>> consume system resources. But not all tasks deserves to wakeup an idle core
>> as it can hurt power consumption. For e.g. like small background tasks
>> waking up an idle core and runs only for very small duration. Reducing
>> number of active cores allows busier cores to boost frequency and hence
>> saving power can also result in performance benefits.
>>
>> There is no mechanism in existing wake up path to detect small
>> background tasks which can be packed on fewer cores.
>>
>>> C) Existing control paths:
>>
>> fair:: .select_task_rq = select_task_rq_fair
>>
>> fair::select_task_rq_fair()
>> // 1) Slow-path: find idle CPU with shallowest idle states
>> find_idlest_cpu()
>>
>> // 2) Fast-path: find idle CPU
>> fair:select_idle_sibling()
>> // Wake up on idle core if available
>> fair:select_idle_core()
>> // Else wakeup on idle CPU if available
>> fair:select_idle_cpu()
>> fair:select_idle_smt()
>>
>>
>> There are multiple ways to call fair:select_task_rq_fair();
>>
>> // 1) try_to_wake_up which should lead to fast-path
>> core::try_to_wake_up()
>> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>>
>> // 2) wake_up_new_task which should lead to slow-path
>> core::wake_up_new_task()
>> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK,0));
>>
>> // 3) sched_exec which should lead to slow-path
>> core::sched_exec()
>> dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
>> SD_BALANCE_EXEC, 0);
>>
>>> D) Desired behaviour:
>>
>> Latency tolerant tasks with small utilization should be packed
>> on a busy core rather than waking up a new core/CPU.
>>
>> Upon detecting small-background tasks, different code-path can be used to
>> search for a busy core as described below;
>>
>> sched/fair.c:
>> static inline bool is_small_bg_task(struct task_struct *p)
>> {
>> if (is_latency_tolerant(p) &&
>> (task_util(p) <= (SCHED_CAPACITY_SCALE >> 3)))
>
> This condition seems to be quite arbitrary and probably works on your
> current platform but other platforms might want another threshold. Is
> there a way to know how much utilization can be added to a CPU which
> would not use more capacity than the extra capacity that is provided
> by the turbo mode ?
>

I guess not, because we would have to predict whether adding a task to
the CPU will hit the power limit or thermal limit.
Predicting the power limit is complicated: one complexity is finding the
current idle-state of all CPUs in the DIE, which is not possible on
servers (the hardware may use a different C-state than the one requested
by the cpuidle driver).
Another complexity is detecting the thermal limit, which requires
determining the nature of the workload; e.g. SIMD/FPU based workloads
will burn more power than ALU workloads and heat up the core.

So in the end this is a heuristic which might need to be tuned for each
system. We could either have an arch-specific threshold defined in the
kernel, a kernel parameter, or something like that.

> Also you may want to use task_util_est() instead of task_util() to
> make sure that the small background task has low chance to become a
> large one and not use more capacity than the one provided by turbo at
> the end
>

I agree. Using task_util_est() will be better here for estimating
near-future utilization.


>> return true;
>>
>> return false;
>> }
>>
>> sched/fair.c: in select_task_rq_fair()
>>
>> if (sd_flag & SD_BALANCE_WAKE) {
>> if (is_small_bg_task(p)) {
>> // New proposed wakeup path to do task packing
>> new_cpu = select_non_idle_core(p, prev_cpu);
>> if (new_cpu >= 0)
>> return new_cpu;
>> }
>> }
>>
>> where select_non_idle_core() searches for a busy core already running some
>> tasks and selects an idle CPU in that busy core to pack the waking task.
>>
>> Complete code can be found on TurboSched patches [1].
>>
>>> E) Existing knobs (if any): reference to whatever existing tunable
>>
>> There are no existing knob which can hint the scheduler about the latency
>> nature of task. Detecting latency nature of the task can help in
>> classifying task as small background tasks to be packed on fewer number of
>> cores.
>>
>> There are user-space hacks to do task packing for background tasks with the
>> use of cpuset.cpus or sched_setaffinity() to manually affine the process to
>> fewer dedicated cores.
>>
>>> F) Existing knobs (if any): one paragraph description of the limitations
>>
>> Sysadmin/user has to define cpumask for each process (aka task pinning)
>> which is static in nature. There are multiple limitations to pin the small
>> background tasks;
>>
>> - In presence of just one small background task, there is no power
>> consumption benefit here. It would be preferable to pin it to busy CPU.
>>
>> - If a task changes the behavior in its life-cycle then sysadmin will have
>> to manually pin/unpin such tasks. This is limitation of user to classify
>> tasks as only "background "one and cannot say if and when it will be
>> "small" in utilization.
>>
>> - Sysadmin cannot change the task's affinity mask based on the scheduling
>> pattern to give most optimal performance and energy consumption.
>>
>>> G) Proportionality Analysis: check the nature of the target behavior
>>
>> Task packing has to be noninvasive in nature, meaning only the tasks which
>> are latency tolerant should be packed on fewer cores. Providing this
>> behavior needs a run-time tunable knob which can hint the scheduler on
>> whether the waking task can be packed or not.
>>
>> Upon detecting the nature of the task, a specific wakeup path can be followed:
>> 1. On latency-tolerant tasks with low utilization, a new proposed
>> scheduling wakeup path will be followed to do packing
>> 2. For latency-sensitive tasks, the existing wakeup approach can be used.
>>
>>> H) Range Analysis: identify meaningful ranges
>>
>> The new knob can be binary input accepting 0/1, where 0 means
>> latency-sensitive and 1 means latency-tolerant task.
>>
>> Latency-sensitive tasks (value = 0) can search idle CPU in only the llc
>> sched_domain while the latency-tolerance (value = 1) tasks can go across
>> llc sched_domain (just like in slow-path) but here in-order to search for a
>> busy core.
>>
>> Mapping analysis:
>> ================
>> The latency_nice proposal [2] offers a [-20, 19] range which can be
>> mapped into a binary switch, e.g. using a threshold based function.
>>
>> However, it is possible to extend the concept of finding busy core by
>> limiting the time spent on searching based on the latency_nice value from
>> range [-20, 19] where value of 19 indicates searching in the whole chip for
>> a busy core, whereas value of 10 could mean search for half of the cores in
>> the chip.
>>
>>> I) System-Wide tuning: which knobs are required
>>
>> The latency_nice provided knobs should be enough to get the desired
>> effect.
>>
>>> J) Per-Task tuning: which knobs are required
>>
>> sched_setscheduler() is sufficient.
>>
>>> K) Task-Group tuning: which knobs are required
>>
>> A single attribute classifying task-group as latency_nice or
>> latency_tolerant is sufficient.
>>
>>
>>> .:: References
>>> ==============
>>
>> [1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
>> for longer durations
>> https://lkml.org/lkml/2020/1/21/39
>> [2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
>> Message-ID: [email protected]
>> https://lore.kernel.org/lkml/[email protected]
>> [3] TurboSched: the return of small-task packing
>> https://lwn.net/Articles/792471/
>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4

Thanks,
Parth

2020-07-29 14:02:14

by Quentin Perret

[permalink] [raw]
Subject: Re: [SchedulerWakeupLatency] Skip energy aware task placement

On Thursday 23 Jul 2020 at 11:31:39 (+0200), Dietmar Eggemann wrote:
> The search for the most energy-efficient CPU over the Performance
> Domains (PDs) by consulting the Energy Model (EM), i.e. the estimation
> on how much energy a PD consumes if the task migrates to one of its
> CPUs, adds a certain amount of latency to task placement.
>
> For some tasks this extra latency could be too high. A good example here
> are the Android display pipeline tasks, UIThread and RenderThread. They
> have to be placed on idle CPUs with a faster wakeup mechanism than the
> energy aware wakeup path (A) to guarantee the smallest amount of dropped
> or delayed frames (a.k.a. jank).
>
> In Linux kernel mainline there is currently no mechanism for latency
> sensitive tasks to allow that the energy aware wakeup path (A) is
> skipped and the fast path (B) taken instead.

FWIW, Android has indeed been using a similar concept for many years
(the 'prefer-idle' flag you mentioned below) and that proved to be
critical to meet UI requirements on a number of devices. So +1 to have
this use case covered as part of the latency-nice work.

<...>
> > K) Task-Group tuning: which knobs are required
>
> Currently Android uses the 'prefer idle' mechanism only on task-groups
> and not on individual tasks.

To add a little bit of context here, Android uses cgroups extensively
for task classification. The tasks of any currently used application are
placed in a cgroup based on their 'role'. For instance, all tasks that
belong to the currently visible application are classified as 'top-app',
and are grouped in a cgroup with 'prefer-idle' set (+ other attributes,
e.g. uclamp). The other tasks in the system are usually placed in other
cgroups (e.g. 'background'), with 'prefer-idle' cleared.

When the user switches from one app to the other, their roles are
swapped. At this point userspace needs to move all tasks of an app from
one cgroup to another, to reflect the change in role. But interestingly,
all tasks of the app belong to the same process, so userspace only needs
to write the PID of the parent task to the new cgroup to move all of
them at once. (Note: we're also experimenting with having cgroups per-app
rather than per-role, and with changing the cgroup parameters instead of
migrating between groups, but that doesn't change the conclusion below).

So, having 'only' a per-task API for latency-nice may not be sufficient
to cover the Android use-case, as that would force userspace to iterate
over all the tasks in an app when switching between roles, and that will
come at a certain cost (currently unknown). However, that will need to
be confirmed experimentally, and starting 'simple' with a per-task API
would definitely be a step in the right direction.

Thanks,
Quentin

2020-11-12 06:18:42

by Parth Shah

[permalink] [raw]
Subject: Re: Scheduler wakeup path tuning surface: Interface discussion

I was analyzing the LPC 2020 discussion regarding the latency-nice
interface and have the below points to initiate further discussion:

1. There was consensus that having interface like "Latency-nice" to
provide scheduler hints about task latency requirement can be very useful.

2. There are two use-cases regarding changing the number of CPUs to be
searched in the select_idle_cpu path:
- milli-second optimization: perform more scans to find idle CPUs, to
reduce latency in milliseconds
- micro-second optimization: perform fewer searches and queue the task
on any CPU when the system is overloaded

These two optimizations are contradictory, since one requires reducing
the search space for latency sensitive tasks while the other wants to
increase it. Though we can think about tuning the select_idle_cpu path by
keeping a mask of idle CPUs or applying other tricks, it will always be a
trade-off between searching more or less, and that's where a
latency-nice-like attribute can be useful for task-aware decisions.

3. Using a range is non-deterministic, meaning it is difficult to decide
whether a task should be marked with a value of -18 or -19, and the
impact on system performance between such adjacent values is
non-deterministic.

Given that, should we even think of using "flags" or another type instead
of a range, where we just say whether a task is latency-sensitive or not
and don't classify whether a task is "relatively" more latency-nice than
others?


Best,
Parth