2017-08-25 10:20:27

by Patrick Bellasi

Subject: [RFC 0/3] Utilization estimation for FAIR tasks

This posting presents a possible solution for a problem already
discussed at last year's LPC and more recently discussed at the OSPM
Summit [1].
The aim of this proposal is to improve some PELT behaviors to make it a
better fit for the description of tasks which are common in embedded
mobile use-cases, without affecting other types of workloads.

Hereafter we start with a short description of the observed issues, then
we summarize the already explored solution. Finally we introduce a possible
alternative approach, which is the one proposed by this RFC. The ultimate
goal of this posting is to prepare the ground for a fruitful discussion at
the upcoming LPC.

.:: Problem description
=======================

The "Per Entity Load Tracking" (PELT) algorithm is a really good and
run-time efficient estimator of task sizes. However, it has some hardcoded
properties which make it not generic enough to describe all possible
classes of tasks.

PELT's slow ramp-up
-------------------

For example, one such parameter is the time constant used to define the
rate of increase/decay of the geometric series. This constant is
by default defined to optimize the description of periodic tasks with a
32ms period. This means that it takes 32ms for a task's utilization to
ramp up from 0% to 50% and around 100ms to generate ~90% utilization as
shown in this figure:

https://drive.google.com/open?id=0B5oRu4ZO-uuKb3NGa2FvWC1pZzQ

which has been obtained using the simple (Python based) PELT simulator
available here [2] (should you like to play with some other settings).
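As a rough illustration of where those numbers come from, the geometric series can be sketched in a few lines of Python (a toy model written for this discussion, not the kernel code nor the simulator in [2]; the 1ms step and the 0..1 normalization are simplifying assumptions):

```python
# Toy model of PELT's geometric series: each 1ms step decays the
# signal so that contributions halve every 32ms (the default).
HALFLIFE_MS = 32
Y = 0.5 ** (1.0 / HALFLIFE_MS)

def pelt_step(util, running):
    """Advance the normalized (0..1) utilization by one 1ms step."""
    util *= Y
    if running:
        util += 1.0 - Y  # contribution of one fully-busy 1ms period
    return util

def run_for(util, busy_ms):
    """Let the task run flat-out for busy_ms milliseconds."""
    for _ in range(busy_ms):
        util = pelt_step(util, True)
    return util
```

Starting from zero, run_for(0.0, 32) gives exactly 0.5 (50% after one half-life) and run_for(0.0, 100) about 0.885, matching the ~90% after ~100ms shown in the figure.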

This behavior is usually described as PELT having a "slow ramp-up",
especially when compared to alternative (non mainline) approaches
(e.g. Qualcomm's proposed "Window Assisted Load Tracking" (WALT) [3]),
and this is mainly affecting mobile workloads.

In the mobile world, where some of the most important tasks are
synchronized with the frame buffer refresh rate, it's quite common to
run tasks on a 16ms period. This 16ms window is the time in
which everything happens for the generation of a new frame, thus it's of
paramount importance to know exactly how much CPU bandwidth is required
by every task running in such a time frame.

Potentially, PELT's ramp-up time can be reduced by redefining its time
constant, for example to 8ms. Still, it will take at least two frames
for PELT to better describe a task which runs for 4ms every 16ms.
However, as can be seen from this figure:

https://drive.google.com/open?id=0B5oRu4ZO-uuKeEN0aGpLRlpOWjA

the utilization of such a task will keep oscillating in the range
~[140, 400], with just its average being the expected ~25%.
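That steady-state range can be reproduced with the same kind of toy model (again a hypothetical sketch, not the kernel implementation; values are scaled to 1024 as in the kernel):

```python
# Toy model: a task running 4ms out of every 16ms, tracked by PELT
# with the time constant reduced to an 8ms half-life.
HALFLIFE_MS = 8
Y = 0.5 ** (1.0 / HALFLIFE_MS)

def one_period(util, busy_ms=4, period_ms=16):
    """One activation period, advanced in 1ms steps."""
    for _ in range(busy_ms):
        util = util * Y + (1.0 - Y)  # running: decay plus accrual
    for _ in range(period_ms - busy_ms):
        util = util * Y              # sleeping: decay only
    return util

util = 0.0
for _ in range(100):                 # let the oscillation settle
    util = one_period(util)

trough = util * 1024                 # right before the next activation
peak = (util * Y**4 + (1.0 - Y**4)) * 1024  # right after 4ms of running
```

This settles at a trough of ~141 and a peak of ~400 (out of 1024), i.e. the ~[140, 400] oscillation band, while only the average sits at the expected ~25% (~256).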

PELT's oscillations
-------------------

PELT's oscillations are part of its design and they cannot be easily
removed by just tuning its parameters. To a certain extent they are also
required, for example when PELT is used to describe a CPU's
utilization. In this case, once a CPU becomes idle its utilization
represents what is considered its "blocked load", i.e. the utilization
which is potentially going to be reclaimed shortly by tasks
temporarily sleeping on that CPU.
The blocked load signal is(*) progressively decayed, thus preventing a
task from indefinitely reserving its utilization on the CPU where it
last executed.

Despite being unavoidable, these oscillations sometimes make PELT a
much more variable signal than required.
Because of these oscillations, a scheduler decision risks being wrong
right after it has been taken. Moreover, oscillations increase the
chances of inconsistent views among different subsystems, e.g.
the FAIR scheduler and schedutil.

For example, let's consider the case of a task which wakes up with a
utilization just below a certain threshold value which is critical for a
proper CPU/frequency selection. In this case the scheduler can pick a
CPU which will no longer be the optimal fit just 1ms later, while
schedutil can pick a lower than optimal OPP. This risk is increased by
the fact that the utilization of a task is pre-decayed before being
enqueued on a CPU, as we already do (sometimes) in mainline and we are
going to do even more once certain improvements [4] currently under
discussion on the list are eventually merged.

One last issue of the current PELT implementation is that utilization
decay is never clamped, thus a usually long running task, which wakes up
only once in a while, can have its utilization completely decayed.
Thus, it will take the longest time to ramp up from 0%, while
schedutil will have to traverse all the available OPPs, from minimum to
maximum.

.:: Possible solutions
======================

As a possible solution to the aforementioned problems, at the previous
LPC Paul Turner proposed to implement a "decay clamping" mechanism.
The fundamental idea is that, once a task is sleep-suspended, its
utilization is decayed only for a certain (configurable) amount of
time, thus actually retaining at wakeup time part of the utilization
built up during its last activation.

Such an approach has been implemented by Morten, and the prototype
implementation was used for testing and evaluation. Our results have been
presented at the last OSPM [1] and the overall impression was that this
approach is not up to the job.
Moreover, being a fundamental modification of how the PELT signal is
actually tracked/updated, there are many implementation details which
potentially open the door to subtle bugs.

As a possible alternative solution, this series presents a first prototype
for a new signal built "on top of PELT". The fundamental idea is to keep
PELT exactly as it is and use it as an "estimator" which feeds information
into an "aggregator".

From a utilization standpoint, PELT provides the most complete description
of the bandwidth required by a task once the task completes an activation.
Indeed, once a task goes to sleep we know exactly how much CPU it
required, since the PELT signal has built up all the corresponding
utilization.

It is pretty straightforward to keep track of this valuable piece of
information each time a task is dequeued. The task's utilization can
also be aggregated with that of its previous activations, thus building a
useful statistic on how much utilization a task generally requires
at each activation.
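This aggregation step can be sketched as follows (hypothetical Python pseudocode of the idea only, not the kernel implementation; the 1/4 sample weight mirrors the EWMA weight chosen in the first patch, and all names are made up for illustration):

```python
class TaskUtilEst:
    """Aggregate PELT utilization samples taken at task dequeue time."""
    NEW_SAMPLE_WEIGHT = 0.25  # assumed weight of each new sample

    def __init__(self):
        self.last = 0.0  # PELT util_avg at the end of the last activation
        self.ewma = 0.0  # moving average over past activations

    def on_dequeue(self, util_avg):
        """Task goes to sleep: consolidate one utilization sample."""
        self.last = util_avg
        self.ewma += self.NEW_SAMPLE_WEIGHT * (util_avg - self.ewma)

    def estimate(self, util_avg):
        """Estimated utilization: never below the live PELT signal."""
        return max(util_avg, self.last, self.ewma)
```

A task repeatedly observed at ~400 keeps an estimate around 400 even when its PELT signal has been decayed across a long sleep.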

The aggregated values can be easily used by both the scheduler and
schedutil, via a new pair of {cpu,task}_util_est() methods which are simple
wrappers of the existing {cpu,task}_util() ones. The final result is
something which mimics some of the benefits of WALT, but using a streamlined
implementation on top of PELT.
Details on how these wrapper methods are implemented and used are described
in the changelogs of the following patches.

To summarize, what we propose here is a "kind of" simple low-pass filter on
top of PELT, whose main benefits are:

1. Simple implementation on top of PELT, thus not risking breaking such
a fundamental component.

2. Better separation of concerns, where PELT is used as a fast-dynamics
estimator while util_est is a slower-dynamics aggregator.

3. The proposed simple implementation can be further extended to
introduce more advanced features related to task behavior profiling,
still without requiring modifications to PELT.

4. It masks the fast decay of the utilization signal when PELT is
sped up, by tuning its time constant, to reduce the utilization's
build-up time.

5. PELT oscillations are completely masked, thus providing an overall
smoother and more consistent representation of task requirements, and
reducing the risk of suboptimal or misaligned decisions between the
scheduler and schedutil.

Finally, the proposed solution is simple because it focuses on just the
entities of interest for the major decisions involving the usage of the
utilization signal. Specifically, the new util_est signal is updated just
for SEs which are actual tasks and only for CPUs' top-level RQs.
The former because they are the only entities actually "allocated on CPUs"
by the FAIR scheduler, and the latter because they represent the only
entities of interest for the frequency selections operated by schedutil.
In other words, task group related concepts are not affected at all by
the proposed implementation, which is IMHO another big advantage of the
proposed solution.

The current implementation is to be considered an initial prototype,
which is however already usable to run some interesting tests and
experiments, such as the ones available here:

https://gist.github.com/derkling/0d07b97ca18cc5eac25e51404e81169f

Among the different results, when util_est is in use we observed:

- A significant increase in residency at the highest OPP for an example
task running for 120ms every 300ms (40% duty-cycle).
Such a task, in terms of utilization, is definitely a big one and
with util_est schedutil immediately switches to the highest OPP as
soon as the task wakes up, instead of progressively scaling up.

- A smoother OPP reduction when a task switches from being big to being
small, which introduces an interesting hysteresis on OPP scaling
down.

- A perfect follow-up of PELT's utilization when a task switches from
being small to being a big one, which (eventually) allows exploiting a
reduced ramp-up PELT configuration.

- A proper migration of the CPU's estimated utilization when a task
migrates across different CPUs, which does not require any
modification to PELT itself.

Some of the open points of discussion for potential improvements are:

- better selection of the value to aggregate,
for example by considering the average value between the task's
utilization at enqueue and dequeue time (which is likely a better
representation of the real task utilization)

- extending the wrapper methods to modulate which estimation to return
depending on task requirements

- improving the definition of the estimated utilization of a RQ,
since in this prototype it simply tracks the last value without any
kind of historical aggregation, which seems to be good enough but
maybe can be made smarter.

.:: Patches organization
========================

The first patch of this series introduces the util_est "aggregator" to
track the estimated utilization of both SEs (tasks) and (top-level) RQs.
The two following patches present a possible integration with the
scheduler's LB path and schedutil's frequency selection policies.

Cheers, Patrick

.:: References
==============
[1] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
video: http://youtu.be/adnSHPBGS-w
[2] https://gist.github.com/derkling/6d1d3a3164b0291ef64fa409f61a81cb
[3] https://lkml.org/lkml/2016/10/28/84
[4] https://patchwork.kernel.org/patch/9876769/

(*) Or better, it "should be", since this is currently not guaranteed to
happen on NO_HZ IDLE CPUs. But this is a different issue, outside the
scope of this posting.

Patrick Bellasi (3):
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB
sched/cpufreq_schedutil: use util_est for OPP selection

include/linux/sched.h | 48 ++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 7 ++-
kernel/sched/debug.c | 8 +++
kernel/sched/fair.c | 106 +++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 4 ++
5 files changed, 169 insertions(+), 4 deletions(-)

--
2.14.1


2017-08-25 10:20:37

by Patrick Bellasi

Subject: [RFC 1/3] sched/fair: add util_est on top of PELT

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, this can result in a CPU running a big task being temporarily
represented as almost empty. This leads to a race condition where
other tasks can be potentially allocated on a CPU which just started to run
a big task which slept for a relatively long period.

Moreover, the utilization of a task is, by PELT definition, a continuously
changing metric. This contributes to making some decisions based on the
value of PELT's utilization almost instantly outdated.

For all these reasons, a more stable signal could probably do a better job
of representing the expected/estimated utilization of a SE/RQ. Such a
signal can be easily created on top of PELT, by still using it as an
estimator which produces values to be aggregated once meaningful events
happen.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

util_est(se) = max(se::util_avg, f(se::util_avg@dequeue_times))

This allows remembering how big a task has been reported to be by PELT in
its previous activations, via the function f(se::util_avg@dequeue_times).

If a task should change its behavior and run even longer in a new
activation, after a certain time util_est will just track the original PELT
signal (i.e. se::util_avg).

For a (top-level) RQ, the estimated utilization is simply defined as:

util_est(rq) = max(rq::util_est, rq::util_avg)

where:

rq::util_est = sum(se::util_est), for each RUNNABLE SE on that CPU

It's worth noting that the estimated utilization is tracked only for
entities of interest, specifically:
- SEs which are tasks,
since we want to better support task placement decisions
- top-level CPU RQs,
since we want to better support frequency selection for a CPU
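The RQ-side bookkeeping described above boils down to a running sum updated at enqueue/dequeue time, which can be sketched like this (a hypothetical Python model of the scheme only, not the kernel code; names are made up for illustration):

```python
class RqUtilEst:
    """Sum of the util_est of the tasks RUNNABLE on one CPU."""

    def __init__(self):
        self.util_est = 0.0

    def enqueue(self, task_util_est):
        self.util_est += task_util_est

    def dequeue(self, task_util_est, nr_running_left):
        # When the last FAIR task leaves, the estimate is 0 by
        # definition; clamp to 0 also when this activation ran
        # longer than the task's previous ones.
        if nr_running_left == 0:
            self.util_est = 0.0
        else:
            self.util_est = max(self.util_est - task_util_est, 0.0)

    def estimate(self, rq_util_avg):
        # util_est(rq) = max(rq::util_est, rq::util_avg)
        return max(self.util_est, rq_util_avg)
```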

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 48 ++++++++++++++++++++++++++++++++++++
kernel/sched/debug.c | 8 ++++++
kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c28b182c9833..8d7bc55f68d5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,6 +26,7 @@
#include <linux/signal_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
+#include <linux/average.h>

/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -277,6 +278,16 @@ struct load_weight {
u32 inv_weight;
};

+/**
+ * Utilization's Exponential Weighted Moving Average (EWMA)
+ *
+ * Support functions to track an EWMA for the utilization of SEs and RQs. New
+ * samples will be added to the moving average each time a task completes an
+ * activation. Thus the weight is chosen so that the EWMA will be relatively
+ * insensitive to transient changes to the task's workload.
+ */
+DECLARE_EWMA(util, 0, 4);
+
/*
* The load_avg/util_avg accumulates an infinite geometric series
* (see __update_load_avg() in kernel/sched/fair.c).
@@ -336,8 +347,45 @@ struct sched_avg {
u32 period_contrib;
unsigned long load_avg;
unsigned long util_avg;
+
+ /* Utilization estimation */
+ struct ewma_util util_ewma;
+ struct util_est {
+ unsigned long ewma;
+ unsigned long last;
+ } util_est;
};

+/* Utilization estimation policies */
+#define UTIL_EST_MAX_EWMA_LAST 0 /* max(sa->util_est.ewma, sa->util_est.last) */
+#define UTIL_EST_EWMA 1 /* sa->util_est.ewma */
+#define UTIL_EST_LAST 2 /* sa->util_est.last */
+
+/* Default policy used by utilization estimation */
+#define UTIL_EST_POLICY UTIL_EST_MAX_EWMA_LAST
+
+/**
+ * util_est: estimated utilization for a given entity (i.e. SE or RQ)
+ *
+ * Depending on the selected utilization estimation policy, the estimated
+ * utilization of a SE or RQ is returned by this function.
+ * Supported policies are:
+ * UTIL_EST_LAST: the value of the PELT signal the last time a SE has
+ * completed an activation, i.e. it has been dequeued because
+ * of a sleep
+ * UTIL_EST_EWMA: the exponential weighted moving average of all the past
+ * UTIL_EST_LAST samples
+ * UTIL_EST_MAX_EWMA_LAST: the maximum among the previous two metrics
+ */
+static inline unsigned long util_est(struct sched_avg *sa, int policy)
+{
+ if (likely(policy == UTIL_EST_MAX_EWMA_LAST))
+ return max(sa->util_est.ewma, sa->util_est.last);
+ if (policy == UTIL_EST_EWMA)
+ return sa->util_est.ewma;
+ return sa->util_est.last;
+}
+
struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cfd84f79e075..17e293adb7f0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -399,6 +399,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
#ifdef CONFIG_SMP
P(se->avg.load_avg);
P(se->avg.util_avg);
+ P(se->avg.util_est.ewma);
+ P(se->avg.util_est.last);
#endif

#undef PN_SCHEDSTAT
@@ -521,6 +523,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "util_est.ewma",
+ cfs_rq->avg.util_est.ewma);
+ SEQ_printf(m, " .%-30s: %lu\n", "util_est.last",
+ cfs_rq->avg.util_est.last);
SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
atomic_long_read(&cfs_rq->removed_load_avg));
SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
@@ -966,6 +972,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.util_sum);
P(se.avg.load_avg);
P(se.avg.util_avg);
+ P(se.avg.util_est.ewma);
+ P(se.avg.util_est.last);
P(se.avg.last_update_time);
#endif
P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d5868771cb3..a4ec1b8c4755 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -751,6 +751,10 @@ void init_entity_runnable_average(struct sched_entity *se)
sa->util_avg = 0;
sa->util_sum = 0;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
+
+ ewma_util_init(&sa->util_ewma);
+ sa->util_est.ewma = 0;
+ sa->util_est.last = 0;
}

static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -4880,6 +4884,21 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static inline int task_util(struct task_struct *p);
+static inline int task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+ /*
+ * Update (top level CFS) RQ estimated utilization.
+ * NOTE: here we assumes that we never change the
+ * utilization estimation policy at run-time.
+ */
+ cfs_rq->avg.util_est.last += task_util_est(p);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4932,9 +4951,49 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);

+ util_est_enqueue(p);
hrtick_update(rq);
}

+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+ int task_sleep = flags & DEQUEUE_SLEEP;
+ long util_est;
+
+ /*
+ * Update (top level CFS) RQ estimated utilization
+ *
+ * When *p is the last FAIR task then the RQ's estimated utilization
+ * is 0 by its definition.
+ *
+ * Otherwise, in removing *p's util_est from the current RQ's util_est
+ * we should account for cases where this last activation of *p was
+ * longer than the previous ones. In these cases as well we set the
+ * new estimated utilization for the CPU to 0.
+ */
+ util_est = (cfs_rq->nr_running > 1)
+ ? cfs_rq->avg.util_est.last - task_util_est(p)
+ : 0;
+ if (util_est < 0)
+ util_est = 0;
+ cfs_rq->avg.util_est.last = util_est;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * about the task size. This is done by storing the last PELT value
+ * for this task and using this value to load another sample in the
+ * EWMA for the task.
+ */
+ if (task_sleep) {
+ p->se.avg.util_est.last = task_util(p);
+ ewma_util_add(&p->se.avg.util_ewma, task_util(p));
+ p->se.avg.util_est.ewma = ewma_util_read(&p->se.avg.util_ewma);
+ }
+}
+
static void set_next_buddy(struct sched_entity *se);

/*
@@ -4991,6 +5050,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);

+ util_est_dequeue(p, flags);
hrtick_update(rq);
}

@@ -5486,7 +5546,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}

-static inline int task_util(struct task_struct *p);
static int cpu_util_wake(int cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5931,6 +5990,13 @@ static inline int task_util(struct task_struct *p)
return p->se.avg.util_avg;
}

+static inline int task_util_est(struct task_struct *p)
+{
+ struct sched_avg *sa = &p->se.avg;
+
+ return util_est(sa, UTIL_EST_POLICY);
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
--
2.14.1

2017-08-25 10:20:41

by Patrick Bellasi

Subject: [RFC 3/3] sched/cpufreq_schedutil: use util_est for OPP selection

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.

For example, since the task utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU. This introduces some latency
before schedutil will be able to detect the best frequency required by
that task.

Moreover, the PELT signal's build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time will
increase even more and further delay the selection of the actual
frequency which better serves the task's requirements.
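The extra delay can be quantified with a small scale-invariance toy model (a sketch under simplifying assumptions: a fixed capacity scaling factor, 1ms steps, 32ms half-life; a real ramp would change frequency, and thus the scaling factor, along the way):

```python
# Toy model of scale-invariant PELT accumulation: at a lower
# frequency, each 1ms of running contributes proportionally less.
HALFLIFE_MS = 32
Y = 0.5 ** (1.0 / HALFLIFE_MS)

def ms_to_reach(target, capacity_scale):
    """1ms busy steps needed for the (0..1) utilization to reach
    'target' when each contribution is scaled by 'capacity_scale'."""
    util, ms = 0.0, 0
    while util < target:
        util = util * Y + (1.0 - Y) * capacity_scale
        ms += 1
    return ms
```

At full capacity an always-running task crosses a 20% utilization threshold after 11ms, while at half capacity the same task needs 24ms: more than twice the time before schedutil can see the need for a higher OPP.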

In order to reduce this kind of latency, this patch integrates the usage
of the CPU's estimated utilization in the sugov_get_util() function.

The estimated utilization of a CPU is defined to be the maximum between its
PELT's utilization and the sum of the estimated utilization of each task
currently RUNNABLE on that CPU.
This allows properly representing the expected utilization of a CPU which,
for example, has just got a big task running after a long sleep period,
and ultimately allows selecting the best frequency to run a task
right after it wakes up.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 910a915fef8b..aacba9f7202e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -161,7 +161,12 @@ static void sugov_get_util(unsigned long *util, unsigned long *max)

cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());

- *util = min(rq->cfs.avg.util_avg, cfs_max);
+ *util = rq->cfs.avg.util_avg;
+ if (sched_feat(UTIL_EST))
+ *util = max(*util,
+ util_est(&rq->cfs.avg, UTIL_EST_MAX_EWMA_LAST));
+ *util = min(*util, cfs_max);
+
*max = cfs_max;
}

--
2.14.1

2017-08-25 10:21:06

by Patrick Bellasi

Subject: [RFC 2/3] sched/fair: use util_est in LB

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU, which is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in some load balancer paths.

The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows properly representing the expected utilization of a CPU which,
for example, has just got a big task running after a long sleep
period.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 4 ++++
2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4ec1b8c4755..2593da1d1465 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5985,6 +5985,32 @@ static int cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}

+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running after a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked load" in describing the
+ * potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+ struct sched_avg *sa = &cpu_rq(cpu)->cfs.avg;
+ unsigned long util = cpu_util(cpu);
+
+ if (!sched_feat(UTIL_EST))
+ return util;
+
+ return max(util, util_est(sa, UTIL_EST_LAST));
+}
+
static inline int task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
@@ -6007,11 +6033,19 @@ static int cpu_util_wake(int cpu, struct task_struct *p)

/* Task has no contribution or is new */
if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
- return cpu_util(cpu);
+ return cpu_util_est(cpu);

capacity = capacity_orig_of(cpu);
util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);

+ /*
+ * Estimated utilization tracks only tasks already enqueued, but still
+ * sometimes can return a bigger value than PELT, for example when the
+ * blocked load is negligible wrt the estimated utilization of the
+ * already enqueued tasks.
+ */
+ util = max_t(long, util, cpu_util_est(cpu));
+
return (util >= capacity) ? capacity : util;
}

@@ -7539,7 +7573,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);

sgs->group_load += load;
- sgs->group_util += cpu_util(i);
+ sgs->group_util += cpu_util_est(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

nr_running = rq->nr_running;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..dadae44be2ab 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,7 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(ATTACH_AGE_LOAD, true)

+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
--
2.14.1

2017-08-29 04:36:49

by Pavankumar Kondeti

Subject: Re: [RFC 1/3] sched/fair: add util_est on top of PELT

Hi Patrick,

On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
<[email protected]> wrote:
> The util_avg signal computed by PELT is too variable for some use-cases.
> For example, a big task waking up after a long sleep period will have its
> utilization almost completely decayed. This introduces some latency before
> schedutil will be able to pick the best frequency to run a task.
>
> The same issue can affect task placement. Indeed, since the task
> utilization is already decayed at wakeup, when the task is enqueued in a
> CPU, this can results in a CPU running a big task as being temporarily
> represented as being almost empty. This leads to a race condition where
> other tasks can be potentially allocated on a CPU which just started to run
> a big task which slept for a relatively long period.
>
> Moreover, the utilization of a task is, by PELT definition, a continuously
> changing metrics. This contributes in making almost instantly outdated some
> decisions based on the value of the PELT's utilization.
>
> For all these reasons, a more stable signal could probably do a better job
> of representing the expected/estimated utilization of a SE/RQ. Such a
> signal can be easily created on top of PELT by still using it as an
> estimator which produces values to be aggregated once meaningful events
> happens.
>
> This patch adds a simple implementation of util_est, a new signal built on
> top of PELT's util_avg where:
>
> util_est(se) = max(se::util_avg, f(se::util_avg@dequeue_times))
>

I don't see any wrapper function in this patch that implements this
signal. You want to use this signal in the task placement path as a
replacement of task_util(), right?

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2017-08-29 04:46:03

by Pavankumar Kondeti

Subject: Re: [RFC 2/3] sched/fair: use util_est in LB

On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
<[email protected]> wrote:
> When the scheduler looks at the CPU utlization, the current PELT value
> for a CPU is returned straight away. In certain scenarios this can have
> undesired side effects on task placement.
>

<snip>

> +/**
> + * cpu_util_est: estimated utilization for the specified CPU
> + * @cpu: the CPU to get the estimated utilization for
> + *
> + * The estimated utilization of a CPU is defined to be the maximum between its
> + * PELT's utilization and the sum of the estimated utilization of the tasks
> + * currently RUNNABLE on that CPU.
> + *
> + * This allows to properly represent the expected utilization of a CPU which
> + * has just got a big task running since a long sleep period. At the same time
> + * however it preserves the benefits of the "blocked load" in describing the
> + * potential for other tasks waking up on the same CPU.
> + *
> + * Return: the estimated utlization for the specified CPU
> + */
> +static inline unsigned long cpu_util_est(int cpu)
> +{
> + struct sched_avg *sa = &cpu_rq(cpu)->cfs.avg;
> + unsigned long util = cpu_util(cpu);
> +
> + if (!sched_feat(UTIL_EST))
> + return util;
> +
> + return max(util, util_est(sa, UTIL_EST_LAST));
> +}
> +
> static inline int task_util(struct task_struct *p)
> {
> return p->se.avg.util_avg;
> @@ -6007,11 +6033,19 @@ static int cpu_util_wake(int cpu, struct task_struct *p)
>
> /* Task has no contribution or is new */
> if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> - return cpu_util(cpu);
> + return cpu_util_est(cpu);
>
> capacity = capacity_orig_of(cpu);
> util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
>
> + /*
> + * Estimated utilization tracks only tasks already enqueued, but still
> + * sometimes can return a bigger value than PELT, for example when the
> + * blocked load is negligible wrt the estimated utilization of the
> + * already enqueued tasks.
> + */
> + util = max_t(long, util, cpu_util_est(cpu));
> +

We are supposed to discount the task's util from its CPU. But the
cpu_util_est() can potentially return cpu_util() which includes the
task's utilization.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2017-08-29 06:41:20

by Pavankumar Kondeti

[permalink] [raw]
Subject: Re: [RFC 1/3] sched/fair: add util_est on top of PELT

On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
<[email protected]> wrote:
> The util_avg signal computed by PELT is too variable for some use-cases.
> For example, a big task waking up after a long sleep period will have its
> utilization almost completely decayed. This introduces some latency before
> schedutil will be able to pick the best frequency to run a task.
>

<snip>

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c28b182c9833..8d7bc55f68d5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -26,6 +26,7 @@
> #include <linux/signal_types.h>
> #include <linux/mm_types_task.h>
> #include <linux/task_io_accounting.h>
> +#include <linux/average.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> struct audit_context;
> @@ -277,6 +278,16 @@ struct load_weight {
> u32 inv_weight;
> };
>
> +/**
> + * Utilization's Exponential Weighted Moving Average (EWMA)
> + *
> + * Support functions to track an EWMA for the utilization of SEs and RQs. New
> + * samples will be added to the moving average each time a task completes an
> + * activation. Thus the weight is chosen so that the EWMA will be relatively
> + * insensitive to transient changes to the task's workload.
> + */
> +DECLARE_EWMA(util, 0, 4);
> +
> /*

Should the factor be 1 instead of 0? i.e. a 25% contribution from the
recent sample.

Thanks,
Pavan


--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

2017-09-04 10:59:48

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC 1/3] sched/fair: add util_est on top of PELT

On 29-Aug 12:11, Pavan Kondeti wrote:
> On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
> <[email protected]> wrote:
> > The util_avg signal computed by PELT is too variable for some use-cases.
> > For example, a big task waking up after a long sleep period will have its
> > utilization almost completely decayed. This introduces some latency before
> > schedutil will be able to pick the best frequency to run a task.
> >
>
> <snip>
>
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index c28b182c9833..8d7bc55f68d5 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -26,6 +26,7 @@
> > #include <linux/signal_types.h>
> > #include <linux/mm_types_task.h>
> > #include <linux/task_io_accounting.h>
> > +#include <linux/average.h>
> >
> > /* task_struct member predeclarations (sorted alphabetically): */
> > struct audit_context;
> > @@ -277,6 +278,16 @@ struct load_weight {
> > u32 inv_weight;
> > };
> >
> > +/**
> > + * Utilization's Exponential Weighted Moving Average (EWMA)
> > + *
> > + * Support functions to track an EWMA for the utilization of SEs and RQs. New
> > + * samples will be added to the moving average each time a task completes an
> > + * activation. Thus the weight is chosen so that the EWMA will be relatively
> > + * insensitive to transient changes to the task's workload.
> > + */
> > +DECLARE_EWMA(util, 0, 4);
> > +
> > /*
>
> Should the factor be 1 instead of 0? i.e 25% contribution from the
> recent sample.

The weight of new samples is represented by the third parameter, which
assigns them a 1/4 (25%) weight.

The zero you are pointing out defines the "precision", i.e. the number
of bits used for the fractional part. In this first prototype I've
simply disregarded any fractional precision; but, considering that we
use this EWMA to aggregate utilization values, I should probably set
it to SCHED_CAPACITY_SHIFT.
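For reference, the update step generated by DECLARE_EWMA(util, PREC, 4)
can be sketched as the following standalone toy model (names and details
here are illustrative, not the ones generated by linux/average.h): each
new sample gets a 1/4 (25%) weight, and the internal value keeps PREC
fractional bits.

```c
#include <assert.h>

#define PREC		0	/* fractional bits: 0 in this RFC */
#define WEIGHT_SHIFT	2	/* sample weight = 1/4, i.e. 1 >> 2 */

/*
 * Toy model of the EWMA update: the internal value loses 1/4 of its
 * weight and the (scaled) new sample contributes the other 1/4.
 */
static unsigned long ewma_util_add(unsigned long internal, unsigned long sample)
{
	unsigned long scaled = sample << PREC;

	if (!internal)		/* first sample initializes the average */
		return scaled;

	return internal - (internal >> WEIGHT_SHIFT) + (scaled >> WEIGHT_SHIFT);
}
```

With PREC set to 0, as in this RFC, each update truncates the fractional
part, which is one reason to bump it to SCHED_CAPACITY_SHIFT.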

> Thanks,
> Pavan

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-09-04 11:13:47

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC 1/3] sched/fair: add util_est on top of PELT

On 29-Aug 10:06, Pavan Kondeti wrote:
> Hi Patrick,
>
> On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
> <[email protected]> wrote:
> > The util_avg signal computed by PELT is too variable for some use-cases.
> > For example, a big task waking up after a long sleep period will have its
> > utilization almost completely decayed. This introduces some latency before
> > schedutil will be able to pick the best frequency to run a task.
> >
> > The same issue can affect task placement. Indeed, since the task
> > utilization is already decayed at wakeup, when the task is enqueued in a
> > CPU, this can result in a CPU which is running a big task being
> > temporarily represented as almost empty. This leads to a race condition
> > where other tasks can potentially be allocated on a CPU which has just
> > started to run a big task after a relatively long sleep period.
> >
> > Moreover, the utilization of a task is, by PELT definition, a
> > continuously changing metric. This contributes to making decisions based
> > on the instantaneous value of PELT's utilization almost immediately
> > outdated.
> >
> > For all these reasons, a more stable signal could probably do a better job
> > of representing the expected/estimated utilization of a SE/RQ. Such a
> > signal can be easily created on top of PELT by still using it as an
> > estimator which produces values to be aggregated once meaningful events
> > happen.
> >
> > This patch adds a simple implementation of util_est, a new signal built on
> > top of PELT's util_avg where:
> >
> > util_est(se) = max(se::util_avg, f(se::util_avg@dequeue_times))
> >
>
> I don't see any wrapper function in this patch that implements this
> signal. You want to use this signal in the task placement path as a
> replacement of task_util(), right?

You're right, I should update this changelog, which is a bit misleading.

What I'm writing above is the way we combine a task's estimated
utilization with its util_avg. That's what you find in the code of the
following patches but, strictly speaking, we do not have a wrapper
function.
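For illustration only, a wrapper implementing that combination could
look like the following sketch; the function and argument names are
hypothetical and do not appear in the posted patches:

```c
#include <assert.h>

/*
 * Hypothetical wrapper implementing:
 *   util_est(se) = max(se::util_avg, f(se::util_avg@dequeue_times))
 * where util_est_ewma stands for the EWMA of util_avg sampled at
 * dequeue time (whatever name the final patches give it).
 */
static inline unsigned long task_util_est_sketch(unsigned long util_avg,
						 unsigned long util_est_ewma)
{
	return util_avg > util_est_ewma ? util_avg : util_est_ewma;
}
```

Taking the max lets the instantaneous PELT signal still win when the
task is visibly bigger than its history suggests, while the EWMA floors
the estimate after a long sleep.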

> Thanks,
> Pavan

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-09-04 14:18:24

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC 2/3] sched/fair: use util_est in LB

On 29-Aug 10:15, Pavan Kondeti wrote:
> On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
> <[email protected]> wrote:
> > When the scheduler looks at the CPU utilization, the current PELT value
> > for a CPU is returned straight away. In certain scenarios this can have
> > undesired side effects on task placement.
> >
>
> <snip>
>
> > +/**
> > + * cpu_util_est: estimated utilization for the specified CPU
> > + * @cpu: the CPU to get the estimated utilization for
> > + *
> > + * The estimated utilization of a CPU is defined to be the maximum between its
> > + * PELT's utilization and the sum of the estimated utilization of the tasks
> > + * currently RUNNABLE on that CPU.
> > + *
> > + * This allows us to properly represent the expected utilization of a CPU which
> > + * has just started running a big task after a long sleep period. At the same time
> > + * however it preserves the benefits of the "blocked load" in describing the
> > + * potential for other tasks waking up on the same CPU.
> > + *
> > + * Return: the estimated utilization for the specified CPU
> > + */
> > +static inline unsigned long cpu_util_est(int cpu)
> > +{
> > + struct sched_avg *sa = &cpu_rq(cpu)->cfs.avg;
> > + unsigned long util = cpu_util(cpu);
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return util;
> > +
> > + return max(util, util_est(sa, UTIL_EST_LAST));
> > +}
> > +
> > static inline int task_util(struct task_struct *p)
> > {
> > return p->se.avg.util_avg;
> > @@ -6007,11 +6033,19 @@ static int cpu_util_wake(int cpu, struct task_struct *p)
> >
> > /* Task has no contribution or is new */
> > if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> > - return cpu_util(cpu);
> > + return cpu_util_est(cpu);
> >
> > capacity = capacity_orig_of(cpu);
> > util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> >
> > + /*
> > + * Estimated utilization tracks only tasks already enqueued, but still
> > + * sometimes can return a bigger value than PELT, for example when the
> > + * blocked load is negligible wrt the estimated utilization of the
> > + * already enqueued tasks.
> > + */
> > + util = max_t(long, util, cpu_util_est(cpu));
> > +
>
> We are supposed to discount the task's util from its CPU. But the
> cpu_util_est() can potentially return cpu_util() which includes the
> task's utilization.

You're right; this instead should cover all the cases:

---8<---
static int cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ unsigned long util_est = cpu_util_est(cpu);
+ unsigned long util, capacity;

/* Task has no contribution or is new */
if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
- return cpu_util(cpu);
+ return util_est;

capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ if (cpu_util(cpu) > util_est)
+ util = max_t(long, cpu_util(cpu) - task_util(p), 0);
+ else
+ util = util_est;

return (util >= capacity) ? capacity : util;
}
---8<---

Indeed:

- if *p is the only task sleeping on that CPU, then:
(cpu_util == task_util) > (cpu_util_est == 0)
and thus we return:
(cpu_util - task_util) == 0

- if other tasks are SLEEPING on the same CPU, which however is IDLE, then:
cpu_util > (cpu_util_est == 0)
and thus we discount *p's blocked load by returning:
(cpu_util - task_util) >= 0

- if other tasks are RUNNABLE on that CPU and
(cpu_util_est > cpu_util)
then we want to use cpu_util_est, since it returns a more restrictive
estimation of the spare capacity on that CPU, by just considering
the expected utilization of tasks already runnable on that CPU.
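The three cases above can be exercised with a small standalone toy model
of the revised cpu_util_wake(), with the per-CPU signals passed in as
plain values (the capacity and utilization numbers are illustrative):

```c
#include <assert.h>

/*
 * Toy model of the revised cpu_util_wake(): when PELT's cpu_util
 * dominates (blocked load case), discount the waking task's util;
 * otherwise trust the estimated utilization of the RUNNABLE tasks.
 */
static unsigned long cpu_util_wake_toy(unsigned long cpu_util,
				       unsigned long cpu_util_est,
				       unsigned long task_util,
				       unsigned long capacity)
{
	unsigned long util;

	if (cpu_util > cpu_util_est)
		util = cpu_util > task_util ? cpu_util - task_util : 0;
	else
		util = cpu_util_est;

	return util >= capacity ? capacity : util;
}
```

Case 1 (only *p sleeping) returns 0, case 2 (other sleeping tasks,
idle CPU) returns the blocked load minus task_util, and case 3
(runnable tasks with util_est above PELT) returns util_est.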

What do you think?

> Thanks,
> Pavan

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-09-04 14:59:36

by Pavankumar Kondeti

[permalink] [raw]
Subject: Re: [RFC 2/3] sched/fair: use util_est in LB

On Mon, Sep 4, 2017 at 7:48 PM, Patrick Bellasi <[email protected]> wrote:
> On 29-Aug 10:15, Pavan Kondeti wrote:
>> On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
>> <[email protected]> wrote:
>> > When the scheduler looks at the CPU utilization, the current PELT value
>> > for a CPU is returned straight away. In certain scenarios this can have
>> > undesired side effects on task placement.
>> >
>>
>> <snip>
>>
>> > +/**
>> > + * cpu_util_est: estimated utilization for the specified CPU
>> > + * @cpu: the CPU to get the estimated utilization for
>> > + *
>> > + * The estimated utilization of a CPU is defined to be the maximum between its
>> > + * PELT's utilization and the sum of the estimated utilization of the tasks
>> > + * currently RUNNABLE on that CPU.
>> > + *
>> > + * This allows us to properly represent the expected utilization of a CPU which
>> > + * has just started running a big task after a long sleep period. At the same time
>> > + * however it preserves the benefits of the "blocked load" in describing the
>> > + * potential for other tasks waking up on the same CPU.
>> > + *
>> > + * Return: the estimated utilization for the specified CPU
>> > + */
>> > +static inline unsigned long cpu_util_est(int cpu)
>> > +{
>> > + struct sched_avg *sa = &cpu_rq(cpu)->cfs.avg;
>> > + unsigned long util = cpu_util(cpu);
>> > +
>> > + if (!sched_feat(UTIL_EST))
>> > + return util;
>> > +
>> > + return max(util, util_est(sa, UTIL_EST_LAST));
>> > +}
>> > +
>> > static inline int task_util(struct task_struct *p)
>> > {
>> > return p->se.avg.util_avg;
>> > @@ -6007,11 +6033,19 @@ static int cpu_util_wake(int cpu, struct task_struct *p)
>> >
>> > /* Task has no contribution or is new */
>> > if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
>> > - return cpu_util(cpu);
>> > + return cpu_util_est(cpu);
>> >
>> > capacity = capacity_orig_of(cpu);
>> > util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
>> >
>> > + /*
>> > + * Estimated utilization tracks only tasks already enqueued, but still
>> > + * sometimes can return a bigger value than PELT, for example when the
>> > + * blocked load is negligible wrt the estimated utilization of the
>> > + * already enqueued tasks.
>> > + */
>> > + util = max_t(long, util, cpu_util_est(cpu));
>> > +
>>
>> We are supposed to discount the task's util from its CPU. But the
>> cpu_util_est() can potentially return cpu_util() which includes the
>> task's utilization.
>
> You're right; this instead should cover all the cases:
>
> ---8<---
> static int cpu_util_wake(int cpu, struct task_struct *p)
> {
> - unsigned long util, capacity;
> + unsigned long util_est = cpu_util_est(cpu);
> + unsigned long util, capacity;
>
> /* Task has no contribution or is new */
> if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> - return cpu_util(cpu);
> + return util_est;
>
> capacity = capacity_orig_of(cpu);
> - util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> + if (cpu_util(cpu) > util_est)
> + util = max_t(long, cpu_util(cpu) - task_util(p), 0);
> + else
> + util = util_est;
>
> return (util >= capacity) ? capacity : util;
> }
> ---8<---
>
> Indeed:
>
> - if *p is the only task sleeping on that CPU, then:
> (cpu_util == task_util) > (cpu_util_est == 0)
> and thus we return:
> (cpu_util - task_util) == 0
>
> - if other tasks are SLEEPING on the same CPU, which however is IDLE, then:
> cpu_util > (cpu_util_est == 0)
> and thus we discount *p's blocked load by returning:
> (cpu_util - task_util) >= 0
>
> - if other tasks are RUNNABLE on that CPU and
> (cpu_util_est > cpu_util)
> then we want to use cpu_util_est, since it returns a more restrictive
> estimation of the spare capacity on that CPU, by just considering
> the expected utilization of tasks already runnable on that CPU.
>
> What do you think?
>

This looks good to me.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project