2018-01-23 18:11:16

by Patrick Bellasi

Subject: [PATCH v3 0/3] Utilization estimation (util_est) for FAIR tasks

Hi,

This is a respin of [1], rebased on today's tip/sched/core [2], which makes
the util_est series work on top of Juri's series [3] integrating
SCHED_DEADLINE into schedutil.

Thanks to everyone who provided feedback on the previous version; all of it
has been addressed.

Specifically, as Peter suggested, the util_est signal for sched_entities has
been moved into sched_entity::sched_avg. This way util_est now fits into a
single 64B cache line along with the util_avg signal it requires.
On the other hand, I've kept the conditional EWMA update, which Peter
suggested removing, since it turns out to save a bit of overhead compared to
not having it.

This version has been verified to add no noticeable overhead, even with
sched_feat(UTIL_EST) enabled, by using:

perf bench sched messaging --pipe --thread --group 8 --loop 50000

running 30 iterations on a 40-core machine:

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
dual socket, 10 cores (20 threads) per socket

In the previous version, with this feature enabled, the measured overhead was
in the range of ~1% on the same HW/SW test configuration.

This series still keeps the sched feature disabled by default but, given the
new performance figures, we could now consider having it always enabled, or
even just covering it with a dedicated Kconfig option.

Additional experiments [4] have been done by back-porting these patches to the
v4.4-based kernel running on a Google Pixel 2 phone. This allowed us to verify
that the proposed modifications improve PELT by either matching or
outperforming WALT [5], an out-of-tree load tracking solution currently used
by some high-end Android devices, on a representative set of interactive
workloads and industrial benchmarks.

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line
- schedutil integration using Juri's cpu_util_cfs()
- first patch dropped since it's already queued in tip/sched/core

Changes in v2:
- rebased on top of v4.15-rc2
- tested that the overhauled PELT code does not affect util_est

Cheers,
Patrick

.:: References
==============
[1] https://lkml.org/lkml/2017/12/5/634
[email protected]
[2] git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
sched/core (commit 07881166a892)
[3] https://lkml.org/lkml/2017/12/4/173
[email protected]
[4] https://gist.github.com/derkling/e087e1d028ef200f51ca356acdf68664
[5] Window Assisted Load Tracking
https://lwn.net/Articles/704903/

Patrick Bellasi (3):
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB and WU paths
sched/cpufreq_schedutil: use util_est for OPP selection

include/linux/sched.h | 16 +++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 156 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 5 ++
kernel/sched/sched.h | 8 ++-
5 files changed, 181 insertions(+), 8 deletions(-)

--
2.15.1



2018-01-23 18:10:03

by Patrick Bellasi

Subject: [PATCH v3 3/3] sched/cpufreq_schedutil: use util_est for OPP selection

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delay frequency selection.

For example, since the task utilization is decayed at wakeup time, a
long-sleeping big task that has just been enqueued does not immediately add
a significant contribution to the target CPU. This introduces some latency
before schedutil is able to detect the best frequency required by that task.

Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
when starting from a lower frequency, the utilization build-up time
increases even more and further delays the selection of the frequency
which best serves the task's requirements.

In order to reduce this kind of latency, this patch integrates the use of
the CPU's estimated utilization in the sugov_get_util() path.
This allows the expected utilization of a CPU to be properly considered
when, for example, it has just got a big task running after a long sleep
period. Ultimately this allows selecting the best frequency to run a task
right after its wake-up.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v3:
- rebase on today's tip/sched/core (commit 07881166a892)
- moved into Juri's cpu_util_cfs(), which should also
address Rafael's suggestion to use a local variable.

Changes in v2:
- rebase on top of v4.15-rc2
- tested that the overhauled PELT code does not affect util_est
---
kernel/sched/sched.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b4d9750a927..08a80ee8f89e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2128,7 +2128,12 @@ static inline unsigned long cpu_util_dl(struct rq *rq)

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
- return rq->cfs.avg.util_avg;
+ unsigned long util_est = rq->cfs.avg.util_avg;
+
+ if (sched_feat(UTIL_EST))
+ util_est = max(util_est, rq->cfs.util_est_runnable);
+
+ return util_est;
}

#endif
--
2.15.1


2018-01-23 18:10:51

by Patrick Bellasi

Subject: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the use of the CPU's estimated utilization
in cpu_util_wake() as well as in update_sg_lb_stats().

The estimated utilization of a CPU is defined to be the maximum between
its PELT utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows properly representing the spare capacity of a CPU which, for
example, has just got a big task running after a long sleep period.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)

Changes in v2:
- rebase on top of v4.15-rc2
- tested that the overhauled PELT code does not affect util_est
---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 68 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bfe94f3176e..6f2e614fd79f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6332,6 +6332,41 @@ static unsigned long cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}

+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running since a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked utilization" in
+ * describing the potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+ unsigned long util, util_est;
+ unsigned long capacity;
+ struct cfs_rq *cfs_rq;
+
+ if (!sched_feat(UTIL_EST))
+ return cpu_util(cpu);
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = cfs_rq->avg.util_avg;
+ util_est = cfs_rq->util_est_runnable;
+ util_est = max(util, util_est);
+
+ capacity = capacity_orig_of(cpu);
+ util_est = min(util_est, capacity);
+
+ return util_est;
+}
+
static inline unsigned long task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
@@ -6348,16 +6383,43 @@ static inline unsigned long task_util_est(struct task_struct *p)
*/
static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ long util, util_est;

/* Task has no contribution or is new */
if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
- return cpu_util(cpu);
+ return cpu_util_est(cpu);

- capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ /* Discount task's blocked util from CPU's util */
+ util = cpu_util(cpu) - task_util(p);
+ util = max(util, 0L);

- return (util >= capacity) ? capacity : util;
+ if (!sched_feat(UTIL_EST))
+ return util;
+
+ /*
+ * These are the main cases covered:
+ * - if *p is the only task sleeping on this CPU, then:
+ * cpu_util (== task_util) > util_est (== 0)
+ * and thus we return:
+ * cpu_util_wake = (cpu_util - task_util) = 0
+ *
+ * - if other tasks are SLEEPING on the same CPU, which is just waking
+ * up, then:
+ * cpu_util >= task_util
+ * cpu_util > util_est (== 0)
+ * and thus we discount *p's blocked utilization to return:
+ * cpu_util_wake = (cpu_util - task_util) >= 0
+ *
+ * - if other tasks are RUNNABLE on that CPU and
+ * util_est > cpu_util
+ * then we use util_est since it returns a more restrictive
+ * estimation of the spare capacity on that CPU, by just considering
+ * the expected utilization of tasks already runnable on that CPU.
+ */
+ util_est = cpu_rq(cpu)->cfs.util_est_runnable;
+ util = max(util, util_est);
+
+ return util;
}

/*
@@ -7882,7 +7944,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);

sgs->group_load += load;
- sgs->group_util += cpu_util(i);
+ sgs->group_util += cpu_util_est(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

nr_running = rq->nr_running;
--
2.15.1


2018-01-23 18:11:07

by Patrick Bellasi

Subject: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, the CPU running a big task can be temporarily represented as almost
empty. This leads to a race condition where other tasks can potentially be
allocated on a CPU which has just started to run a big task that slept for
a relatively long period.

Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))

This allows remembering how big a task has been reported to be by PELT in
its previous activations, via the function f(task::util_avg@dequeue_times).

If a task changes its behavior and runs even longer in a new activation,
after a certain time its util_est will just track the original PELT signal
(i.e. task::util_avg).

The estimated utilization of a cfs_rq is defined only for root cfs_rqs.
That's because the only sensible consumers of this signal are the scheduler
and schedutil, when looking at the overall CPU utilization due to FAIR
tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)

where:

cfs_rq::util_est_runnable = sum(util_est(task))
for each RUNNABLE task on that root cfs_rq

It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
- Tasks: to better support task placement decisions
- root cfs_rqs: to better support both task placement decisions and
frequency selection
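
For reference, here is a minimal user-space sketch of the estimator described
above (an illustration only, not the kernel code, which follows in the patch
below; the ~1% EWMA update filter and the {READ,WRITE}_ONCE() accessors are
omitted):

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT	2	/* new sample weight w = 1/4 */

struct util_est {
	unsigned int last;	/* util_avg sampled at the last dequeue */
	unsigned int ewma;	/* low-pass filtered history of those samples */
};

/* Consolidate a new sample when a task completes an activation. */
static void util_est_update(struct util_est *ue, unsigned int util_avg)
{
	unsigned int ewma = ue->ewma;

	ue->last = util_avg;
	/* ewma(t) = w * sample + (1 - w) * ewma(t-1), with w = 1/4 */
	ewma = util_avg + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
	ue->ewma = ewma >> UTIL_EST_WEIGHT_SHIFT;
}

/* util_est(task) = max(last activation sample, EWMA of past samples) */
static unsigned int task_util_est(const struct util_est *ue)
{
	return ue->ewma > ue->last ? ue->ewma : ue->last;
}

int main(void)
{
	struct util_est ue = { 0, 0 };
	unsigned int samples[] = { 400, 420, 60, 410 };
	unsigned int i;

	for (i = 0; i < 4; i++) {
		util_est_update(&ue, samples[i]);
		printf("sample=%u last=%u ewma=%u est=%u\n",
		       samples[i], ue.last, ue.ewma, task_util_est(&ue));
	}
	return 0;
}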

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line

Changes in v2:
- rebase on top of v4.15-rc2
- tested that the overhauled PELT code does not affect util_est
---
include/linux/sched.h | 16 ++++++++++
kernel/sched/debug.c | 4 +++
kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 5 +++
kernel/sched/sched.h | 1 +
5 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f7506712825c..5576c0c348e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -275,6 +275,21 @@ struct load_weight {
u32 inv_weight;
};

+/**
+ * Estimation Utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ */
+struct util_est {
+ unsigned int last;
+ unsigned int ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
/*
* The load_avg/util_avg accumulates an infinite geometric series
* (see __update_load_avg() in kernel/sched/fair.c).
@@ -336,6 +351,7 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_load_avg;
unsigned long util_avg;
+ struct util_est util_est;
};

struct sched_statistics {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..4ee8b3299982 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "util_est_runnable",
+ cfs_rq->util_est_runnable);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
+ P(se.avg.util_est.ewma);
+ P(se.avg.util_est.last);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1070803cb423..0bfe94f3176e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5193,6 +5193,20 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Update root cfs_rq's estimated utilization */
+ cfs_rq->util_est_runnable += task_util_est(p);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5245,9 +5259,70 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);

+ util_est_enqueue(p);
hrtick_update(rq);
}

+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+ unsigned long util_last = task_util(p);
+ bool sleep = flags & DEQUEUE_SLEEP;
+ unsigned long ewma;
+ long util_est = 0;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /*
+ * Update root cfs_rq's estimated utilization
+ *
+ * If *p is the last task then the root cfs_rq's estimated utilization
+ * of a CPU is 0 by definition.
+ */
+ if (cfs_rq->nr_running) {
+ util_est = READ_ONCE(cfs_rq->util_est_runnable);
+ util_est -= min_t(long, util_est, task_util_est(p));
+ }
+ WRITE_ONCE(cfs_rq->util_est_runnable, util_est);
+
+ /*
+ * Skip update of task's estimated utilization when the task has not
+ * yet completed an activation, e.g. being migrated.
+ */
+ if (!sleep)
+ return;
+
+ /*
+ * Skip update of task's estimated utilization when its EWMA is already
+ * ~1% close to its last activation value.
+ */
+ util_est = p->util_est.ewma;
+ if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
+ return;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * about the task size. This is done by storing the last PELT value
+ * for this task and using this value to load another sample in the
+ * exponential weighted moving average:
+ *
+ * ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4
+ */
+ p->se.avg.util_est.last = util_last;
+ ewma = READ_ONCE(p->se.avg.util_est.ewma);
+ ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
+ ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
+}
+
static void set_next_buddy(struct sched_entity *se);

/*
@@ -5304,6 +5379,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);

+ util_est_dequeue(p, flags);
hrtick_update(rq);
}

@@ -5767,7 +5843,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}

-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6262,6 +6337,11 @@ static inline unsigned long task_util(struct task_struct *p)
return p->se.avg.util_avg;
}

+static inline unsigned long task_util_est(struct task_struct *p)
+{
+ return max(p->se.avg.util_est.ewma, p->se.avg.util_est.last);
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..c459a4b61544 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..0b4d9750a927 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -470,6 +470,7 @@ struct cfs_rq {
* CFS load tracking
*/
struct sched_avg avg;
+ unsigned long util_est_runnable;
#ifndef CONFIG_64BIT
u64 load_last_update_time_copy;
#endif
--
2.15.1


2018-01-24 11:34:36

by Pavankumar Kondeti

Subject: Re: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

Hi Patrick,

On Tue, Jan 23, 2018 at 06:08:46PM +0000, Patrick Bellasi wrote:
> static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> {
> - unsigned long util, capacity;
> + long util, util_est;
>
> /* Task has no contribution or is new */
> if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> - return cpu_util(cpu);
> + return cpu_util_est(cpu);
>
> - capacity = capacity_orig_of(cpu);
> - util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> + /* Discount task's blocked util from CPU's util */
> + util = cpu_util(cpu) - task_util(p);
> + util = max(util, 0L);
>
> - return (util >= capacity) ? capacity : util;
> + if (!sched_feat(UTIL_EST))
> + return util;

At first, it is not clear to me why you are not clamping the utilization to
the CPU's original capacity. It looks like it is not needed any more since
the inclusion of commit f453ae2200b0 ("sched/fair: Consider RT/IRQ pressure
in capacity_spare_wake()"). Maybe a separate patch to remove the clamping
part?

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-24 16:42:06

by Joel Fernandes

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On Tue, Jan 23, 2018 at 10:08 AM, Patrick Bellasi
<[email protected]> wrote:
> The util_avg signal computed by PELT is too variable for some use-cases.
> For example, a big task waking up after a long sleep period will have its
> utilization almost completely decayed. This introduces some latency before
> schedutil will be able to pick the best frequency to run a task.
[...]
> -static inline unsigned long task_util(struct task_struct *p);
> static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
>
> static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> @@ -6262,6 +6337,11 @@ static inline unsigned long task_util(struct task_struct *p)
> return p->se.avg.util_avg;
> }
>
> +static inline unsigned long task_util_est(struct task_struct *p)
> +{
> + return max(p->se.avg.util_est.ewma, p->se.avg.util_est.last);
> +}
> +
> /*
> * cpu_util_wake: Compute cpu utilization with any contributions from
> * the waking task p removed.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 9552fd5854bf..c459a4b61544 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
> SCHED_FEAT(WA_IDLE, true)
> SCHED_FEAT(WA_WEIGHT, true)
> SCHED_FEAT(WA_BIAS, true)
> +
> +/*
> + * UtilEstimation. Use estimated CPU utilization.
> + */
> +SCHED_FEAT(UTIL_EST, false)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e95505e23c6..0b4d9750a927 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -470,6 +470,7 @@ struct cfs_rq {
> * CFS load tracking
> */
> struct sched_avg avg;
> + unsigned long util_est_runnable;

Since struct sched_avg would now have util_est, cfs_rq gets it too.
Then can we not try to reuse that struct and avoid having to expand
cfs_rq more than needed?

I went through previous conversations and couldn't find a reason, if I
missed something I appreciate if you can explain the rationale.

thanks,

- Joel

2018-01-24 19:17:02

by Patrick Bellasi

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On 24-Jan 08:40, Joel Fernandes wrote:
> On Tue, Jan 23, 2018 at 10:08 AM, Patrick Bellasi
> <[email protected]> wrote:
> > The util_avg signal computed by PELT is too variable for some use-cases.
> > For example, a big task waking up after a long sleep period will have its
> > utilization almost completely decayed. This introduces some latency before
> > schedutil will be able to pick the best frequency to run a task.
> [...]
> > -static inline unsigned long task_util(struct task_struct *p);
> > static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
> >
> > static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> > @@ -6262,6 +6337,11 @@ static inline unsigned long task_util(struct task_struct *p)
> > return p->se.avg.util_avg;
> > }
> >
> > +static inline unsigned long task_util_est(struct task_struct *p)
> > +{
> > + return max(p->se.avg.util_est.ewma, p->se.avg.util_est.last);
> > +}
> > +
> > /*
> > * cpu_util_wake: Compute cpu utilization with any contributions from
> > * the waking task p removed.
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 9552fd5854bf..c459a4b61544 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
> > SCHED_FEAT(WA_IDLE, true)
> > SCHED_FEAT(WA_WEIGHT, true)
> > SCHED_FEAT(WA_BIAS, true)
> > +
> > +/*
> > + * UtilEstimation. Use estimated CPU utilization.
> > + */
> > +SCHED_FEAT(UTIL_EST, false)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 2e95505e23c6..0b4d9750a927 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -470,6 +470,7 @@ struct cfs_rq {
> > * CFS load tracking
> > */
> > struct sched_avg avg;
> > + unsigned long util_est_runnable;
>
> Since struct sched_avg would now have util_est, cfs_rq gets it too.
> Then can we not try to reuse that struct and avoid having to expand
> cfs_rq more than needed?

Yes, that's possible now... the main issue is that for RQs we do not track
an EWMA, but we can still use the util_est::last field, or maybe use a
union just to have a better name when used from the RQ side.
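
For example something along these lines (just a sketch of the idea; the
'enqueued' name is hypothetical):

struct util_est {
	union {
		unsigned int last;	/* task side: util_avg at last dequeue */
		unsigned int enqueued;	/* rq side: sum of runnable estimates */
	};
	unsigned int ewma;		/* only meaningful for tasks */
};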

> I went through previous conversations and couldn't find a reason, if I
> missed something I appreciate if you can explain the rationale.

I've used a separate field just because the SE's util_est was not part of
sched_avg, and missed the opportunity to consolidate this better now that
we moved it. Thanks for pointing this out ;-)

>
> thanks,
>
> - Joel

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-01-24 19:32:26

by Patrick Bellasi

Subject: Re: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

On 24-Jan 17:03, Pavan Kondeti wrote:
> Hi Patrick,

Hi Pavan,


> On Tue, Jan 23, 2018 at 06:08:46PM +0000, Patrick Bellasi wrote:
> > static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> > {
> > - unsigned long util, capacity;
> > + long util, util_est;
> >
> > /* Task has no contribution or is new */
> > if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> > - return cpu_util(cpu);
> > + return cpu_util_est(cpu);
> >
> > - capacity = capacity_orig_of(cpu);
> > - util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> > + /* Discount task's blocked util from CPU's util */
> > + util = cpu_util(cpu) - task_util(p);
> > + util = max(util, 0L);
> >
> > - return (util >= capacity) ? capacity : util;
> > + if (!sched_feat(UTIL_EST))
> > + return util;
>
> At first, It is not clear to me why you are not clamping the capacity to
> CPU original capacity. It looks like it is not needed any more with
> commit f453ae2200b0 ("sched/fair: Consider RT/IRQ pressure in
> capacity_spare_wake()") inclusion.

Mainly because the above code now uses only cpu_util() which is already clamped
by capacity_orig_of().

However, you made me notice that in the few lines which follow, where I do:

> > + /*
> > + * These are the main cases covered:
> > + * - if *p is the only task sleeping on this CPU, then:
> > + * cpu_util (== task_util) > util_est (== 0)
> > + * and thus we return:
> > + * cpu_util_wake = (cpu_util - task_util) = 0
> > + *
> > + * - if other tasks are SLEEPING on the same CPU, which is just waking
> > + * up, then:
> > + * cpu_util >= task_util
> > + * cpu_util > util_est (== 0)
> > + * and thus we discount *p's blocked utilization to return:
> > + * cpu_util_wake = (cpu_util - task_util) >= 0
> > + *
> > + * - if other tasks are RUNNABLE on that CPU and
> > + * util_est > cpu_util
> > + * then we use util_est since it returns a more restrictive
> > + * estimation of the spare capacity on that CPU, by just considering
> > + * the expected utilization of tasks already runnable on that CPU.
> > + */
> > + util_est = cpu_rq(cpu)->cfs.util_est_runnable;
> > + util = max(util, util_est);
> > +
> > + return util;

I should instead clamp util before returning it! ;-)

> May be a separate patch to remove the clamping part?

No, I think we should keep cpu_util_wake clamped to not affect the existing
call sites. I just need to remove it where not needed (done) and add it where
needed (will do on the next iteration).

> Thanks,
> Pavan

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-01-24 22:07:12

by Joel Fernandes

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On Wed, Jan 24, 2018 at 11:16 AM, Patrick Bellasi
<[email protected]> wrote:
> On 24-Jan 08:40, Joel Fernandes wrote:
>> On Tue, Jan 23, 2018 at 10:08 AM, Patrick Bellasi
>> <[email protected]> wrote:
>> > The util_avg signal computed by PELT is too variable for some use-cases.
>> > For example, a big task waking up after a long sleep period will have its
>> > utilization almost completely decayed. This introduces some latency before
>> > schedutil will be able to pick the best frequency to run a task.
>> [...]
>> > -static inline unsigned long task_util(struct task_struct *p);
>> > static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
>> >
>> > static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>> > @@ -6262,6 +6337,11 @@ static inline unsigned long task_util(struct task_struct *p)
>> > return p->se.avg.util_avg;
>> > }
>> >
>> > +static inline unsigned long task_util_est(struct task_struct *p)
>> > +{
>> > + return max(p->se.avg.util_est.ewma, p->se.avg.util_est.last);
>> > +}
>> > +
>> > /*
>> > * cpu_util_wake: Compute cpu utilization with any contributions from
>> > * the waking task p removed.
>> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> > index 9552fd5854bf..c459a4b61544 100644
>> > --- a/kernel/sched/features.h
>> > +++ b/kernel/sched/features.h
>> > @@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
>> > SCHED_FEAT(WA_IDLE, true)
>> > SCHED_FEAT(WA_WEIGHT, true)
>> > SCHED_FEAT(WA_BIAS, true)
>> > +
>> > +/*
>> > + * UtilEstimation. Use estimated CPU utilization.
>> > + */
>> > +SCHED_FEAT(UTIL_EST, false)
>> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> > index 2e95505e23c6..0b4d9750a927 100644
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -470,6 +470,7 @@ struct cfs_rq {
>> > * CFS load tracking
>> > */
>> > struct sched_avg avg;
>> > + unsigned long util_est_runnable;
>>
>> Since struct sched_avg would now have util_est, cfs_rq gets it too.
>> Then can we not try to reuse that struct and avoid having to expand
>> cfs_rq more than needed?
>
> Yes, that's possible now... the main issue is that for RQ's we do not
> track an EWMA, but still we can use the util_est::last field or maybe
> use a union just to use a better name when used from the RQ side.

Yes, I think it's good to make use of the space we're adding in cfs_rq
that's not used for other things.

>
>> I went through previous conversations and couldn't find a reason, if I
>> missed something I appreciate if you can explain the rationale.
>
> I've used a separate filed just because SE's util_est was not part of
> sched_avg, and missed the opportunity to consolidate better this now
> that we moved it. Thanks for pointing this out ;-)

You're welcome :)

thanks,

- Joel

2018-01-25 14:41:06

by Pavankumar Kondeti

Subject: Re: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

On Wed, Jan 24, 2018 at 07:31:38PM +0000, Patrick Bellasi wrote:
>
> > > + /*
> > > + * These are the main cases covered:
> > > + * - if *p is the only task sleeping on this CPU, then:
> > > + * cpu_util (== task_util) > util_est (== 0)
> > > + * and thus we return:
> > > + * cpu_util_wake = (cpu_util - task_util) = 0
> > > + *
> > > + * - if other tasks are SLEEPING on the same CPU, which is just waking
> > > + * up, then:
> > > + * cpu_util >= task_util
> > > + * cpu_util > util_est (== 0)
> > > + * and thus we discount *p's blocked utilization to return:
> > > + * cpu_util_wake = (cpu_util - task_util) >= 0
> > > + *
> > > + * - if other tasks are RUNNABLE on that CPU and
> > > + * util_est > cpu_util
> > > + * then we use util_est since it returns a more restrictive
> > > + * estimation of the spare capacity on that CPU, by just considering
> > > + * the expected utilization of tasks already runnable on that CPU.
> > > + */
> > > + util_est = cpu_rq(cpu)->cfs.util_est_runnable;
> > > + util = max(util, util_est);
> > > +
> > > + return util;
>
> I should instead clamp util before returning it! ;-)
>
> > May be a separate patch to remove the clamping part?
>
> No, I think we should keep cpu_util_wake clamped to not affect the existing
> call sites. I just need to remove it where not needed (done) and add it where
> needed (will do on the next iteration).

cpu_util_wake() is called only from capacity_spare_wake(). There are no other
callsites. The capacity_spare_wake() is clamping the return value of
cpu_util_wake() to CPU capacity. The clamping is not needed, I think.

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-29 16:38:13

by Peter Zijlstra

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On Tue, Jan 23, 2018 at 06:08:45PM +0000, Patrick Bellasi wrote:
> +static inline void util_est_dequeue(struct task_struct *p, int flags)
> +{
> + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> + unsigned long util_last = task_util(p);
> + bool sleep = flags & DEQUEUE_SLEEP;
> + unsigned long ewma;
> + long util_est = 0;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> + * Update root cfs_rq's estimated utilization
> + *
> + * If *p is the last task then the root cfs_rq's estimated utilization
> + * of a CPU is 0 by definition.
> + */
> + if (cfs_rq->nr_running) {
> + util_est = READ_ONCE(cfs_rq->util_est_runnable);
> + util_est -= min_t(long, util_est, task_util_est(p));
> + }
> + WRITE_ONCE(cfs_rq->util_est_runnable, util_est);
> +
> + /*
> + * Skip update of task's estimated utilization when the task has not
> + * yet completed an activation, e.g. being migrated.
> + */
> + if (!sleep)
> + return;
> +

Since you only use sleep once, you might as well write it out there.

Also, does GCC lower the task_util() eval to here?

> + /*
> + * Skip update of task's estimated utilization when its EWMA is already
> + * ~1% close to its last activation value.
> + */
> + util_est = p->util_est.ewma;
> + if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> + return;

Aside from that being whitespace challenged, did you also try:

if ((unsigned)((util_est - util_last) + LIM - 1) < (2 * LIM - 1))

Also, since we only care about the absolute value, we could use:

util_last - ewma

here (note the above also forgets to use READ_ONCE), and reuse the result:

> +
> + /*
> + * Update Task's estimated utilization
> + *
> + * When *p completes an activation we can consolidate another sample
> + * about the task size. This is done by storing the last PELT value
> + * for this task and using this value to load another sample in the
> + * exponential weighted moving average:
> + *
> + * ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
> + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> + * = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> + *
> + * Where 'w' is the weight of new samples, which is configured to be
> + * 0.25, thus making w=1/4
> + */
> + p->se.avg.util_est.last = util_last;
> + ewma = READ_ONCE(p->se.avg.util_est.ewma);
> + ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;

here.

> + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> + WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
> +}


So something along these lines:

ewma = READ_ONCE(p->se.avg.util_est.ewma);
diff = util_last - ewma;
if ((unsigned)(diff + LIM - 1) < (2 * LIM - 1))
return;

p->se.avg.util_est.last = util_last;
ewma = (diff + (ewma << EWMA_SHIFT)) >> EWMA_SHIFT;
WRITE_ONCE(p->se.avg.util_est.ewma, ewma);

Make sense?

2018-01-30 12:47:31

by Patrick Bellasi

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On 29-Jan 17:36, Peter Zijlstra wrote:
> On Tue, Jan 23, 2018 at 06:08:45PM +0000, Patrick Bellasi wrote:
> > +static inline void util_est_dequeue(struct task_struct *p, int flags)
> > +{
> > + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> > + unsigned long util_last = task_util(p);
> > + bool sleep = flags & DEQUEUE_SLEEP;
> > + unsigned long ewma;
> > + long util_est = 0;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /*
> > + * Update root cfs_rq's estimated utilization
> > + *
> > + * If *p is the last task then the root cfs_rq's estimated utilization
> > + * of a CPU is 0 by definition.
> > + */
> > + if (cfs_rq->nr_running) {
> > + util_est = READ_ONCE(cfs_rq->util_est_runnable);
> > + util_est -= min_t(long, util_est, task_util_est(p));
> > + }
> > + WRITE_ONCE(cfs_rq->util_est_runnable, util_est);
> > +
> > + /*
> > + * Skip update of task's estimated utilization when the task has not
> > + * yet completed an activation, e.g. being migrated.
> > + */
> > + if (!sleep)
> > + return;
> > +
>
> Since you only use sleep once, you might as well write it out there.

Right, will move the flag check right here.

> Also, does GCC lower the task_util() eval to here?

Good point, kind-of... on ARM64 it generates a register load just before
the above if condition. I guess it does that to speculatively trigger a
load from memory in case the above check should pass?

Anyway, it looks like it can also be a micro-arch optimization.
Thus, even just for better readability of the following chunk, it's
better to move the util_last definition here.

> > + /*
> > + * Skip update of task's estimated utilization when its EWMA is already
> > + * ~1% close to its last activation value.
> > + */
> > + util_est = p->util_est.ewma;
> > + if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > + return;
>
> Aside from that being whitespace challenged, did you also try:
>
> if ((unsigned)((util_est - util_last) + LIM - 1) < (2 * LIM - 1))

No, since the above code is IMO much easier to "parse for humans" :)
But mainly because, since the cache alignment update, I cannot see
regressions on hackbench even while testing on a "big" Intel machine.

This is the code I get on my Xeon E5-2690 v2:

if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
6ba0: 8b 86 7c 02 00 00 mov 0x27c(%rsi),%eax
6ba6: 48 29 c8 sub %rcx,%rax
6ba9: 48 99 cqto
6bab: 48 31 d0 xor %rdx,%rax
6bae: 48 29 d0 sub %rdx,%rax
6bb1: 48 83 f8 0a cmp $0xa,%rax
6bb5: 7e 1d jle 6bd4 <dequeue_task_fair+0x7e4>

Does it look so bad?

> Also, since we only care about the absolute value; we could use:
>
> util_last - ewma
>
> here (note the above also forgets to use READ_ONCE), and reuse the result:
>
> > +
> > + /*
> > + * Update Task's estimated utilization
> > + *
> > + * When *p completes an activation we can consolidate another sample
> > + * about the task size. This is done by storing the last PELT value
> > + * for this task and using this value to load another sample in the
> > + * exponential weighted moving average:
> > + *
> > + * ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
> > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > + * = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> > + *
> > + * Where 'w' is the weight of new samples, which is configured to be
> > + * 0.25, thus making w=1/4
> > + */
> > + p->se.avg.util_est.last = util_last;
> > + ewma = READ_ONCE(p->se.avg.util_est.ewma);
> > + ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
>
> here.

Right! +1

>
> > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > + WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
> > +}
>
> So something along these lines:
>
> ewma = READ_ONCE(p->se.avg.util_est.ewma);
> diff = util_last - ewma;
> if ((unsigned)(diff + LIM - 1) < (2 * LIM - 1))
> return;
>
> p->se.avg.util_est.last = util_last;
> ewma = (diff + (ewma << EWMA_SHIFT)) >> EWMA_SHIFT;
> WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
>
> Make sense?

Looks ok to me, I will for sure update to reuse the difference.

Regarding the comparison, I'll try your formula to check again if there is any
noticeable difference on hackbench.

Thanks,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-01-30 13:05:55

by Peter Zijlstra

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On Tue, Jan 30, 2018 at 12:46:33PM +0000, Patrick Bellasi wrote:
> > Aside from that being whitespace challenged, did you also try:
> >
> > if ((unsigned)((util_est - util_last) + LIM - 1) < (2 * LIM - 1))
>
> No, since the above code IMO is so much "easy to parse for humans" :)

Heh, true. Although that's fixable by wrapping it in some helper with a
comment.

> But, mainly because since the cache alignment update, also while testing on a
> "big" Intel machine I cannot see regressions on hackbench.
>
> This is the code I get on my Xeon E5-2690 v2:
>
> if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> 6ba0: 8b 86 7c 02 00 00 mov 0x27c(%rsi),%eax
> 6ba6: 48 29 c8 sub %rcx,%rax
> 6ba9: 48 99 cqto
> 6bab: 48 31 d0 xor %rdx,%rax
> 6bae: 48 29 d0 sub %rdx,%rax
> 6bb1: 48 83 f8 0a cmp $0xa,%rax
> 6bb5: 7e 1d jle 6bd4 <dequeue_task_fair+0x7e4>
>
> Does it look so bad?

It's not terrible, and I think your GCC is far more clever than the one I
used at the time. But that's 4 dependent instructions (cqto, xor, sub, cmp)
whereas the one I proposed uses only 2 (add, cmp).

Now, my proposal is, as you say, somewhat hard to read, and it also
doesn't work right when our values are 'big' (which they will not be in
our case, because util has a very definite bound), and I suspect you're
right that ~2 cycles here will not be measurable.

So yeah.... whatever ;-)

2018-01-30 14:25:30

by Peter Zijlstra

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On Tue, Jan 30, 2018 at 02:04:32PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:46:33PM +0000, Patrick Bellasi wrote:
> > > Aside from that being whitespace challenged, did you also try:
> > >
> > > if ((unsigned)((util_est - util_last) + LIM - 1) < (2 * LIM - 1))
> >
> > No, since the above code IMO is so much "easy to parse for humans" :)
>
> Heh, true. Although that's fixable by wrapping it in some helper with a
> comment.
>
> > But, mainly because since the cache alignment update, also while testing on a
> > "big" Intel machine I cannot see regressions on hackbench.
> >
> > This is the code I get on my Xeon E5-2690 v2:
> >
> > if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > 6ba0: 8b 86 7c 02 00 00 mov 0x27c(%rsi),%eax
> > 6ba6: 48 29 c8 sub %rcx,%rax
> > 6ba9: 48 99 cqto
> > 6bab: 48 31 d0 xor %rdx,%rax
> > 6bae: 48 29 d0 sub %rdx,%rax
> > 6bb1: 48 83 f8 0a cmp $0xa,%rax
> > 6bb5: 7e 1d jle 6bd4 <dequeue_task_fair+0x7e4>
> >
> > Does it look so bad?
>
> Its not terrible, and I think your GCC is far more clever than the one I

To clarify: my GCC at the time generated conditional branches to compute
the absolute value, and in that case the thing I proposed wins hands
down because it's unconditional.

However the above is also unconditional, so the difference is much
less important.

> used at the time. But that's 4 dependent instructions (cqto,xor,sub,cmp)
> whereas the one I proposed uses only 2 (add,cmp).
>
> Now, my proposal is, as you say, somewhat hard to read, and it also
> doesn't work right when our values are 'big' (which they will not be in
> our case, because util has a very definite bound), and I suspect you're
> right that ~2 cycles here will not be measurable.
>
> So yeah.... whatever ;-)

2018-01-31 15:33:21

by Patrick Bellasi

Subject: Re: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

On 25-Jan 20:03, Pavan Kondeti wrote:
> On Wed, Jan 24, 2018 at 07:31:38PM +0000, Patrick Bellasi wrote:
> >
> > > > + /*
> > > > + * These are the main cases covered:
> > > > + * - if *p is the only task sleeping on this CPU, then:
> > > > + * cpu_util (== task_util) > util_est (== 0)
> > > > + * and thus we return:
> > > > + * cpu_util_wake = (cpu_util - task_util) = 0
> > > > + *
> > > > + * - if other tasks are SLEEPING on the same CPU, which is just waking
> > > > + * up, then:
> > > > + * cpu_util >= task_util
> > > > + * cpu_util > util_est (== 0)
> > > > + * and thus we discount *p's blocked utilization to return:
> > > > + * cpu_util_wake = (cpu_util - task_util) >= 0
> > > > + *
> > > > + * - if other tasks are RUNNABLE on that CPU and
> > > > + * util_est > cpu_util
> > > > + * then we use util_est since it returns a more restrictive
> > > > + * estimation of the spare capacity on that CPU, by just considering
> > > > + * the expected utilization of tasks already runnable on that CPU.
> > > > + */
> > > > + util_est = cpu_rq(cpu)->cfs.util_est_runnable;
> > > > + util = max(util, util_est);
> > > > +
> > > > + return util;
> >
> > I should instead clamp util before returning it! ;-)
> >
> > > May be a separate patch to remove the clamping part?
> >
> > No, I think we should keep cpu_util_wake clamped to not affect the existing
> > call sites. I just need to remove it where not needed (done) and add it where
> > needed (will do on the next iteration).
>
> cpu_util_wake() is called only from capacity_spare_wake(). There are no other
> callsites.

True...

> The capacity_spare_wake() is clamping the return value of
> cpu_util_wake() to CPU capacity.

... actually it's clamping negative numbers with:

max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);

thus, by having cpu_util_wake() potentially return a value bigger than
capacity_of() or capacity_orig_of(), we should not have issues from a
capacity_spare_wake() usage standpoint.

> The clamping is not needed, I think.

However, we can still argue that cpu_util_wake() should never return
something bigger than the maximum possible capacity of a CPU.
At least, that has been the behavior so far.

Thus, even just for the sake of consistency with the previous return
paths (e.g. when we bail out returning cpu_util), I would say that
it's worth maintaining these semantics.

With clamping, all these functions:
- cpu_util
- cpu_util_est
- cpu_util_wake
will always return a signal which is never bigger than the maximum
possible CPU capacity.
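
For example, the tail of cpu_util_wake() could then be something like this
(just a sketch of what I have in mind for the next iteration, not the
posted code):

	util_est = cpu_rq(cpu)->cfs.util_est_runnable;
	util = max(util, util_est);

	/* Never return more than the (original) capacity of the CPU */
	return min_t(long, util, capacity_orig_of(cpu));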

--
#include <best/regards.h>

Patrick Bellasi

2018-02-05 17:50:30

by Patrick Bellasi

Subject: Re: [PATCH v3 1/3] sched/fair: add util_est on top of PELT

On 30-Jan 15:01, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 02:04:32PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 30, 2018 at 12:46:33PM +0000, Patrick Bellasi wrote:
> > > > Aside from that being whitespace challenged, did you also try:
> > > >
> > > > if ((unsigned)((util_est - util_last) + LIM - 1) < (2 * LIM - 1))
> > >
> > > No, since the above code IMO is so much "easy to parse for humans" :)
> >
> > Heh, true. Although that's fixable by wrapping it in some helper with a
> > comment.
> >
> > > But, mainly because since the cache alignment update, also while testing on a
> > > "big" Intel machine I cannot see regressions on hackbench.
> > >
> > > This is the code I get on my Xeon E5-2690 v2:
> > >
> > > if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > > 6ba0: 8b 86 7c 02 00 00 mov 0x27c(%rsi),%eax
> > > 6ba6: 48 29 c8 sub %rcx,%rax
> > > 6ba9: 48 99 cqto
> > > 6bab: 48 31 d0 xor %rdx,%rax
> > > 6bae: 48 29 d0 sub %rdx,%rax
> > > 6bb1: 48 83 f8 0a cmp $0xa,%rax
> > > 6bb5: 7e 1d jle 6bd4 <dequeue_task_fair+0x7e4>
> > >
> > > Does it look so bad?
> >
> > Its not terrible, and I think your GCC is far more clever than the one I
>
> To clarify; my GCC at the time generated conditional branches to compute
> the absolute value; and in that case the thing I proposed wins hands
> down because its unconditional.
>
> However the above is also unconditional and then the difference is much
> less important.

I've finally convinced myself that we can live with the "parsing
complexity" of your proposal... and, wrapped into an inline helper, it
turns out to be not so bad.

> > used at the time. But that's 4 dependent instructions (cqto,xor,sub,cmp)
> > whereas the one I proposed uses only 2 (add,cmp).

The ARM64 generated code is also simpler.

> > Now, my proposal is, as you say, somewhat hard to read, and it also
> > doesn't work right when our values are 'big' (which they will not be in
> > our case, because util has a very definite bound), and I suspect you're
> > right that ~2 cycles here will not be measurable.

Indeed, I cannot see noticeable differences, if anything just a slight
improvement...

> >
> > So yeah.... whatever ;-)

... I'm going to post a v4 using your proposal ;-)
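
FWIW, wrapped into a helper it could look something like this (a sketch for
v4; name and types are illustrative):

/*
 * within_margin: check whether a value is within a given margin of zero,
 * i.e. -margin < value < margin, using a single unsigned compare.
 */
static inline bool within_margin(long value, long margin)
{
	return ((unsigned long)(value + margin - 1) < (2 * margin - 1));
}

	/* ... and then, at dequeue time ... */
	ewma = READ_ONCE(p->se.avg.util_est.ewma);
	if (within_margin((long)util_last - (long)ewma,
			  SCHED_CAPACITY_SCALE / 100))
		return;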

Thanks,
Patrick

--
#include <best/regards.h>

Patrick Bellasi