2018-02-22 17:03:28

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v5 0/4] Utilization estimation (util_est) for FAIR tasks

Hi, here is an update of [1], based on today's tip/sched/core [2], which
targets two main things:

1) Add the required/missing {READ,WRITE}_ONCE compiler barriers

AFAIU, we wanted those barriers for lock-less synchronization between:
- enqueue/dequeue calls: which are serialized by the RQ lock but where we
read/modify util_est signals which can be _read concurrently_ from other
code paths.
- load balancer related functions: which are not serialized by the RQ lock
but read util_est signals updated by the enqueue/dequeue calls.

However, after noticing this commit:

7bd3e239d6c6 locking: Remove atomicy checks from {READ,WRITE}_ONCE

I'm still a bit confused about the real need for these calls, mainly because:
a) they are not the proper mechanism to guarantee atomic loads/stores, for
example when we need to access u64 values while running on a 32bit
target.
b) apart from possible load/store tearing issues, I was not able to see
other scenarios, among the ones described in [3], which potentially
apply to the code of these patches.

Thus, to avoid load/store tearing, in principle I would have used:
- WRITE_ONCE only in RQ-lock serialized code,
where we read/modify util_est signals, in order to properly publish them to
concurrently running load balancer code.
- READ_ONCE only in _non_ RQ-lock serialized code,
where we only read util_est signals.

To my understanding this should be just good enough also to document the
concurrent access to these shared variables, while still allowing the compiler
to optimize some loads in the RQ-lock serialized code.

All that considered, my last question on this point is: can we remove the
READ_ONCE()s from RQ-lock serialized code?

2) Ensure the feature can be safely turned on by default

Estimated utilization of tasks and RQs is a feature which mainly benefits
lightly utilized systems, where tasks can sleep for relatively long times
but we still want to ramp up the OPPs quickly once they wake up.

However, since Peter proposed to have this scheduler feature turned on by
default, we spent a bit more time focusing on hackbench to verify it
will not hurt server/HPC classes of workloads.

The main discovery has been that, if we properly configure hackbench
to have a high rate of enqueue/dequeue events, then despite the few
instructions util_est adds, its overhead starts to become more noticeable.
That's the reason for the last patch we added to this series, whose changelog
should be good enough to describe the issue and the proposed solution.

Experiments including this patch have been run on a dual socket
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, using precisely this
configuration:

- cpusets to isolate a single socket for the execution of hackbench on
just 10 of the available 20 cores.
This allows us to avoid NUMA load balancer side effects, which we noticed
affect the variance across multiple experiments quite a lot.

- CPUFreq powersave policy, with the intel_pstate driver configured in
passive mode and scaling_max_freq set equal to scaling_min_freq.
This allows us to rule out thermal and/or turbo boost side effects.

- hackbench configured to run 120 iterations of:

perf bench sched messaging --pipe --thread --group 8 --loop 5000

which, in the above setup, corresponds to ~11s completion time for each
iteration using 320 tasks.

In the above setup, this configuration seems to maximize the rate of
wakeup/sleep events, thus better stressing the enqueue/dequeue code paths.
Here are the stats we collected for the completion times:

        count   mean       std       min     50%     95%       99%       max
before  120.0   11.010342  0.375753  10.104  11.046  11.54155  11.69629  11.751
after   120.0   11.041117  0.375429  10.015  11.072  11.59070  11.67720  11.692

after vs before: +0.3% on mean
after vs before: -0.2% on 99% percentile

Results on ARM (Android) devices have been collected and reported in a previous
posting [4] and they showed negligible overhead compared to the corresponding
power/performance benefits.

Changes in v5:
- rebased on today's tip/sched/core (commit 083c6eeab2cc, based on v4.16-rc2)
- update util_est only on util_avg updates
- add documentation for "struct util_est"
- always use int instead of long whenever possible (Peter)
- pass cfs_rq to util_est_{en,de}queue (Peter)
- pass task_sleep to util_est_dequeue
- use single WRITE_ONCE at dequeue time
- add some missing {READ,WRITE}_ONCE
- add task_util_est() for code consistency

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)
- ensure cpu_util_wake() is clamped to cpu_capacity_orig() (Pavan)
- simplify cpu_util_cfs() integration (Dietmar)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line
- schedutil integration using Juri's cpu_util_cfs()
- first patch dropped since it's already queued in tip/sched/core

Changes in v2:
- rebased on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est

Cheers Patrick

.:: References
==============
[1] https://lkml.org/lkml/2018/2/6/356
[email protected]
[2] git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
sched/core (commit 083c6eeab2cc)
[3] Documentation/memory-barriers.txt (Line 1508)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/memory-barriers.txt?h=v4.16-rc2#n1508
[4] https://lkml.org/lkml/2018/1/23/645
[email protected]

Patrick Bellasi (4):
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB and WU paths
sched/cpufreq_schedutil: use util_est for OPP selection
sched/fair: update util_est only on util_avg updates

include/linux/sched.h | 29 +++++++
kernel/sched/debug.c | 4 +
kernel/sched/fair.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 5 ++
kernel/sched/sched.h | 7 +-
5 files changed, 259 insertions(+), 7 deletions(-)

--
2.15.1



2018-02-22 17:03:31

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, a CPU running a big task can be temporarily represented as being
almost empty. This leads to a race condition where other tasks can
potentially be allocated to a CPU which has just started to run a big
task which slept for a relatively long period.

Moreover, the PELT utilization of a task can be updated every millisecond,
thus making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))

This allows us to remember how big a task has been reported to be by PELT
in its previous activations, via the function f(task::util_avg@dequeue_times).

If a task changes its behavior and runs even longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).
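
Purely as an illustration of how f() behaves (in this series it is an EWMA
with weight w = 1/4, i.e. UTIL_EST_WEIGHT_SHIFT = 2 in the patch below), the
following standalone userspace sketch, with made-up numbers, shows the
estimate converging across activations once a task starts reporting a bigger
util_avg at dequeue time:

        /* Standalone userspace sketch, not patch code: made-up numbers */
        #include <stdio.h>

        #define UTIL_EST_WEIGHT_SHIFT  2        /* w = 1/4, as in the patch below */

        int main(void)
        {
                unsigned int ewma = 100;        /* assumed previous estimate */
                unsigned int dequeued = 700;    /* util_avg now seen at dequeue */
                int i;

                for (i = 0; i < 8; i++) {
                        int last_ewma_diff = dequeued - ewma;

                        /* ewma(t) = w * last_ewma_diff + ewma(t-1), with w = 1/4 */
                        ewma = ((ewma << UTIL_EST_WEIGHT_SHIFT) + last_ewma_diff)
                                >> UTIL_EST_WEIGHT_SHIFT;
                        printf("activation %d: ewma = %u\n", i + 1, ewma);
                }
                return 0;
        }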

The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil, when looking for the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)

where:

cfs_rq::util_est_runnable = sum(util_est(task))
for each RUNNABLE task on that root cfs_rq

It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
- Tasks: to better support task placement decisions
- root cfs_rqs: to better support both task placement decisions as
well as frequency selection

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v5:
- add documentation for "struct util_est"
- always use int instead of long whenever possible (Peter)
- pass cfs_rq to util_est_{en,de}queue (Peter)
- pass task_sleep to util_est_dequeue
- use single WRITE_ONCE at dequeue time
- add some other missing {READ,WRITE}_ONCE
- add task_util_est() for code consistency

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line

Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
include/linux/sched.h | 29 +++++++++++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 5 +++
4 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a902e..14b1a6ce6298 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -275,6 +275,34 @@ struct load_weight {
u32 inv_weight;
};

+/**
+ * Estimation Utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ *
+ * @enqueued: instantaneous estimated utilization of a task/cpu
+ * task: the task's util_avg at last task dequeue time
+ * cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
+ *
+ * Thus, the util_est.enqueued of a task represents the contribution on the
+ * estimated utilization of the CPU where that task is currently enqueued.
+ *
+ * @ewma: the Exponential Weighted Moving Average (EWMA) utilization of a task
+ * Only for tasks we track a moving average of the past instantaneous
+ * estimated utilization. This allows to absorb sporadic drops in
+ * utilization of an otherwise almost periodic task.
+ *
+ */
+struct util_est {
+ unsigned int enqueued;
+ unsigned int ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
/*
* The load_avg/util_avg accumulates an infinite geometric series
* (see __update_load_avg() in kernel/sched/fair.c).
@@ -336,6 +364,7 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_load_avg;
unsigned long util_avg;
+ struct util_est util_est;
};

struct sched_statistics {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..d4eb5532ea6b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
+ cfs_rq->avg.util_est.enqueued);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
+ P(se.avg.util_est.ewma);
+ P(se.avg.util_est.enqueued);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e1febd252a84..c8526687f107 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5205,6 +5205,23 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long _task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
+ struct task_struct *p)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Update root cfs_rq's estimated utilization */
+ enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ enqueued += _task_util_est(p);
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5257,9 +5274,86 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);

+ util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
}

+/*
+ * Check if a (signed) value is within a specified (unsigned) margin,
+ * based on the observation that:
+ * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
+ *
+ * NOTE: this only works when value + margin < INT_MAX.
+ */
+static inline bool within_margin(int value, int margin)
+{
+ return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
+}
+
+static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
+ struct task_struct *p,
+ bool task_sleep)
+{
+ long last_ewma_diff;
+ struct util_est ue;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /*
+ * Update root cfs_rq's estimated utilization
+ *
+ * If *p is the last task then the root cfs_rq's estimated utilization
+ * of a CPU is 0 by definition.
+ */
+ ue.enqueued = 0;
+ if (cfs_rq->nr_running) {
+ ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ ue.enqueued -= min_t(unsigned int, ue.enqueued,
+ _task_util_est(p));
+ }
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
+
+ /*
+ * Skip update of task's estimated utilization when the task has not
+ * yet completed an activation, e.g. being migrated.
+ */
+ if (!task_sleep)
+ return;
+
+ /*
+ * Skip update of task's estimated utilization when its EWMA is
+ * already ~1% close to its last activation value.
+ */
+ ue = READ_ONCE(p->se.avg.util_est);
+ ue.enqueued = task_util(p);
+ last_ewma_diff = ue.enqueued - ue.ewma;
+ if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
+ return;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * of the task size. This is done by storing the current PELT value
+ * as ue.enqueued and by using this value to update the Exponential
+ * Weighted Moving Average (EWMA):
+ *
+ * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
+ * = w * ( last_ewma_diff ) + ewma(t-1)
+ * = w * (last_ewma_diff + ewma(t-1) / w)
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+ */
+ ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ ue.ewma += last_ewma_diff;
+ ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ WRITE_ONCE(p->se.avg.util_est, ue);
+}
+
static void set_next_buddy(struct sched_entity *se);

/*
@@ -5316,6 +5410,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);

+ util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}

@@ -5834,7 +5929,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return target;
}

-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6356,6 +6450,19 @@ static inline unsigned long task_util(struct task_struct *p)
return p->se.avg.util_avg;
}

+
+static inline unsigned long _task_util_est(struct task_struct *p)
+{
+ struct util_est ue = READ_ONCE(p->se.avg.util_est);
+
+ return max(ue.ewma, ue.enqueued);
+}
+
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+ return max(READ_ONCE(p->se.avg.util_avg), _task_util_est(p));
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..c459a4b61544 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
--
2.15.1


2018-02-22 17:03:35

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates

The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at maximum once every 1ms.

Thus, for really short running tasks, it can happen that their util_avg
value has not been updated since their last enqueue. If such tasks are
also frequently running tasks (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only every
few activations.

This means that updating util_est at every dequeue potentially introduces
unnecessary overheads and it's also conceptually wrong if the util_avg
signal has never been updated during a task activation.

Let's introduce a throttling mechanism on task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even values only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.

Such a throttling mechanism allows us to keep util_est overheads in the
wakeup hot path under control, thus making it suitable to be enabled also
on high-intensity workload systems.
For this reason, this patch now switches on the estimated utilization
scheduler feature by default.
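
Purely as an illustration of the LSB encoding described above (a sketch with
made-up helper names; only UTIL_EST_NEED_UPDATE_FLAG matches the patch below):

        /* Illustrative sketch of the LSB encoding, not the actual patch code */
        #define UTIL_EST_NEED_UPDATE_FLAG 0x1

        /* At dequeue time: save util_avg with the LSB set ("util_avg not yet updated") */
        static inline unsigned int sketch_util_est_encode(unsigned int util_avg)
        {
                return util_avg | UTIL_EST_NEED_UPDATE_FLAG;
        }

        /* From __update_load_avg_se(): clear the LSB once util_avg has been updated */
        static inline unsigned int sketch_util_est_mark_updated(unsigned int enqueued)
        {
                return enqueued & ~UTIL_EST_NEED_UPDATE_FLAG;
        }

        /* At the next dequeue: skip the util_est update if the LSB is still set */
        static inline int sketch_util_est_skip_update(unsigned int enqueued)
        {
                return enqueued & UTIL_EST_NEED_UPDATE_FLAG;
        }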

Suggested-by: Chris Redpath <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>

---
Changes in v5:
- set SCHED_FEAT(UTIL_EST, true) as default (Peter)
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 2 +-
2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8364771f7301..1bf9a86ebc39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3047,6 +3047,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
}
}

+/*
+ * When a task is dequeued, its estimated utilization should not be updated if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.enqueued).
+ */
+#define UTIL_EST_NEED_UPDATE_FLAG 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+ if (sched_feat(UTIL_EST)) {
+ struct util_est ue = READ_ONCE(avg->util_est);
+
+ if (!(ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG))
+ return;
+
+ /* Reset flag to report util_avg has been updated */
+ ue.enqueued &= ~UTIL_EST_NEED_UPDATE_FLAG;
+ WRITE_ONCE(avg->util_est, ue);
+ }
+}
+
#ifdef CONFIG_SMP
/*
* Approximate:
@@ -3308,6 +3331,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ cfs_se_util_change(&se->avg);
return 1;
}

@@ -5218,7 +5242,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,

/* Update root cfs_rq's estimated utilization */
enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
- enqueued += _task_util_est(p);
+ enqueued += (_task_util_est(p) | 0x1);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

@@ -5310,7 +5334,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
if (cfs_rq->nr_running) {
ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
ue.enqueued -= min_t(unsigned int, ue.enqueued,
- _task_util_est(p));
+ (_task_util_est(p) | UTIL_EST_NEED_UPDATE_FLAG));
}
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

@@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
if (!task_sleep)
return;

+ /*
+ * Skip update of task's estimated utilization if the PELT signal has
+ * never been updated (at least once) since last enqueue time.
+ */
+ ue = READ_ONCE(p->se.avg.util_est);
+ if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
+ return;
+
/*
* Skip update of task's estimated utilization when its EWMA is
* already ~1% close to its last activation value.
*/
- ue = READ_ONCE(p->se.avg.util_est);
- ue.enqueued = task_util(p);
+ ue.enqueued = (task_util(p) | UTIL_EST_NEED_UPDATE_FLAG);
last_ewma_diff = ue.enqueued - ue.ewma;
if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c459a4b61544..85ae8488039c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,4 +89,4 @@ SCHED_FEAT(WA_BIAS, true)
/*
* UtilEstimation. Use estimated CPU utilization.
*/
-SCHED_FEAT(UTIL_EST, false)
+SCHED_FEAT(UTIL_EST, true)
--
2.15.1


2018-02-22 17:04:17

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v5 3/4] sched/cpufreq_schedutil: use util_est for OPP selection

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.

For example, since the task utilization is decayed at wakeup time, a
long sleeping big task newly enqueued does not add immediately a
significant contribution to the target CPU. This introduces some latency
before schedutil will be able to detect the best frequency required by
that task.

Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time will
increase even more and further delay the selection of the actual
frequency which better serves the task's requirements.

In order to reduce this kind of latency, we integrate the usage
of the CPU's estimated utilization in the sugov_get_util function.
This allows us to properly consider the expected utilization of a CPU which,
for example, has just got a big task running after a long sleep period.
Ultimately this allows us to select the best frequency to run a task
right after its wake-up.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v5:
- add missing READ_ONCE() barrieres
- add acked-by Rafael tag

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- use util_est.enqueued for cfs_rq's util_est (Joel)
- simplify cpu_util_cfs() integration (Dietmar)

Changes in v3:
- rebase on today's tip/sched/core (commit 07881166a892)
- moved into Juri's cpu_util_cfs(), which should also
address Rafael's suggestion to use a local variable.

Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/sched.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc6c8b5a24ad..ce33a5649bf2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2122,7 +2122,12 @@ static inline unsigned long cpu_util_dl(struct rq *rq)

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
- return rq->cfs.avg.util_avg;
+ if (!sched_feat(UTIL_EST))
+ return READ_ONCE(rq->cfs.avg.util_avg);
+
+ return max_t(unsigned long,
+ READ_ONCE(rq->cfs.avg.util_avg),
+ READ_ONCE(rq->cfs.avg.util_est.enqueued));
}

#endif
--
2.15.1


2018-02-22 17:04:35

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v5 2/4] sched/fair: use util_est in LB and WU paths

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in cpu_util_wake as well as in update_sg_lb_stats.

The estimated utilization of a CPU is defined to be the maximum between
its PELT utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which, for
example, has just got a big task running after a long sleep period.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v5:
- always use int instead of long whenever possible (Peter)
- add missing READ_ONCE barriers (Peter)

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- ensure cpu_util_wake() is clamped to cpu_capacity_orig() (Pavan)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)

Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 76 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8526687f107..8364771f7301 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6445,6 +6445,41 @@ static unsigned long cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}

+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running since a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked utilization" in
+ * describing the potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+ unsigned int util, util_est;
+ unsigned int capacity;
+ struct cfs_rq *cfs_rq;
+
+ if (!sched_feat(UTIL_EST))
+ return cpu_util(cpu);
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = READ_ONCE(cfs_rq->avg.util_avg);
+ util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ util_est = max(util, util_est);
+
+ capacity = capacity_orig_of(cpu);
+ util_est = min(util_est, capacity);
+
+ return util_est;
+}
+
static inline unsigned long task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
@@ -6469,16 +6504,52 @@ static inline unsigned long task_util_est(struct task_struct *p)
*/
static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ unsigned int util, util_est;
+ unsigned int capacity;

/* Task has no contribution or is new */
if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
- return cpu_util(cpu);
+ return cpu_util_est(cpu);
+
+ /* Discount task's blocked util from CPU's util */
+ util = cpu_util(cpu);
+ util -= min_t(unsigned int, util, task_util(p));
+
+ if (!sched_feat(UTIL_EST))
+ return util;
+
+ /*
+ * Covered cases:
+ * - if *p is the only task sleeping on this CPU, then:
+ * cpu_util (== task_util) > util_est (== 0)
+ * and thus we return:
+ * cpu_util_wake = (cpu_util - task_util) = 0
+ *
+ * - if other tasks are SLEEPING on the same CPU, which is just waking
+ * up, then:
+ * cpu_util >= task_util
+ * cpu_util > util_est (== 0)
+ * and thus we discount *p's blocked utilization to return:
+ * cpu_util_wake = (cpu_util - task_util) >= 0
+ *
+ * - if other tasks are RUNNABLE on that CPU and
+ * util_est > cpu_util
+ * then we use util_est since it returns a more restrictive
+ * estimation of the spare capacity on that CPU, by just considering
+ * the expected utilization of tasks already runnable on that CPU.
+ */
+ util_est = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued);
+ util = max(util, util_est);

+ /*
+ * Estimated utilization can exceed the CPU capacity, thus let's clamp
+ * to the maximum CPU capacity to ensure consistency with other
+ * cpu_util[_est] calls.
+ */
capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ util = min(util, capacity);

- return (util >= capacity) ? capacity : util;
+ return util;
}

/*
@@ -8005,7 +8076,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);

sgs->group_load += load;
- sgs->group_util += cpu_util(i);
+ sgs->group_util += cpu_util_est(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

nr_running = rq->nr_running;
--
2.15.1


2018-02-26 04:05:40

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] sched/cpufreq_schedutil: use util_est for OPP selection

On 22-02-18, 17:01, Patrick Bellasi wrote:
> When schedutil looks at the CPU utilization, the current PELT value for
> that CPU is returned straight away. In certain scenarios this can have
> undesired side effects and delays on frequency selection.
>
> For example, since the task utilization is decayed at wakeup time, a
> long sleeping big task newly enqueued does not add immediately a
> significant contribution to the target CPU. This introduces some latency
> before schedutil will be able to detect the best frequency required by
> that task.
>
> Moreover, the PELT signal build-up time is a function of the current
> frequency, because of the scale-invariant load tracking support. Thus,
> starting from a lower frequency, the utilization build-up time will
> increase even more and further delay the selection of the actual
> frequency which better serves the task's requirements.
>
> In order to reduce this kind of latency, we integrate the usage
> of the CPU's estimated utilization in the sugov_get_util function.
> This allows us to properly consider the expected utilization of a CPU which,
> for example, has just got a big task running after a long sleep period.
> Ultimately this allows us to select the best frequency to run a task
> right after its wake-up.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Acked-by: Rafael J. Wysocki <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v5:
> - add missing READ_ONCE() barrieres
> - add acked-by Rafael tag
>
> Changes in v4:
> - rebased on today's tip/sched/core (commit 460e8c3340a2)
> - use util_est.enqueued for cfs_rq's util_est (Joel)
> - simplify cpu_util_cfs() integration (Dietmar)
>
> Changes in v3:
> - rebase on today's tip/sched/core (commit 07881166a892)
> - moved into Juri's cpu_util_cfs(), which should also
> address Rafael's suggestion to use a local variable.
>
> Changes in v2:
> - rebase on top of v4.15-rc2
> - tested that overhauled PELT code does not affect the util_est
> ---
> kernel/sched/sched.h | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index dc6c8b5a24ad..ce33a5649bf2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2122,7 +2122,12 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>
> static inline unsigned long cpu_util_cfs(struct rq *rq)
> {
> - return rq->cfs.avg.util_avg;
> + if (!sched_feat(UTIL_EST))
> + return READ_ONCE(rq->cfs.avg.util_avg);
> +
> + return max_t(unsigned long,
> + READ_ONCE(rq->cfs.avg.util_avg),
> + READ_ONCE(rq->cfs.avg.util_est.enqueued));
> }

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-03-01 17:43:52

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

This is missing the below #ifdef guards; adding them here as a note for
the next respin on the list.

On 22-Feb 17:01, Patrick Bellasi wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e1febd252a84..c8526687f107 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5205,6 +5205,23 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>

#ifdef CONFIG_SMP

> +static inline unsigned long task_util(struct task_struct *p);
> +static inline unsigned long _task_util_est(struct task_struct *p);
> +
> +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> + struct task_struct *p)
> +{
> + unsigned int enqueued;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /* Update root cfs_rq's estimated utilization */
> + enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + enqueued += _task_util_est(p);
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> +}
> +

#else
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
}
#endif /* CONFIG_SMP */

> /*
> * The enqueue_task method is called before nr_running is
> * increased. Here we update the fair scheduling stats and
> @@ -5257,9 +5274,86 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se)
> add_nr_running(rq, 1);
>
> + util_est_enqueue(&rq->cfs, p);
> hrtick_update(rq);
> }
>
> +/*
> + * Check if a (signed) value is within a specified (unsigned) margin,
> + * based on the observation that:
> + * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
> + *
> + * NOTE: this only works when value + margin < INT_MAX.
> + */
> +static inline bool within_margin(int value, int margin)
> +{
> + return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
> +}
> +
> +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> + struct task_struct *p,
> + bool task_sleep)
> +{

#ifdef CONFIG_SMP

> + long last_ewma_diff;
> + struct util_est ue;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> + * Update root cfs_rq's estimated utilization
> + *
> + * If *p is the last task then the root cfs_rq's estimated utilization
> + * of a CPU is 0 by definition.
> + */
> + ue.enqueued = 0;
> + if (cfs_rq->nr_running) {
> + ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + ue.enqueued -= min_t(unsigned int, ue.enqueued,
> + _task_util_est(p));
> + }
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
> +
> + /*
> + * Skip update of task's estimated utilization when the task has not
> + * yet completed an activation, e.g. being migrated.
> + */
> + if (!task_sleep)
> + return;
> +
> + /*
> + * Skip update of task's estimated utilization when its EWMA is
> + * already ~1% close to its last activation value.
> + */
> + ue = READ_ONCE(p->se.avg.util_est);
> + ue.enqueued = task_util(p);
> + last_ewma_diff = ue.enqueued - ue.ewma;
> + if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> + return;
> +
> + /*
> + * Update Task's estimated utilization
> + *
> + * When *p completes an activation we can consolidate another sample
> + * of the task size. This is done by storing the current PELT value
> + * as ue.enqueued and by using this value to update the Exponential
> + * Weighted Moving Average (EWMA):
> + *
> + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> + * = w * ( last_ewma_diff ) + ewma(t-1)
> + * = w * (last_ewma_diff + ewma(t-1) / w)
> + *
> + * Where 'w' is the weight of new samples, which is configured to be
> + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> + */
> + ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
> + ue.ewma += last_ewma_diff;
> + ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
> + WRITE_ONCE(p->se.avg.util_est, ue);

#endif /* CONFIG_SMP */

> +}
> +
> static void set_next_buddy(struct sched_entity *se);
>
> /*
> @@ -5316,6 +5410,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se)
> sub_nr_running(rq, 1);
>
> + util_est_dequeue(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> }
>

--
#include <best/regards.h>

Patrick Bellasi

2018-03-01 17:48:58

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates

The changelog is missing the below CCs. :(

Since that's a new patch in this series, I expect some feedback and
thus I'll add them on the next respin.

On 22-Feb 17:01, Patrick Bellasi wrote:
> The estimated utilization of a task is currently updated every time the
> task is dequeued. However, to keep overheads under control, PELT signals
> are effectively updated at maximum once every 1ms.
>
> Thus, for really short running tasks, it can happen that their util_avg
> value has not been updated since their last enqueue. If such tasks are
> also frequently running tasks (e.g. the kind of workload generated by
> hackbench) it can also happen that their util_avg is updated only every
> few activations.
>
> This means that updating util_est at every dequeue potentially introduces
> unnecessary overheads and it's also conceptually wrong if the util_avg
> signal has never been updated during a task activation.
>
> Let's introduce a throttling mechanism on task's util_est updates
> to sync them with util_avg updates. To make the solution memory
> efficient, both in terms of space and load/store operations, we encode a
> synchronization flag into the LSB of util_est.enqueued.
> This makes util_est an even values only metric, which is still
> considered good enough for its purpose.
> The synchronization bit is (re)set by __update_load_avg_se() once the
> PELT signal of a task has been updated during its last activation.
>
> Such a throttling mechanism allows us to keep util_est overheads in the
> wakeup hot path under control, thus making it suitable to be enabled also
> on high-intensity workload systems.
> For this reason, this patch now switches on the estimated utilization
> scheduler feature by default.
>
> Suggested-by: Chris Redpath <[email protected]>
> Signed-off-by: Patrick Bellasi <[email protected]>

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]

[...]

--
#include <best/regards.h>

Patrick Bellasi

2018-03-06 18:57:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> +/**
> + * Estimation Utilization for FAIR tasks.
> + *
> + * Support data structure to track an Exponential Weighted Moving Average
> + * (EWMA) of a FAIR task's utilization. New samples are added to the moving
> + * average each time a task completes an activation. Sample's weight is
> + * chosen so that the EWMA will be relatively insensitive to transient changes
> + * to the task's workload.
> + *
> + * @enqueued: instantaneous estimated utilization of a task/cpu
> + * task: the task's util_avg at last task dequeue time
> + * cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
> + *
> + * Thus, the util_est.enqueued of a task represents the contribution on the
> + * estimated utilization of the CPU where that task is currently enqueued.
> + *
> + * @ewma: the Exponential Weighted Moving Average (EWMA) utilization of a task
> + * Only for tasks we track a moving average of the past instantaneous
> + * estimated utilization. This allows to absorb sporadic drops in
> + * utilization of an otherwise almost periodic task.
> + *
> + */

The above comment appears to have whitespace issues; the paragraph
starting with "Thus" looks indented by one character, for example.

2018-03-06 19:00:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> + struct task_struct *p)
> +{
> + unsigned int enqueued;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /* Update root cfs_rq's estimated utilization */
> + enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + enqueued += _task_util_est(p);
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> +}

> +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> + struct task_struct *p,
> + bool task_sleep)
> +{
> + long last_ewma_diff;
> + struct util_est ue;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> + * Update root cfs_rq's estimated utilization
> + *
> + * If *p is the last task then the root cfs_rq's estimated utilization
> + * of a CPU is 0 by definition.
> + */
> + ue.enqueued = 0;
> + if (cfs_rq->nr_running) {
> + ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + ue.enqueued -= min_t(unsigned int, ue.enqueued,
> + _task_util_est(p));
> + }
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

It appears to me this isn't a stable situation and completely relies on
the !nr_running case to recalibrate. If we ensure that doesn't happen
for a significant while the sum can run-away, right?

Should we put a max in enqueue to avoid this?

2018-03-06 19:04:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> +struct util_est {
> + unsigned int enqueued;
> + unsigned int ewma;
> +#define UTIL_EST_WEIGHT_SHIFT 2
> +};

> + ue = READ_ONCE(p->se.avg.util_est);

> + WRITE_ONCE(p->se.avg.util_est, ue);

That is actually quite dodgy... and relies on the fact that we have the
8 byte case in __write_once_size() and __read_once_size()
unconditionally. It then further relies on the compiler DTRT for 32bit
platforms, which is generating 2 32bit loads/stores.

The advantage is of course that it will use single u64 loads/stores
where available.


2018-03-07 09:41:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Tue, Mar 06, 2018 at 07:58:51PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > + struct task_struct *p)
> > +{
> > + unsigned int enqueued;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /* Update root cfs_rq's estimated utilization */
> > + enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > + enqueued += _task_util_est(p);
> > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > +}

> It appears to me this isn't a stable situation and completely relies on
> the !nr_running case to recalibrate. If we ensure that doesn't happen
> for a significant while the sum can run-away, right?
>
> Should we put a max in enqueue to avoid this?

Thinking about this a bit more; would it make sense to adjust the
running sum/avg on migration? Something along the lines of:

util_avg = se->load_avg / (cfs_rq->load_avg + se->load_avg);

(which disregards cgroups), because that should more or less be the time
it ends up running, given the WFQ rule.

That way the disparity between tasks migrating into the CPU at u=1 and
them going to sleep at u<1 is much smaller and the above sum doesn't run
away nearly as wild (it still needs some upper bound though).

2018-03-07 10:13:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] sched/cpufreq_schedutil: use util_est for OPP selection

On Thu, Feb 22, 2018 at 05:01:52PM +0000, Patrick Bellasi wrote:
> static inline unsigned long cpu_util_cfs(struct rq *rq)
> {
> + if (!sched_feat(UTIL_EST))
> + return READ_ONCE(rq->cfs.avg.util_avg);
> +
> + return max_t(unsigned long,
> + READ_ONCE(rq->cfs.avg.util_avg),
> + READ_ONCE(rq->cfs.avg.util_est.enqueued));
> }

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

if (sched_feat(UTIL_EST)) {
util = max_t(unsigned long, util,
READ_ONCE(rq->cfs.avg.util_est.enqueued));
}

return util;
}

Seems like a more readable variant.

2018-03-07 10:40:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates

On Thu, Feb 22, 2018 at 05:01:53PM +0000, Patrick Bellasi wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8364771f7301..1bf9a86ebc39 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3047,6 +3047,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
> }
> }
>
> +/*
> + * When a task is dequeued, its estimated utilization should not be updated if
> + * its util_avg has not been updated at least once.
> + * This flag is used to synchronize util_avg updates with util_est updates.
> + * We map this information into the LSB bit of the utilization saved at
> + * dequeue time (i.e. util_est.enqueued).
> + */
> +#define UTIL_EST_NEED_UPDATE_FLAG 0x1
> +
> +static inline void cfs_se_util_change(struct sched_avg *avg)
> +{
if (!sched_feat(UTIL_EST))
return;

> + if (sched_feat(UTIL_EST)) {
> + struct util_est ue = READ_ONCE(avg->util_est);
> +
> + if (!(ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG))
> + return;
> +
> + /* Reset flag to report util_avg has been updated */
> + ue.enqueued &= ~UTIL_EST_NEED_UPDATE_FLAG;
> + WRITE_ONCE(avg->util_est, ue);
> + }

and loose the indent. Also, since we only update the enqueued value, we
don't need to load/store the entire util_est thing here.

> +}
> +
> #ifdef CONFIG_SMP
> /*
> * Approximate:
> @@ -3308,6 +3331,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit
> cfs_rq->curr == se)) {
>
> ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
> + cfs_se_util_change(&se->avg);
> return 1;
> }
>

So we only clear the bit for @se updates.

> @@ -5218,7 +5242,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
>
> /* Update root cfs_rq's estimated utilization */
> enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> - enqueued += _task_util_est(p);
> + enqueued += (_task_util_est(p) | 0x1);

UTIL_EST_NEED_UPDATE_FLAG, although I do agree that 0x1 is much easier
to type ;-)

But you set it for the cfs_rq value ?! That doesn't seem right.

> WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> }
>
> @@ -5310,7 +5334,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> if (cfs_rq->nr_running) {
> ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> ue.enqueued -= min_t(unsigned int, ue.enqueued,
> - _task_util_est(p));
> + (_task_util_est(p) | UTIL_EST_NEED_UPDATE_FLAG));
> }
> WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
>

Would it be really horrible if you separate the value and flag using a
bitfield/shifts?

> @@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> if (!task_sleep)
> return;
>
> + /*
> + * Skip update of task's estimated utilization if the PELT signal has
> + * never been updated (at least once) since last enqueue time.
> + */
> + ue = READ_ONCE(p->se.avg.util_est);
> + if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
> + return;
> +
> /*
> * Skip update of task's estimated utilization when its EWMA is
> * already ~1% close to its last activation value.
> */
> + ue.enqueued = (task_util(p) | UTIL_EST_NEED_UPDATE_FLAG);
> last_ewma_diff = ue.enqueued - ue.ewma;
> if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> return;


I see what you do, but Yuck! that's really nasty. Then again, I've not
actually got a better suggestion.



2018-03-07 11:33:06

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On 06-Mar 19:58, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > + struct task_struct *p)
> > +{
> > + unsigned int enqueued;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /* Update root cfs_rq's estimated utilization */
> > + enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > + enqueued += _task_util_est(p);
> > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > +}
>
> > +static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > + struct task_struct *p,
> > + bool task_sleep)
> > +{
> > + long last_ewma_diff;
> > + struct util_est ue;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /*
> > + * Update root cfs_rq's estimated utilization
> > + *
> > + * If *p is the last task then the root cfs_rq's estimated utilization
> > + * of a CPU is 0 by definition.
> > + */
> > + ue.enqueued = 0;
> > + if (cfs_rq->nr_running) {
> > + ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > + ue.enqueued -= min_t(unsigned int, ue.enqueued,
> > + _task_util_est(p));
> > + }
> > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
>
> It appears to me this isn't a stable situation and completely relies on
> the !nr_running case to recalibrate. If we ensure that doesn't happen
> for a significant while the sum can run-away, right?

By "run away" do you mean going over 1024 or overflowing the unsigned int storage?

In the first case, I think we don't care about exceeding 1024 since:
- we cap to capacity_orig_of in cpu_util_est
- by directly reading the cfs_rq->avg.util_est.enqueued we can
actually detect conditions in which a CPU is over-saturated.

In the second case, with an unsigned int we can enqueue up to a few
million 100% tasks on a single CPU without overflowing.

> Should we put a max in enqueue to avoid this?

IMO the capping from the cpu_util_est getter should be enough...

Maybe I'm missing your point here?

--
#include <best/regards.h>

Patrick Bellasi

2018-03-07 11:48:23

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On 06-Mar 20:02, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > +struct util_est {
> > + unsigned int enqueued;
> > + unsigned int ewma;
> > +#define UTIL_EST_WEIGHT_SHIFT 2
> > +};
>
> > + ue = READ_ONCE(p->se.avg.util_est);
>
> > + WRITE_ONCE(p->se.avg.util_est, ue);
>
> That is actually quite dodgy... and relies on the fact that we have the
> 8 byte case in __write_once_size() and __read_once_size()
> unconditionally. It then further relies on the compiler DTRT for 32bit
> platforms, which is generating 2 32bit loads/stores.
>
> The advantage is of course that it will use single u64 loads/stores
> where available.

Yes, that's mainly an "optimization" for 64bit targets... but perhaps
the benefits are negligible.

Do you prefer to keep more "under control" the generated code by using
two {READ,WRITE}_ONCEs?

IMO here we can also go with just the WRITE_ONCEs. I don't see a case
for the compiler to mangle load/store. While the WRITE_ONCE are still
required to sync with non rq-lock serialized code.
But... maybe I'm missing something... ?

--
#include <best/regards.h>

Patrick Bellasi

2018-03-07 12:26:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Wed, Mar 07, 2018 at 11:31:49AM +0000, Patrick Bellasi wrote:
> > It appears to me this isn't a stable situation and completely relies on
> > the !nr_running case to recalibrate. If we ensure that doesn't happen
> > for a significant while the sum can run-away, right?
>
> By "run away" do you mean going over 1024 or overflowing the unsigned int storage?

The latter, I think you can make it arbitrarily large. Have a busy task
on CPU0, this ensure !nr_running never happens.

Start a busy task on CPU1, wait for it to hit u=1, then migrate it to
CPU0, then wait for it to hit u=.5 then kill it, this effectively adds
.5 to the enqueued value, repeat indefinitely.

2018-03-07 12:27:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Wed, Mar 07, 2018 at 11:47:11AM +0000, Patrick Bellasi wrote:
> On 06-Mar 20:02, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > > +struct util_est {
> > > + unsigned int enqueued;
> > > + unsigned int ewma;
> > > +#define UTIL_EST_WEIGHT_SHIFT 2
> > > +};
> >
> > > + ue = READ_ONCE(p->se.avg.util_est);
> >
> > > + WRITE_ONCE(p->se.avg.util_est, ue);
> >
> > That is actually quite dodgy... and relies on the fact that we have the
> > 8 byte case in __write_once_size() and __read_once_size()
> > unconditionally. It then further relies on the compiler DTRT for 32bit
> > platforms, which is generating 2 32bit loads/stores.
> >
> > The advantage is of course that it will use single u64 loads/stores
> > where available.
>
> Yes, that's mainly an "optimization" for 64bit targets... but perhaps
> the benefits are negligible.
>
> Do you prefer to keep more "under control" the generated code by using
> two {READ,WRITE}_ONCEs?
>
> IMO here we can also go with just the WRITE_ONCEs. I don't see a case
> for the compiler to mangle load/store. While the WRITE_ONCE are still
> required to sync with non rq-lock serialized code.
> But... maybe I'm missing something... ?

I'm not sure we rely on READ/WRITE_ONCE() of 64bit variables on 32bit
targets to be sane anywhere else (we could be, I just don't know).

I suspect it all works as expected... but it's a tad tricky.

2018-03-07 12:33:37

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On 06-Mar 19:56, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > +/**
> > + * Estimation Utilization for FAIR tasks.
> > + *
> > + * Support data structure to track an Exponential Weighted Moving Average
> > + * (EWMA) of a FAIR task's utilization. New samples are added to the moving
> > + * average each time a task completes an activation. Sample's weight is
> > + * chosen so that the EWMA will be relatively insensitive to transient changes
> > + * to the task's workload.
> > + *
> > + * @enqueued: instantaneous estimated utilization of a task/cpu
> > + * task: the task's util_avg at last task dequeue time
> > + * cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
> > + *
> > + * Thus, the util_est.enqueued of a task represents the contribution on the
> > + * estimated utilization of the CPU where that task is currently enqueued.
> > + *
> > + * @ewma: the Exponential Weighted Moving Average (EWMA) utilization of a task
> > + * Only for tasks we track a moving average of the past instantaneous
> > + * estimated utilization. This allows to absorb sporadic drops in
> > + * utilization of an otherwise almost periodic task.
> > + *
> > + */
>
> The above comment appears to have whitespace issues, the paragraph
> starting with "Thus" looks indented by one character for example.

That was actually intentional... I wanted to keep it aligned after the
"@" to better mark the paragraphs describing the struct members.

However, I've just noticed the overall format is not sphinx-valid.
Thus, I'll update it to also ensure that the documentation is properly
generated.
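
Something along these lines, as a first sketch of the kernel-doc
layout (wording lifted from the current comment):

/**
 * struct util_est - Estimation utilization of FAIR tasks
 * @enqueued: instantaneous estimated utilization of a task/cpu
 * @ewma:     the Exponential Weighted Moving Average (EWMA)
 *            utilization of a task
 *
 * Support data structure to track an Exponential Weighted Moving
 * Average (EWMA) of a FAIR task's utilization. New samples are added
 * to the moving average each time a task completes an activation.
 * Samples' weights are chosen so that the EWMA is relatively
 * insensitive to transient changes to the task's workload.
 */
struct util_est {
	unsigned int		enqueued;
	unsigned int		ewma;
#define UTIL_EST_WEIGHT_SHIFT	2
};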

--
#include <best/regards.h>

Patrick Bellasi

2018-03-07 15:19:16

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On 07-Mar 13:26, Peter Zijlstra wrote:
> On Wed, Mar 07, 2018 at 11:47:11AM +0000, Patrick Bellasi wrote:
> > On 06-Mar 20:02, Peter Zijlstra wrote:
> > > On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > > > +struct util_est {
> > > > + unsigned int enqueued;
> > > > + unsigned int ewma;
> > > > +#define UTIL_EST_WEIGHT_SHIFT 2
> > > > +};
> > >
> > > > + ue = READ_ONCE(p->se.avg.util_est);
> > >
> > > > + WRITE_ONCE(p->se.avg.util_est, ue);
> > >
> > > That is actually quite dodgy... and relies on the fact that we have the
> > > 8 byte case in __write_once_size() and __read_once_size()
> > > unconditionally. It then further relies on the compiler DTRT for 32bit
> > > platforms, which is generating 2 32bit loads/stores.
> > >
> > > The advantage is of course that it will use single u64 loads/stores
> > > where available.
> >
> > Yes, that's mainly an "optimization" for 64bit targets... but perhaps
> > the benefits are negligible.
> >
> > Do you prefer to keep the generated code more "under control" by using
> > two {READ,WRITE}_ONCEs?

Any specific preference on this previous point?

> > IMO here we can also go with just the WRITE_ONCEs: I don't see a case
> > where the compiler could mangle the load/store, while the WRITE_ONCEs
> > are still required to sync with non rq-lock serialized code.
> > But... maybe I'm missing something?
>
> I'm not sure we rely on READ/WRITE_ONCE() of 64bit variables on 32bit
> targets to be sane anywhere else (we could be, I just don't know).

My understanding is that, since here we are in an rq-lock protected
section, and these vars can be written only from within this section,
the load is a dependency for the store and the compiler cannot screw
it up...

> I suspect it all works as expected... but it's a tad tricky.

Then let's keep them for the time being... meanwhile I'll try to get
some more "internal" feedback before the next posting.

--
#include <best/regards.h>

Patrick Bellasi

2018-03-07 15:39:06

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On 07-Mar 10:39, Peter Zijlstra wrote:
> On Tue, Mar 06, 2018 at 07:58:51PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 22, 2018 at 05:01:50PM +0000, Patrick Bellasi wrote:
> > > +static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > > + struct task_struct *p)
> > > +{
> > > + unsigned int enqueued;
> > > +
> > > + if (!sched_feat(UTIL_EST))
> > > + return;
> > > +
> > > + /* Update root cfs_rq's estimated utilization */
> > > + enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > > + enqueued += _task_util_est(p);
> > > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > > +}
>
> > It appears to me this isn't a stable situation and completely relies on
> > the !nr_running case to recalibrate. If we ensure that doesn't happen
> > for a significant while the sum can run-away, right?
> >
> > Should we put a max in enqueue to avoid this?
>
> Thinking about this a bit more; would it make sense to adjust the
> running sum/avg on migration? Something along the lines of:
>
> util_avg = se->load_avg / (cfs_rq->load_avg + se->load_avg);
>
> (which disregards cgroups), because that should more or less be the time
> it ends up running, given the WFQ rule.

I would say it makes sense from a pure mechanism standpoint, but I'm
not entirely convinced it is useful from a practical standpoint.

First of all, that should be applied only when we migrate to a more
saturated CPU; otherwise, when migrating to an empty CPU, we would set
util_avg = 100%.

Secondly, when we migrate to a saturated CPU, this adjustment will
contribute to under-estimating the task's utilization.
Let's say the task was running on a completely empty CPU, and thus was
able to ramp up without being preempted. This value represents a good
estimation of the task's (most recent) CPU demands.

Now, if on a following activation we wake up the task on an IDLE CPU
with a lot of blocked load, then we will scale down its util_avg and
assume the task is smaller.
But:

a) if the blocked load does not turn into some task waking up again,
   underestimating the task only introduces further ramp-up latencies

b) if the load is due to really active tasks, the task will be
   preempted and its utilization will be smaller... but we are already
   in a domain where utilization does not tell us anything useful for
   a task... so why bother making it converge sooner?


> That way the disparity between tasks migrating into the CPU at u=1 and
> them going to sleep at u<1 is much smaller and the above sum doesn't run
> away nearly as wild (it still needs some upper bound though).
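
Just to be sure we are talking about the same adjustment, here is a
minimal sketch of how I read the proposal (the helper name and the
scaling by SCHED_CAPACITY_SCALE are my assumptions):

/*
 * Sketch only: scale the migrating task's util_avg by the CPU time
 * share it would get on the destination CPU, following the WFQ rule
 * quoted above (cgroups disregarded). Note that with an empty
 * destination rq (rq_load == 0) this resets util_avg to 100%.
 */
static void util_avg_adjust_on_migration(struct sched_entity *se,
					 struct cfs_rq *dst_cfs_rq)
{
	unsigned long se_load = se->avg.load_avg;
	unsigned long rq_load = dst_cfs_rq->avg.load_avg;

	if (!se_load)
		return;

	se->avg.util_avg = (se_load * SCHED_CAPACITY_SCALE) /
			   (rq_load + se_load);
}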

--
#include <best/regards.h>

Patrick Bellasi

2018-03-07 17:20:29

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On 07-Mar 13:24, Peter Zijlstra wrote:
> On Wed, Mar 07, 2018 at 11:31:49AM +0000, Patrick Bellasi wrote:
> > > It appears to me this isn't a stable situation and completely relies on
> > > the !nr_running case to recalibrate. If we ensure that doesn't happen
> > > for a significant while the sum can run-away, right?
> >
> > By away you mean go over 1024 or overflow the unsigned int storage?
>
> The latter; I think you can make it arbitrarily large. Have a busy task
> on CPU0, which ensures !nr_running never happens.
>
> Start a busy task on CPU1, wait for it to hit u=1, then migrate it to
> CPU0,

At this point util_est(CPU0) = 2048, which is:

+1024 for the busy running task,
      assuming it has been enqueued with that utilization since the beginning
+1024 for the newly migrated task from CPU1,
      which is enqueued with the value it reached at dequeue time on CPU1

> then wait for it to hit u=.5 then kill it,

... but when we kill it, the task is dequeued, and thus we remove
1024.

Maybe that's the tricky bit: we remove the value we enqueued, _not_
the current util_avg. Notice we use _task_util_est(p)... with the
leading "_".

> this effectively adds
> .5 to the enqueued value, repeat indefinitely.

Thus this should not happen.

Basically, the RQ's util_est is the sum of the RUNNABLE tasks' util_est
at their enqueue time... which has been updated at their last dequeue
time, hence the usage of the name "dequeued" for both tasks and rqs.
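
To put it in code, this is a stripped-down sketch of the pairing (the
EWMA update, the feature checks and the flag from the last patch are
left out):

static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
				    struct task_struct *p)
{
	unsigned int enqueued;

	/* Add the value recorded at the task's last dequeue time */
	enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
	enqueued += _task_util_est(p);
	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
				    struct task_struct *p)
{
	unsigned int enqueued;

	/*
	 * Remove exactly the same contribution: the task's util_est is
	 * refreshed only after this point, so the value subtracted here
	 * matches the one added at enqueue time.
	 */
	enqueued  = READ_ONCE(cfs_rq->avg.util_est.enqueued);
	enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

Since the same _task_util_est(p) value is used on both sides, the
rq-level sum cannot grow unbounded as long as enqueues and dequeues
pair up.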

Does it make sense now?

--
#include <best/regards.h>

Patrick Bellasi

2018-03-07 17:36:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] sched/fair: add util_est on top of PELT

On Wed, Mar 07, 2018 at 03:24:58PM +0000, Patrick Bellasi wrote:

> Maybe that's the tricky bit: we remove the value we enqueued, _not_
> the current util_avg. Notice we use _task_util_est(p)... with the
> leading "_".

ARGH, ok let me try that again ;-)

2018-03-08 09:17:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates

On Wed, Mar 07, 2018 at 11:38:52AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:53PM +0000, Patrick Bellasi wrote:

> > @@ -5218,7 +5242,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> >
> > /* Update root cfs_rq's estimated utilization */
> > enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > - enqueued += _task_util_est(p);
> > + enqueued += (_task_util_est(p) | 0x1);
>
> UTIL_EST_NEED_UPDATE_FLAG, although I do agree that 0x1 is much easier
> to type ;-)
>
> But you set it for the cfs_rq value ?! That doesn't seem right.
>
> > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > }
> >
> > @@ -5310,7 +5334,7 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > if (cfs_rq->nr_running) {
> > ue.enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > ue.enqueued -= min_t(unsigned int, ue.enqueued,
> > - _task_util_est(p));
> > + (_task_util_est(p) | UTIL_EST_NEED_UPDATE_FLAG));
> > }
> > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
> >

OK, so you unconditionally set that bit here to make the add/sub match.
Clearly I wasn't having a good day yesterday.

2018-03-08 09:49:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates

On Thu, Feb 22, 2018 at 05:01:53PM +0000, Patrick Bellasi wrote:
> +#define UTIL_EST_NEED_UPDATE_FLAG 0x1

> @@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> if (!task_sleep)
> return;
>
> + /*
> + * Skip update of task's estimated utilization if the PELT signal has
> + * never been updated (at least once) since last enqueue time.
> + */
> + ue = READ_ONCE(p->se.avg.util_est);
> + if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
> + return;

The name and function seem inverted: if the flag is set, we do _NOT_
update util_est.

How about something like UTIL_EST_UNCHANGED ? That would give:

/*
 * If the PELT values haven't changed since enqueue time,
 * skip the util_est update.
 */
if (ue.enqueued & UTIL_EST_UNCHANGED)
	return;



2018-03-08 10:39:36

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] sched/fair: update util_est only on util_avg updates

On 08-Mar 10:48, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 05:01:53PM +0000, Patrick Bellasi wrote:
> > +#define UTIL_EST_NEED_UPDATE_FLAG 0x1
>
> > @@ -5321,12 +5345,19 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > if (!task_sleep)
> > return;
> >
> > + /*
> > + * Skip update of task's estimated utilization if the PELT signal has
> > + * never been updated (at least once) since last enqueue time.
> > + */
> > + ue = READ_ONCE(p->se.avg.util_est);
> > + if (ue.enqueued & UTIL_EST_NEED_UPDATE_FLAG)
> > + return;
>
> The name and function seem inverted: if the flag is set, we do _NOT_
> update util_est.

My reading was along the lines of "when the flag is set we need an
update of util_avg to collect a new util_est sample"...

... but I agree that's confusing... and unnecessarily long.

> How about something like UTIL_EST_UNCHANGED ? That would give:

I would prefer UTIL_AVG_UNCHANGED, since the flag is reset when we
have a change in util_avg, thus enabling util_est updates.

> /*
>  * If the PELT values haven't changed since enqueue time,
>  * skip the util_est update.
>  */
> if (ue.enqueued & UTIL_EST_UNCHANGED)
> 	return;
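
For completeness, this is how I see the flag's lifecycle with the new
name: set at dequeue time when the util_est sample is recorded, cleared
on the first following util_avg update, and checked at the next dequeue
to decide whether the EWMA update can be skipped. A sketch of the
clearing side (the helper name and its placement in the util_avg update
path are my assumptions):

#define UTIL_AVG_UNCHANGED 0x1

/*
 * Sketch: to be called from the util_avg update path, e.g. right
 * after ___update_load_avg() refreshes se->avg.util_avg, to clear
 * the flag and thus re-enable util_est updates at the next dequeue.
 */
static inline void cfs_se_util_change(struct sched_avg *avg)
{
	unsigned int enqueued = avg->util_est.enqueued;

	if (!(enqueued & UTIL_AVG_UNCHANGED))
		return;

	enqueued &= ~UTIL_AVG_UNCHANGED;
	WRITE_ONCE(avg->util_est.enqueued, enqueued);
}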

--
#include <best/regards.h>

Patrick Bellasi